Dr. Unknown, or how I learned to stop worrying and love the NULL

Dr. Unknown, or how I learned to stop worrying and love the NULL

Two months ago, I posted the first two parts of a series about NULL. After that, I went quiet. Much to do and little time was one excuse. But to be honest, I also lost interest. However, I felt I owe my readers to conclude the series, so I have now forced myself to write and publish the last parts before moving on to more interesting subjects J.

 

Before reading on, I suggest that you first read (or reread) the first and second part of this series, so that we’re all on the same level again.

 

Finished reading? Okay, good to see you back. J

 

How to turn logical into illogical

 

In his comment to the second part of this series, David Portas pointed out the danger of equating the behaviour of Unknown in SQL’s three-valued logic to the behaviour of humans when confronted with unknown (or rather: missing) data. The problem is that the behaviour of NULL and Unknown is defined such that it will mostly produce the logical result – but that defined behaviour is then in true computer-fashion applied to all situations. A computer lacks the brain power to assess when the rules stop to make sense.

 

Here are two examples where the logical treatment of NULL and Unknown, when applied to a specific situation, renders completely illogical results:

 

DECLARE @Var int;

SET @Var = NULL;

 

IF @Var = 8 OR @Var <> 8

  PRINT ‘It either is or isn”t 8’;

ELSE PRINT ‘Huh? It”s not 8, but it”s not not 8 either!’;

 

IF @Var = @Var

  PRINT ‘Duh, of course it”s equal to itself’;

ELSE PRINT ‘What the hey, it”s not even equal to itself!!’;

 

Both conditions in the code above are immediately recognisable as tautologies for us humans. But SQL Server doesn’t attempt to identify tautologies, it just does as instructed. So in the case of the first IF statement, it first evaluated each of the individual parts: “@Var = 8” is Unknown, since the value for @Var is missing and might or might not be 8. Likewise, the evaluation of “@Var <> 8” results in Unknown as well. And then the logic table that I explained in my last post is consulted to get a result of Unknown for “Unknown OR Unknown” – and hence the ELSE is executed (and the PRINTed output should actually have read “I’m not really sure whether it’s either equal or unequal to 8!” – which of course makes no sense either). Note also that changing “@Var <> 8” to “NOT(@Var <> 8)” will not change the outcome.

 

The second example takes the oddity yet a step further. Surely, anything is always to itself? Well, yeah, sure it is – but the specifications of the ANSI standard disallow SQL Server from knowing that. What SQL Server does here, is that first the variable reference is replaced by its value, so that the equation to evaluate becomes “NULL = NULL” – and there is of course no way to know if two missing values are equal to each other or not.

 

A different, slightly better hidden form of tautology is presented in this example:

 

CREATE TABLE Test

   (Col1 int NOT NULL PRIMARY KEY,

    Col2 int NULL,

    CHECK(Col2 < 10));

 

INSERT INTO Test (Col1, Col2)

SELECT 1, 3

UNION ALL

SELECT 2, NULL;

 

SELECT Col1, Col2

FROM   Test

WHERE  Col2 < 10;

go

 

For the row with Col1 = 2, the value of Col2 is missing – but because of the CHECK constraint, we know that it can never be 10 or more. And yet, SQL Server had to reject this row, based on the handling of NULL that I have been labelling as “logical” in my previous posts!

 

An idea that pops up in newsgroups from time to time, is to have the RDBMS recognise tautologies and react accordingly, regardless of whether the input is NULL or not. I must say that I am glad that this has, so far (touches wood), not been incorporated in either the ANSI standard or SQL Server. Sure, I do appreciate that the current situation leads to weird effects, but it is at least 100% logical. But extending the database with a list of recognised tautologies would introduce another weirdness, for I consider it to be highly unlikely that a database will ever be able to recognise all possible tautologies. Would you instantly recognise the WHERE clause below as a tautology?

 

WHERE Foo BETWEEN 0 AND Bar

OR    SQUARE(Foo) > Foo * Bar

OR    Bar <= 10;

 

And remember – even if you could, and if you could make a query parser do it too, I can always come up with a more complex way of saying “True”.

 

Unequal, but not distinct

 

Database developers often have to deal with testing if values are unequal – but they also have to deal with testing if values are distinct. Now you may think that “distinct” is the same as “unequal”, and you may even find that Wikipedia agrees with you – but in database land, “distinct” is in fact both distinct from and unequal to “unequal”.

 

The difference between equality and distinction, in database terms, is related to NULL values. For an equality comparison, the result is Unknown if either or both the arguments is NULL, as explained previously. For a distinction comparison, though, a NULL value is considered to be distinct from any non-NULL value, but the same as another NULL. It is easy to see this in practice – just run a query with a GROUP BY or a DISTINCT clause, and you will see that all NULL values are lumped together in a single group. (Unfortunately, Microsoft has also seen fit to use distinction rather than equality in its implementation of UNIQUE constraints, a clear violation of the ANSI standard that often forces us to implement awkward workarounds – vote here if you agree with me that this should be changed).

 

There are various occasion where a database developer will want to know if two values are distinct. A typical example would be a trigger to track changes and store them in a separate auditing table – if only two columns in a 40-column table are actually changed, you prefer not to waste log space on the other 38 columns. But a straight comparison between the old and the new value would only catch changes from a non-NULL to a different non-NULL value; changes from NULL to non-NULL and vice versa would not be audited since this comparison would result in Unknown. As would a non-change of a NULL value.

 

Unfortunately, Microsoft has not yet implemented the IS [NOT] DISTINCT FROM operator that was added in the SQL:1999 version of the ANSI standard. That means that instead of being able to write

WHERE OldValue IS DISTINCT FROM NewValue;

we are stuck with writing this more clumsy (and less efficient) variant:

WHERE OldValue <> NewValue

OR   (OldValue IS NULL AND NewValue IS NOT NULL)

OR   (OldValue IS NOT NULL AND NewValue IS NULL);

 

Implementing the IS [NOT] DISTINCT FROM clause has been suggested on Microsoft Connect two months ago by Steve Kass. There are already 49 votes in favour of this suggestion and Microsoft is considering it for SQL Server 2008 – you too can increase the chance that this will happen by adding your vote here.

 

WHERE WeAre [NOT] IN (trouble)

 

Another common cause of NULL-related woes is related to the use of [NOT] IN with a subquery. To understand the root cause of this troubles, it’s useful to first have a look at the exact meaning of [NOT] IN with a hardcoded value list.

 

In the ANSI standard, the IN clause is defined such that

WHERE Something IN (3, 5, 7);

should be considered as shorthand for

WHERE Something = 3

   OR Something = 5

   OR Something = 7;

and should therefore return the exact same result. Likewise,

WHERE Something NOT IN (3, 5, 7);

should return the same result as

WHERE NOT (Something = 3

        OR Something = 5

        OR Something = 7);

or

WHERE Something <> 3

  AND Something <> 5

  AND Something <> 7;

 

This definition is to be taken very literally. Even when a NULL is inserted in the value list, the definition still holds. So

WHERE Something IN (3, NULL, 7);

is considered as shorthand for

WHERE Something = 3

   OR Something = NULL

   OR Something = 7;

and returns the same results. If Something = 3, the result will be True (True OR Unknown OR False) and the row is included. If Something = 2, the result is Unknown (False OR Unknown OR False), and the row is not included. No surprises so far. And if Something is NULL, the result ends up as Unknown (Unknown OR Unknown OR Unknown) as well, so this row is excluded as well – this may be slightly surprising, but not really a big problem for most.

 

The nasty surprise sits in the NOT IN variant:

WHERE Something NOT IN (3, NULL, 7);

is shorthand for

WHERE Something <> 3

  AND Something <> NULL

  AND Something <> 7;

and this will never return any row at all! See, if Something is 3, the result is of course False (False AND Unknown AND True) and the row is omitted, as expected. For a NULL value of Unknown, the result is Unknown (Unknown AND Unknown AND Unknown), so this row is omitted as well. But the kicker is what happens if Something = 2 … in that case, the result is not True (as most would expect), but Unknown (True AND Unknown AND True) and this row is also omitted from the result. After all, the database can’t be sure that the missing value is accidentally equal to 2, and doesn’t want to risk returning rows that don’t qualify!

 

Mind you, I have never ever seen anyone actually use a NULL in the hardcoded value list of a NOT IN clause, so what’s the deal anyway? Well, it’s simple – the defined behaviour of a [NOT] IN (subquery) clause is to evaluate the subquery, replace it with the list of values returned from the subquery, and then evaluate the [NOT] IN according to the rules discussed above. (Note that the query optimizer is free to, and will often, choose another method of actually executing the query as long as it doesn’t change the results). So if Foo.Bar can be NULL, the result of

WHERE    Something NOT IN

 (SELECT Bar

  FROM   Foo

  WHERE  Yadda yadda);

will be just as empty as the result of

WHERE Something NOT IN (3, NULL, 7);

 

The easy workaround is of course to explicitly exclude the NULLs from the subquery:

WHERE    Something NOT IN

 (SELECT Bar

  FROM   Foo

  WHERE  Yadda yadda

  AND    Bar IS NOT NULL);

 

A different solution is to rewrite the NOT IN to a subquery with NOT EXISTS:

WHERE NOT EXISTS

 (SELECT *

  FROM   Foo

  WHERE  Yadda yadda

  AND    Bar = Something);

 

This also provides a nice workaround for the current non-existing support for row value constructors in SQL Server (also on Connect, so vote here if you deem this important). You can not write

WHERE (Something, Splunge) NOT IN

 (SELECT Bar, Splat

  FROM   Foo

  WHERE  Yadda yadda);

but you can use this:

WHERE NOT EXISTS

 (SELECT *

  FROM   Foo

  WHERE  Yadda yadda

  AND    Bar   = Something

  AND    Splat = Splunge);

 

My advice is to strictly avoid [NOT] IN (subquery) and always use [NOT] EXISTS (subquery) instead. Not only because the latter avoids the NULL issues, but also because [NOT] EXISTS can be used in cases where [NOT] IN isn’t possible. Plus, I have never seen a case where [NOT] IS performs better than [NOT] EXISTS (but I have seen the reverse).

 

There is of course nothing wrong with using [NOT] IN (value, value, …)

 

What if null if null is null null null is null?

 

I promised to answer this question – and now I see that this is already my longest blog post ever, and I’m not even near the answer yet. So I’ll just go ahead and tell you that, if null if null is null null null is null, everything is okay and the product is working as it should be, but for one caveat. Why, and what caveat, that’ll have to wait until I find the time to wrap up and post the final part of this series on NULL. (And no, I won’t keep you waiting for another two months – promise!)

Upcoming speaking events
What if null if null is null null null is null?

Related Posts

No results found.

9 Comments. Leave new

  • Great article, another reason not to use IN is when you fat-finger the column name

    Example:

    CREATE TABLE TestTable1 (id1 int)

    CREATE TABLE TestTable2 (id2 int)

    INSERT TestTable1 VALUES(1)

    INSERT TestTable1 VALUES(2)

    INSERT TestTable1 VALUES(3)

    INSERT TestTable2 VALUES(1)

    INSERT TestTable2 VALUES(2)

    SELECT *

    FROM TestTable1

    WHERE id1 IN (SELECT id1 FROM TestTable2)

    It doesn’t matter that the TestTable2 doesn’t have a id1 column, all rows will be returned

    Exist is the way to go

    SELECT *

    FROM t1

    WHERE EXISTS (SELECT * FROM TestTable2 t2 WHERE t2.id2 = t1.id1 )

    or a join

    SELECT t1.*

    FROM TestTable1 t1

    JOIN TestTable2 t2 ON t2.id2 = t1.id1

    Reply
  • Hugo Kornelis
    September 23, 2007 01:17

    Good point, Denis!

    This also shows why you should always prefix all column names with the table name (or alias) in a query that touches more than one table.

    BTW, the JOIN solution is only equivalent if TestTable2.id2 is constrained to be UNIQUE or PRIMARY KEY, otherwise it might generate duplicate rows that the other queries don’t produce.

    Reply
  • James Luetkehoelter
    September 24, 2007 15:13

    Any excellent post Hugo – ironically I was just about to post my next NULL post (hmm…that would make a good April Fool’s joke…gotta remember that).

    Reply
  • James Luetkehoelter
    September 24, 2007 15:13

    er, another, not any 🙂

    Reply
  • Further to your UNIQUE constraint note above, I have just spoken with some Microsoft techs at the PASS conference about precisely the same request (allow multiple NULL values in a UNQUE constraint).  SQL Server 2008 will allow filtered indexes and the UNIQUE keyword is valid.  You can test this in the July CTP.

    Reply
  • Hugo Kornelis
    September 30, 2007 23:29

    Hi Michael,

    Good to know that there will be some way available to us to allow multiple NULLs in a UNIQUE constraint – though I still think that this should be the default behaviour of UNIQUE constraint, as per ANSI standard.

    But if I can’t have full ANSI compatibility, I’ll grab this with all might! <g>

    Reply
  • Hugo Kornelis
    October 1, 2007 09:17

    The fourth (and final) part of this series about  NULL is now available at http://sqlblog.com/blogs/hugo_kornelis/archive/2007/09/30/what-if-null-if-null-is-null-null-null-is-null.aspx.

    Reply
  • ScottPletcher
    October 3, 2007 16:42

    Great article!  Informative and thought provoking.  Btw, within the last day I helped someone on Experts-Exchange who had indeed used a NULL in a NOT IN list, so it does happen 🙂 .

    Reply
  • excellent post, you really saved my bacon trying to solve a select where name = @name (but if not specified) select all ‘name’ values  including nulls :)…  WHERE (ProductCategoryId = @ProductCategoryId)

    OR (ProductCategoryId IS NULL AND @ProductCategoryId IS NULL)

    OR (ProductCategoryId IS NOT NULL AND @ProductCategoryId IS NULL)

    Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close