A deep dive into hash tables, part 3

Welcome back! In the previous parts, I first showed how a Hash Match (Left Outer Join) can give insight into the order of data in a hash table, and then used that trick to obtain and verify some interesting insights into the internal structure of such a table. It’s now time to see if this same trick can also be used to find hash collisions.

The foundation for this part are these two conclusions from the previous parts:

  1. Unmatched rows from the left (top) input are returned bucket by bucket, with rows that are in the same bucket returned in whatever order they are stored in that bucket.
  2. Buckets that have up to twelve rows store the rows in the order that they were received by the Hash Match operator; when there are more than twelve rows, each next set of six is inserted after the first block of six. (So with, for example, 20 rows, the order would be 1 – 6; 19 – 20; 13 – 18; 7 – 12; see the sketch after this list.)
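
To make that second rule concrete, here is a minimal sketch (my own addition, not from the earlier parts) that computes the described return order for any number of rows, assuming the blocks-of-six behavior holds exactly as stated:

DECLARE @Rows int = 20;

SELECT   s.value           AS InsertOrder,      -- order in which the row arrived
         (s.value - 1) / 6 AS BlockNo           -- which block of six it belongs to
FROM     GENERATE_SERIES(1, @Rows) AS s
ORDER BY CASE WHEN (s.value - 1) / 6 = 0
              THEN 0
              ELSE 1
         END,                                   -- the first block of six stays first
         (s.value - 1) / 6 DESC,                -- later blocks in reverse block order
         s.value;                               -- arrival order within each block

For @Rows = 20, this returns 1 – 6, then 19 – 20, then 13 – 18, then 7 – 12, matching the example above.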

How to recognize a hash collision

In the previous part, I deliberately used many rows with the same join column, to ensure that these rows go into the same bucket. However, a hash collision happens when rows with different values in the join columns still end up in the same bucket. This can of course only happen when the join column (or columns) has more than one distinct value.
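
As a toy illustration only (SQL Server’s actual hash function and number of buckets are undocumented), imagine a hypothetical table with eight buckets and a simple modulo as the hash function; the values 3 and 11 would then collide, while 5 gets a bucket of its own:

SELECT v.value,
       v.value % 8 AS BucketNo   -- hypothetical "hash"; not SQL Server's real hash function
FROM   (VALUES (3), (11), (5)) AS v (value);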

Getting ordered data

But it’s not enough to have data where hash collisions are possible. We need a way to actually recognize them. In order to make this possible, we’ll use two copies of each value in the join column, but make sure that they don’t follow each other immediately when the hash table is built. We can do so by adding a second column, and then ordering by that column, for instance as shown in this query:

DECLARE @NumRows   int = 200,
        @NumCopies int = 2;
WITH BuildInput
AS (SELECT     TOP (@NumRows) j.value AS JoinCol,
                              o.value AS OrderCol
    FROM       GENERATE_SERIES(1, @NumRows / @NumCopies) AS j
    CROSS JOIN GENERATE_SERIES(1, @NumCopies)            AS o
    ORDER BY   OrderCol,
               JoinCol)
SELECT *
FROM   BuildInput;

Note that there are no official order guarantees for this code. The ORDER BY is in a subquery, not at the end of the query, so the optimizer does not have to observe it for the final output. It does have to use it to satisfy the TOP requirement, though. You and I both know that we actually always return all rows, but the optimizer does not know this. And we see that in the execution plan:

The execution plan for the query shows a Nested Loops that combines data from two Table Valued Function operators and sends its results into a Top operator. The Top operator returns to SELECT.

The optimizer has swapped the inputs, to ensure that the Nested Loops returns the data in the required order already, without having to add a Sort operator to the execution plan. The Top operator then returns the first 200 rows from this sorted data – actually all of them.

While the query does not guarantee any order of the output, this execution plan does! The Top operator preserves order, and there are no other operators in the execution plan that can change the order. So, as long as we get this specific execution plan (and note that this is not guaranteed!), we will get the generated data – @NumRows / @NumCopies distinct values for the join column, each with @NumCopies copies numbered from 1 in the order column – in ascending order of OrderCol, JoinCol. If you look at the results, you will see this confirmed. First all 100 rows with OrderCol = 1, in order of JoinCol 1 up to 100; and then the next 100 rows with OrderCol = 2.

What’s the point?

So now that we have a query that can generate any amount of data in that specific order, you might wonder: what’s the point?

Well, let’s start with a simple example. Let’s set @NumRows to 4, while leaving @NumCopies at 2. That will of course give us this output:

The query results show two columns, JoinCol and OrderCol; and four rows, with values 1, 1; 2, 1; 1, 2; and 2, 2, in that order.

If we now send these rows, in this order, to the build input of a Hash Match, then what we would normally expect is that first, the first row goes into the bucket of whatever the hash of join column 1 is, then the second row becomes the first row in the bucket for the hash of the value 2. After that, rows 3 and 4 become the second row in each of those buckets. So if we then use the now familiar trick of a left outer join without matches to look at the data in the order of the hash table, we would either first get the two rows for JoinCol = 1 and then the two for JoinCol = 2, or the other way around.

Well, except when the values 1 and 2 are a hash collision, in which case all these rows go to the same bucket. We are under twelve rows in total, so they are in the bucket (and returned) in the same order as they were entered into the hash table, which is the order shown above. That order is only possible when 1 and 2 are in the same bucket; the rows would never be returned in that order if they are in different buckets. So this is how we can now identify hash collisions, even without knowing the hash function and without using a debugger to actually look at the memory structures of the hash table.
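
To spell out that reasoning, these are the only orders (shown as JoinCol, OrderCol pairs) that this four row example can produce:

  1. 1, 1; 1, 2; 2, 1; 2, 2 – no collision, and the bucket holding 1 happens to be returned first.
  2. 2, 1; 2, 2; 1, 1; 1, 2 – no collision, and the bucket holding 2 happens to be returned first.
  3. 1, 1; 2, 1; 1, 2; 2, 2 – the insertion order, which is only possible when 1 and 2 share a bucket.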

Put the theory to test

It’s time to put the theory to test. I will use the same BuildInput CTE as in the previous examples, but now match it to a probe input that has no matches. (Note that I do need to have at least two rows in the probe input, to prevent the optimizer from simplifying the query).

DECLARE @NumRows   int = 4,
        @NumCopies int = 2;
WITH BuildInput
AS (SELECT     TOP (@NumRows) j.value AS JoinCol,
                              o.value AS OrderCol
    FROM       GENERATE_SERIES(1, @NumRows / @NumCopies) AS j
    CROSS JOIN GENERATE_SERIES(1, @NumCopies)            AS o
    ORDER BY   OrderCol,
               JoinCol)
SELECT               *
FROM                 BuildInput
LEFT OUTER HASH JOIN (   VALUES (-1), (-2)) AS ProbeInput (JoinCol)
  ON ProbeInput.JoinCol = BuildInput.JoinCol;

Before we even look at the output, we need to verify the execution plan, to make sure that the optimizer has not made any unexpected changes that might ruin our plans. On my laptop, I get this execution plan:

The execution plan shows a Hash Match operator. Its build input comes from a Top that draws from a subtree where two Table Valued Function operators are joined with Nested Loops, and then sorted in a Sort operator before being passed to the Top. The probe input of the Hash Match is just a simple Constant Scan.

We do see a Sort as the child of the Top operator this time. The reason is the HASH join hint in the query. Every join hint has the side effect of enforcing the order of all tables and joins in the query, similar to the FORCE ORDER query hint. So now that the optimizer is no longer allowed to swap the order of the two GENERATE_SERIES expressions to get the data in the correct order, it has to explicitly sort the data to make sure that the Top operator then gets the correct rows. (If I had cared about performance, I could have swapped the two expressions in the query. But this is just a demo, and it runs fast enough for me.)
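
For completeness, a sketch of what that swapped version could look like (my own variation; whether the optimizer then really omits the Sort is not guaranteed, so the plan would still need to be checked):

DECLARE @NumRows   int = 4,
        @NumCopies int = 2;
WITH BuildInput
AS (SELECT     TOP (@NumRows) j.value AS JoinCol,
                              o.value AS OrderCol
    FROM       GENERATE_SERIES(1, @NumCopies)            AS o   -- now listed first
    CROSS JOIN GENERATE_SERIES(1, @NumRows / @NumCopies) AS j
    ORDER BY   OrderCol,
               JoinCol)
SELECT               *
FROM                 BuildInput
LEFT OUTER HASH JOIN (   VALUES (-1), (-2)) AS ProbeInput (JoinCol)
  ON ProbeInput.JoinCol = BuildInput.JoinCol;

The rest of this post continues with the original, unswapped version shown above.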

But the most important observation is that the data flows straight from the Top into the build input of the Hash Match. The input to the Top is ordered, and Top does not change the order, so we now have confirmation that Hash Match does indeed receive the data in the expected order. With that knowledge, it is now interesting to look at the results:

Four rows of data, with three columns: JoinCol, OrderCol, and JoinCol again. The order of the four rows is: first two rows with JoinCol = 1 (with OrderCol 1 and 2, in that order); then two more rows with JoinCol 2 (and the same data for OrderCol).

The results above confirm that the values 1 and 2 in JoinCol are not a hash collision, and that, apparently, 1 hashes to a bucket that is returned before the bucket for 2.

Finding the collisions

While it is in theory possible to adapt the query above to test whether any given pair of values has the same hash value, doing so would be very impractical and highly time consuming. So instead, we can simply increase @NumRows to try to find all hash collisions in the specified range. For instance, this is the first part of the output for @NumRows = 100 (and @NumCopies still set to 2):

The picture shows the first 13 rows of the output of the query. Rows 7 - 10 are highlighted.

Most of the data in the picture above is as expected: two subsequent rows for a single JoinCol value, with OrderCol equal to 1 and 2. But the highlighted rows form a different pattern. Here, we first see two rows with OrderCol 1 for JoinCol 9 and 15, and then two more rows for 9 and 15 but now with OrderCol 2. This pattern proves that the values 9 and 15 must hash to the same bucket. We found a hash collision. If you scroll down through the list, you might see a few more.
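
If you ever do want to double check one specific pair, such as 9 and 15, a variation like this sketch might work (my own adaptation, with the usual caveat that the execution plan must still feed the build input in the sorted order for the conclusion to hold):

WITH BuildInput
AS (SELECT   TOP (4) v.JoinCol,
                     v.OrderCol
    FROM     (VALUES (9, 1), (15, 1), (9, 2), (15, 2)) AS v (JoinCol, OrderCol)
    ORDER BY v.OrderCol,
             v.JoinCol)
SELECT               *
FROM                 BuildInput
LEFT OUTER HASH JOIN (   VALUES (-1), (-2)) AS ProbeInput (JoinCol)
  ON ProbeInput.JoinCol = BuildInput.JoinCol;

If the output interleaves the two values (9, 15, 9, 15, grouped by OrderCol), they share a bucket; if it groups them (9, 9, 15, 15, or the other way around), they do not.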

Note that the picture above shows the data as it is returned on my system, under its current workload, on my current CU level of SQL Server 2022. You might see different results on your system. Even I might see different results if I patch my system, or perhaps even when available memory or other factors change. The actual hash functions used are not documented, and they can change without notice.

More values

Frankly, I must admit that I was surprised to find several collisions already in a set of just 50 distinct key values. I had expected to need way more input rows. But the data doesn’t lie, and the results are very clear. However, it can still be interesting to look in larger sets. Perhaps we can find examples of collisions of three, four, or even more values!

But increasing the data set introduces new challenges. For instance, when I set @NumRows to 100,000, I get results that can absolutely not be explained by anything we have learned so far:

The first 8 rows of the query results show JoinCol 8893, 8896, 8899, 8902, 8905, 8913, 8916, and 8919, in that order. All have OrderCol equal to 2.

And even stranger: on subsequent executions, the data is sometimes returned in yet another order, which sometimes does and sometimes does not match what we have seen so far. All of this seems rather weird – unless we also look at the execution plan. Here is what we see on the first execution:

The same execution plan as before, but the actual number of rows on all operators is much higher, and there is a warning symbol on the Hash Match and the Sort operators.

The estimates are the same as before: 50 for each of the GENERATE_SERIES calls, and 100 for the Top. These are fixed estimates, and the cardinality estimator only deviates from them under very specific circumstances that do not occur in this case. The result is that the required memory for the Sort is based on 2,500 input rows, and for the Hash Match on 100 input rows. For this execution, that is a gross underestimation, which causes both the Sort and the Hash Match to spill to tempdb. We can see that in the execution plan by the exclamation marks, and you can get more information by hovering over the operators.
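
These fixed guesses are easy to see in isolation. The estimated execution plan for this stripped down sketch (my own addition) shows 50 rows for each Table Valued Function, so 2,500 after the join, and 100 rows for the Top, no matter what values are assigned to the variables:

DECLARE @NumRows   int = 100000,
        @NumCopies int = 2;

SELECT     TOP (@NumRows) j.value AS JoinCol,       -- Top guessed at 100 rows
                          o.value AS OrderCol
FROM       GENERATE_SERIES(1, @NumRows / @NumCopies) AS j   -- guessed at 50 rows
CROSS JOIN GENERATE_SERIES(1, @NumCopies)            AS o;  -- guessed at 50 rows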

In this specific case, I really don’t care about the spill of the Sort. Yes, it slows down execution. But, again, this is just demo / research work on my personal laptop, and the query finishes fast enough. Spill or no spill, Sort returns data in the specified order, which is then retained by Top, so we know that the rows are loaded in the hash table in the expected order. That is all that matters.

The spill of the Hash Match is a different issue. I won’t go into the details here, as those are out of scope for this post, but the bottom line is that a spill causes a Hash Match operator to return data in a very different order than a Hash Match that never runs out of memory.

Getting rid of the spill

If Memory Grant Feedback is enabled on your system, then this problem should disappear after a few executions. On my laptop it did. Sometimes. And sometimes it just randomly resurfaced, because for reasons I really don’t understand, SQL Server sometimes throws queries and execution plans out of the plan cache. (People told me that this is because I run on a laptop with a home OS, not on a production server with a server OS, which means other processes might be requesting memory and SQL Server has to yield some. However, I have plenty of unused memory on my laptop, and no memory-hungry applications running, so I’m not sure if that explanation is correct.)
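
If you want to check whether Memory Grant Feedback is available and enabled on your own test database, something like this sketch should work (assuming SQL Server 2019 or later and a sufficiently high database compatibility level; the ALTER statement is commented out on purpose, and none of this is meant for production):

-- Inspect the memory grant feedback related database scoped configurations.
SELECT name,
       value
FROM   sys.database_scoped_configurations
WHERE  name LIKE '%MEMORY_GRANT_FEEDBACK%';

-- On a test database only, row mode memory grant feedback could then be toggled like this:
-- ALTER DATABASE SCOPED CONFIGURATION SET ROW_MODE_MEMORY_GRANT_FEEDBACK = OFF;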

So I can avoid spills and get the interesting order if I run the query multiple times and then hope it was not evicted from the cache at just the wrong time. That does not seem to be a very reliable way to run my experiments. Let’s try something else. Adding a RECOMPILE hint to the query makes the optimizer inline all variables, which should result in correct estimates for this execution plan. Sadly, it turns out that this works for a TOP(@variable) expression, but not for a variable passed into GENERATE_SERIES. When I set @NumRows to below 2500, for instance to 1000, I get an execution plan like this:

The estimates in the execution plan show that TOP is estimated to return 1000 rows, but the GENERATE_SERIES is still estimated at 50 rows for each, so 2500 after joining.

In comparison to the previous execution plans, the Top and Sort operators are now collapsed into a single Sort (Top N Sort). That is just a small optimization, not a functional difference. The only reason this was not done in the previous execution plans is that Sort (Top N Sort) only works with a constant value, whereas the Top operator also works with an expression (such as @NumRows) as the number of rows to return.

But let’s look at the cardinality estimates. For the Sort (Top N Sort), the estimate is now indeed 1000, exactly the value we passed in for @NumRows. The recompile hint allowed SQL Server to inline it. However, each of the two Table Valued Function operators, which in this case implement the GENERATE_SERIES function, is still estimated to return 50 rows per execution, even though the recompile hint should have allowed SQL Server to sniff the values and optimize them for 500 and 2 rows, respectively.

This means that, if I return to my intended 100,000 rows, I still don’t avoid a spill of the Hash Match:

The execution plan has the same shape as before. The estimate for the Sort (Top N Sort) is not 100,000, though, but only 2,500. The Hash Match operator has a warning symbol.

Even though we already established that SQL Server will inline the variable for the TOP expression, the Sort (Top N Sort) does not have the expected estimate of 100,000 rows. It is estimated at just 2,500. The reason for this is that the variables for GENERATE_SERIES are not inlined; they are still compiled with their fixed estimate of 50 rows each. That means that their join is estimated to produce 2,500 rows. Even though the optimizer knows that the TOP expression asks for 100,000 rows, it believes the input is only 2,500 rows, and a TOP can never add rows to its input.

So the estimate is still far too low, which still results in a spilled Hash Match.

Hard coded values

Apparently, GENERATE_SERIES always uses an estimate of 50, unless all its parameters are hard coded constants. So, the way to get the optimizer to do what we want is to use hard coded constants as parameters to GENERATE_SERIES. That means I now change the query to look like this:

DECLARE @NumRows int = 100000;
WITH BuildInput
AS (SELECT     TOP (@NumRows) j.value AS JoinCol,
                              o.value AS OrderCol
    FROM       GENERATE_SERIES(1, 50000) AS j
    CROSS JOIN GENERATE_SERIES(1, 2)     AS o
    ORDER BY   OrderCol,
               JoinCol)
SELECT               *
FROM                 BuildInput
LEFT OUTER HASH JOIN (   VALUES (-1), (-2)) AS ProbeInput (JoinCol)
  ON ProbeInput.JoinCol = BuildInput.JoinCol
OPTION (RECOMPILE);

This is far less elegant. If I increase or decrease @NumRows, I now also need to manually adjust the stop value for the first GENERATE_SERIES. But at least, I now get an execution plan where all the estimates are completely correct. But … the execution plan now also went parallel on my laptop! And that makes sense. Now that the optimizer knows that it has to join 50,000 rows to 2 rows, then sort the resulting 100,000 rows, and then load them all in a hash table, the estimated query cost rises above 5, the default cost threshold for parallelism that I usually leave unchanged on my personal laptop. (Note that I agree with the common advice to increase this default for production systems!)
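
For reference, this is how the current instance wide setting can be inspected (a sketch; ‘cost threshold for parallelism’ is an advanced option, so it is only visible after enabling ‘show advanced options’):

EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'cost threshold for parallelism';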

Now, is this parallel plan bad? Well, yes. It is! Because, when results are produced on multiple threads, there is no way for us to control the order. The order of results on each thread will still be as expected, and should allow us to do our investigation. But we can’t control how fast each thread runs. And this affects the order of the results. We might get three rows from thread 4 first, then six rows from thread 2, then another row from thread 4, then twenty from thread 1, and so on. (Okay, in reality, it would go by packet and not by individual row, but the basic principle still applies.) And that means that the order of the rows in the output will change on each subsequent execution, and is no longer representative of the order of the data in the hash table.

One way to get rid of this (for this case) unwanted parallelism would be to increase the cost threshold for parallelism on my laptop. But then I risk a demo failure the next time I do a presentation at a conference if I forget to change it back. So I use the far easier method to just slap on a MAXDOP 1 query hint and be done with it. (Again, this is my personal laptop. None of this is good advice for production systems. In fact, I would be highly concerned if you try to reproduce any of the queries I have in my blog posts on a production system).

DECLARE @NumRows int = 100000;
WITH BuildInput
AS (SELECT     TOP (@NumRows) j.value AS JoinCol,
                              o.value AS OrderCol
    FROM       GENERATE_SERIES(1, 50000) AS j
    CROSS JOIN GENERATE_SERIES(1, 2)     AS o
    ORDER BY   OrderCol,
               JoinCol)
SELECT               *
FROM                 BuildInput
LEFT OUTER HASH JOIN (   VALUES (-1), (-2)) AS ProbeInput (JoinCol)
  ON ProbeInput.JoinCol = BuildInput.JoinCol
OPTION (RECOMPILE, MAXDOP 1);

And now, we finally get the execution plan that we want, with no hash spills even on the first execution, and no parallelism:

Once more the same execution plan shape as before, but now all estimates match perfectly with the actual numbers, and there are no exclamation symbols anywhere.

We overcame all challenges. We got the 100,000 desired rows to be sorted as specified, and then loaded into the hash table, which does not spill. The probe phase of the left outer join finds no matches. The final phase then returns all data in the hash table, which means that, once more, the order of the data gives us a lot of information about how the data was stored in the hash table.

The first 14 rows from the query results have these values for JoinCol and OrderCol: 22814, 1; 28387, 1; 22814, 2; 28387, 2; 1566, 1; 15675, 1; 35526, 1; 49635, 1; 1566, 2; 15675, 2; 35526, 2; 49635, 2; 37092, 1; 37092, 2.

We already see a group of two values that have a hash collision, then a group of four, and then a single value that apparently does not collide with any other key value. But the full result set is 100,000 rows. Analyzing all this output by eye is a very time consuming (and equally mind numbing) task. Would it not be possible to use the power of SQL Server to do further analysis on this data? Of course!

However, since this post is already quite long, I have decided to change my original plans and move that automated analysis to the next episode.

Coming next

We have found a way to find hash collisions, by looking at the results returned from a query that was specifically crafted for this purpose. But for larger results, we don’t want to look over the entire output. We want our computer to do the heavy lifting for us! That will be the topic of the next episode.

And after that, it is finally time to dive into the structure of data that spills from the hash table to disk. Or, at least, to give it a try. Success not guaranteed!

 
