In part 1 of this series, I laid the foundation for exploring the structure of the hash table, as used by the Hash Match operator, by alleging and then proving that a Hash Match (Left Outer Join) returns unmatched rows from the build input in the order in which they are stored in the hash table. This means that we can create queries on carefully curated data to gain insight into the structure of that hash table.
It is now time to use that trick to actually start exploring the hash table. But not without also looking at the available documentation and applying some common sense.
Available documentation
The first time I investigated the internal structure of hash tables was in 2017. At that time, I found a paper that mentioned some of the internals of hash tables as implemented in SQL Server. If I had been smart, I would have saved the URL or title of that whitepaper, and then I would have linked it here. But I was not smart enough to think of that back then, and I was unable to find it again later.
I do still know what I learned from reading the whitepaper. Not the full structure. That was not described. But it did mention three very important facts that were a great starting point.
- The hash table does not store the actual row data. The row remains wherever it resides in memory. The hash table merely holds a pointer to each row.
- The hash table is built by allocating a block of memory that holds, for each bucket, one block of up to six rows (or rather, pointers to rows).
- If a bucket has to store more than six rows (pointers), then additional blocks of six pointers are allocated and connected to each other as a linked list (see the sketch below).
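To make this a bit more tangible, here is a minimal sketch, in C-style pseudocode, of what such a block of pointers might look like based on the facts above. The struct and member names are my own invention, purely for illustration; this is not actual SQL Server source code.

/* Purely illustrative sketch, not actual SQL Server source code.
   One block of row pointers: room for up to six pointers to rows,
   plus a pointer that links to the next (overflow) block for the
   same bucket, if there is one. */
typedef struct PointerBlock
{
    void                *RowPointer[6];   /* pointers to the actual rows, wherever they reside in memory */
    struct PointerBlock *NextBlock;       /* linked list pointer to an overflow block, or NULL */
} PointerBlock;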
Why six?
I was surprised to see the statement that (pointers to) rows were always added in blocks of six. Why this number?
We all know that computer engineers love powers of two. Powers of two are very easy to work with in low-level code. Allocating space in units of 2, 4, 8, or 16 pointers at a time makes much more sense than using 6 at a time.
However, we must keep in mind that there is overhead. The blocks are linked together in a linked list. That means that every block must include a pointer to the next block. That is also a pointer to a memory location, so it takes up the same amount of space as a pointer to a row. This means that we actually need room for 7 pointers for each 6 rows in the hash table.
But then it would still make more sense to allocate space for 8 pointers at a time, and then store 7 row pointers plus 1 linked list pointer in each. So why 6 and not 7?
This is where I have to speculate. I don’t believe that Microsoft actually chose to allocate memory for 7 pointers each time. I also refuse to believe that they allocated 8 and then never used the last bytes. There has to be a reason. My theory is that the eighth set of bytes is used to store some status information that helps to maintain the structure.
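To make the arithmetic explicit (using the 4-byte pointer size that I will assume in the next section): six row pointers, one linked list pointer, and one extra slot for status information add up to a nice power of two:

6 × 4 bytes + 1 × 4 bytes + 1 × 4 bytes = 32 bytes = 2^5 bytes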
Some common sense
Let’s for now assume that what I read in that paper was correct at that time, and is still correct now; and also that my assumption that the actual allocation uses space for eight pointers is correct. And let’s then look at what this means for the build phase, when rows from the build input are read, and then stored in the bucket of their hash.
When the operator is initialized, the base structure of the hash table is allocated. This has to be a table with one entry for each possible hash result. So if a hash function is used that produces values between 0 and 65,535, then this means that we need memory for 65,536 entries. Each of those entries is 32 bytes: enough space for six data pointers (24 bytes), one linked list pointer (4 bytes), and a final 4 bytes for either a reverse link pointer or status info. It makes sense to assume that this is a contiguous block of memory, because that is what makes hash tables easy to navigate: compute the hash for a row, multiply it by 32, then add that to the base address of the hash table, and you have the location of the bucket for that hash value.
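To illustrate just how cheap that navigation is, here is a small sketch in the same illustrative C style, under the assumptions above: one contiguous array of 32-byte buckets, indexed directly by the hash value. The function name is, again, my own invention, not SQL Server's actual code.

/* Illustrative sketch only, based on the assumptions in the text:
   the hash table is one contiguous array of 32-byte buckets. */
#include <stddef.h>

#define BUCKET_SIZE 32

unsigned char *FindBucket(unsigned char *hashTableBase, unsigned int hashValue)
{
    /* Compute the hash for the row elsewhere, then multiply it by 32
       and add it to the base address of the hash table. */
    return hashTableBase + (size_t)hashValue * BUCKET_SIZE;
}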
So, in graphical form, the hash table now looks like this:
Note that I visualized this for a hash table with just 12 buckets, which is completely unrealistic, but a lot easier to see than a more realistic hash table with several million buckets. Each block (row) shows eight cells that each represent four bytes. The white cells are for pointers to data rows, the orange cells are for the linked list pointer to the next block of pointers, and the blue cell is for either status info or a link to the previous block. The numbers to the left of each row of cells are for your understanding only. They represent the hash value that the block corresponds to, but, as explained above, they are not stored in the hash table.
Adding the first rows
The hash table is still empty at this time. But that changes when the operator starts its build phase. Each time a row is read, the columns listed in the Hash Keys Build property get hashed, and then a pointer to that row is stored in the first free position in the block of six cells for that hash value. So, after processing the first few rows, the structure might for example look like this:
Every black dot in this picture represents a pointer, and the arrow indicates where it points. In this case, nine rows have been loaded; one of them hashed to bucket 1, six hashed to bucket 4, and two hashed to bucket 8. The data rows are still randomly scattered in the buffer pool or in working memory, but the pointers in the buckets provide the required structure.
Overflow blocks
At this point, the block of cells for bucket 4 is full. What do we need to do when the next row in the build input also hashes to this value? The answer is simple. We allocate a new block of 32 bytes, and set the orange cell of bucket 4 to point to this overflow block. We can then put the pointer to the new row in the first cell of this overflow block. So now the hash table looks like this:
For readability, I have removed the representation of the buffer pool / working memory from the picture. It still exists, of course, and every dot in a white cell still points to a row in working memory or in the buffer pool. Keep that in mind when you look at the picture.
Extending the chain
Let’s assume that the next six rows from the input all also hash to the value 4. The first five of them can be stored in the overflow block. But then we are out of space again, so we need to allocate yet another overflow block, and add it to the linked list. The instinctive human way to do this would be as follows:
For humans, the logical thing to do is to always add new data at the end of the list, to retain the order. But SQL Server does not need to follow human logic. SQL Server mainly wants to be fast. Because the hash function itself completely reorders the data, any order that might have been in the input data is lost anyway. So what benefit would there be to retaining the order of rows within each bucket? None!
Adding a new overflow block at the end of the list requires SQL Server to navigate the entire list first. If there are already 600 rows in the bucket, then they are stored in the base block plus 99 overflow blocks. So we would need to traverse 98 pointers before we reach the last overflow block, at which point we add the next block. And after that, we’ll have to traverse 99 pointers every time we need to find a free cell to store the next row. That is not efficient.
So SQL Server does not add the overflow block in the humanly logical place. Instead, it adds the overflow block at the most efficient possible place, which is exactly in between the base block and the first overflow block. A new block gets allocated, that new block then gets its pointer set to what was until now the first overflow block, and then the base block's pointer is changed to point to this new block. The result then actually looks like this:
With this method, it does not matter how many rows there are in the bucket. The block currently being filled is always either the base block, or the most recently added overflow block, which is the one that the base block points to. We never need to navigate further during the build phase, because we know that all overflow blocks after that are full. If the overflow block that the base block points to has space, we can add a pointer to a row there. If not, we need to add a new overflow block. Even if there are a million overflow blocks, we never need to read them during the build phase.
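Here is a sketch, in the same illustrative C style, of what adding a row pointer could look like with this "insert the new overflow block directly after the base block" strategy. All names and details are my own invention, not SQL Server's actual code; the only point is to show that the build phase never has to follow more than one link, no matter how full the bucket already is.

/* Illustrative sketch only; names and details are my own invention,
   not actual SQL Server code. */
#include <stdlib.h>

#define ROWS_PER_BLOCK 6

typedef struct PointerBlock
{
    void                *RowPointer[ROWS_PER_BLOCK];
    struct PointerBlock *NextBlock;
} PointerBlock;

/* Store a row pointer in the first free cell of a block; report failure if the block is full. */
static int TryStoreInBlock(PointerBlock *block, void *row)
{
    for (int i = 0; i < ROWS_PER_BLOCK; i++)
    {
        if (block->RowPointer[i] == NULL)
        {
            block->RowPointer[i] = row;
            return 1;
        }
    }
    return 0;
}

void AddRowToBucket(PointerBlock *baseBlock, void *row)
{
    /* The block currently being filled is always either the base block
       or the overflow block that the base block points to. */
    if (TryStoreInBlock(baseBlock, row))
        return;
    if (baseBlock->NextBlock != NULL && TryStoreInBlock(baseBlock->NextBlock, row))
        return;

    /* Both are full: allocate a fresh overflow block and insert it
       directly between the base block and the old first overflow block. */
    PointerBlock *newBlock  = calloc(1, sizeof(PointerBlock));  /* error handling omitted */
    newBlock->NextBlock     = baseBlock->NextBlock;
    baseBlock->NextBlock    = newBlock;
    newBlock->RowPointer[0] = row;
}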
Trust, but verify
In the picture below, I have replaced, for bucket 4 only, the dots that represent the pointers to rows with numbers that represent their order in the input:
At this point, if you read the data in bucket 4, you first encounter the six rows that were added first. But after that, because the base block's pointer goes to the most recently added overflow block, you get row 13, plus five empty cells. That block then points to the previously added overflow block, which holds the seventh through twelfth rows.
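A short sketch of how the final phase could walk one bucket makes it clear why this order falls out. It reuses the PointerBlock struct from the previous sketch and is, as before, my own illustration rather than actual SQL Server code.

/* Illustrative sketch only: emit all rows in one bucket in the order
   in which they sit in the hash table (PointerBlock as defined above). */
void ScanBucket(PointerBlock *baseBlock, void (*emitRow)(void *row))
{
    for (PointerBlock *block = baseBlock; block != NULL; block = block->NextBlock)
    {
        for (int i = 0; i < ROWS_PER_BLOCK; i++)
        {
            if (block->RowPointer[i] != NULL)
            {
                /* Base block first, then the newest overflow block,
                   then the one added before that, and so on. */
                emitRow(block->RowPointer[i]);
            }
        }
    }
}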
The experiment
All the above is theory and speculation. Granted, it is all based on documentation that I can't find anymore, and on common sense. But we have not verified any of it.
But this is where what we learned in the first post of this series comes in handy! We know that a left outer join returns unmatched data in the order in which it is stored in the hash table. So when we make sure that none of the rows in bucket 4 are matched, they will all be returned in the final phase, in the order of the hash table. If our prediction is correct, that would be rows 1 to 6 first, and then the rest of the data in blocks of six rows, in roughly reverse order. So, all we need to do is find a way to get enough rows into the same bucket.
There are two ways to do that. One is to find a hash collision: different values that hash to the same value. Without knowing the hash function, that seems impossible – although I'll still look at this in the next episode. But for now, we'll stick to a simpler approach: multiple rows with the same value in the columns to be hashed, so that they are guaranteed to hash to the same value.
-- Create table variables to hold the data
-- (when using constants, the optimizer insisted on simplifying the query)
DECLARE @BuildInput table
    (JoinCol  char(1),
     OrderCol int);
DECLARE @ProbeInput table
    (JoinCol char(1));

-- Sample data - 25 rows, all with JoinCol = 'A'
INSERT INTO @BuildInput (JoinCol, OrderCol)
SELECT 'A', value
FROM   GENERATE_SERIES(1, 25);

-- Verify that a Table Scan reads the data in ascending order
--SELECT *
--FROM   @BuildInput;

-- Probe input remains empty, which guarantees no matches

-- Now run a query that returns the data in hash table order
SELECT *
FROM   @BuildInput AS BuildInput
       LEFT OUTER JOIN @ProbeInput AS ProbeInput
          ON ProbeInput.JoinCol = BuildInput.JoinCol
OPTION (HASH JOIN, FORCE ORDER);
The query above does exactly that. I use GENERATE_SERIES to generate 25 rows, numbered from 1 to 25. If you want to run this on a version before SQL Server 2022, you will have to change this. The commented-out query that reads only from @BuildInput is there to verify that a Table Scan operator (check the execution plan to make sure that this is indeed the operator used!) returns the rows in ascending order. The last query uses the same trick as in the first part to peek into the structure of the hash table. Since we join on JoinCol only, and this column is the same for all rows, we know that all rows end up in the same bucket.
Predicted and actual output
Based on the theories above, we should expect to get rows 1 – 6 first, because they are loaded into the base block. Rows 7 – 12 went to the first allocated overflow block. Because all later overflow blocks are inserted between the base block and the first overflow block, this block has remained the last in the chain, so these rows should come last. Rows 13 – 18 should come just before them, and so on.
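Spelled out in full, and assuming all of the above holds, the predicted order of the OrderCol values in the output is:

1, 2, 3, 4, 5, 6, 25, 19, 20, 21, 22, 23, 24, 13, 14, 15, 16, 17, 18, 7, 8, 9, 10, 11, 12.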
Before looking at the output, let’s double check the execution plan to see that it does not do anything unexpected.
The execution plan confirms that the build input gets its rows from a Table Scan on the @BuildInput table. The simple SELECT query on that table, and its execution plan, prove that the rows are received in ascending order of the OrderCol column. This means that the numbers in the OrderCol column of the output represent the order in which the rows were received by the Hash Match operator.
Time to look at the results.
If you look at these results, you see that they exactly match the expected order. The first six rows are in the base block. After that comes the last allocated overflow block, with row 25 and five empty spots. The block allocated just before that comes next, with rows 19 – 24; then come rows 13 – 18, and the first allocated overflow block, with rows 7 – 12, comes last.
Coming next
We have confirmed what we had deduced, based on available literature and some common sense about how data gets inserted. Even without using a debugger, we were able to get a fascinating (yet probably useless) peek into the internal structure of the hash table.
As mentioned in the previous episode, we can use this knowledge to find hash collisions in our data. But that is for the next episode.
2 Comments
If the bucket list is not a simple linked list but a doubly linked list, that would explain the discrepancy between six elements plus one overhead pointer and 2^3.
A singly linked list can be traversed in one direction only. A doubly linked one has a pointer to the previous element too, so one can traverse it both ways. This is usually a desirable property for moving around in an overflowing bucket to look up near-neighbours.
Thanks for your comment, vP.
I have also entertained the idea of a doubly linked list. I even mentioned it in the draft version of this post. But I removed it, because it makes no sense.
Yes. In a generic scenario, a doubly linked list can offer benefits. But none of these apply to hash tables. The only operations that are carried out on Hash Match's hash table are: add a row to its bucket, search a bucket for all rows with specific values in the hash columns, and read all rows from a bucket. Each of those operations can be done perfectly well with a simple (singly) linked list. The only scenario where a doubly linked list would help is when adding rows while the order needs to be retained. But as the demo in this blog shows, that order is not retained.
That’s why I removed all mentions of the doubly linked list from my draft before posting it.
But I do appreciate the theory. And who knows? Perhaps you are right after all.