This is the fourth post in the plansplaining series. Each of these blog posts focuses on a sample execution plan that exposes an uncommon and interesting pattern, and details exactly how that plan works.
In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in the pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you did not read part 1 already, I suggest you start there.
In episode #3, I promised that today’s post would be about bitmaps. However, bitmaps work only in parallel plans, and as I was working on that post I noticed that I needed to include a lot of explaining of the parallelism itself. So much, in fact, that just covering the parallelism issues by itself already makes for a lengthy post. So I decided to slightly change the example and focus on only parallelism today. Next month I will really cover those bitmaps!
A lot of the operators I touch upon in this post use some form of hashing. So here, for those who do not know this already, is a very short (and mostly simplified) explanation of what hash functions and hash tables are.
A hash function is a function that takes any input and converts it to a number in a predefined range, in such a way that the same input always results in the same number. Different inputs will sometimes also result in the same number; this is called a hash collision. Hash functions with a large range are typically optimized to minimize the chance of hash collisions. For hash functions with a small range, hash collisions are unavoidable. These hash functions are typically optimized to result in a balanced distribution.
A hash table is a table where rows are stored, organized by the result of applying a hash function to one or more columns. For each possible result in the range of the hash function, a container for rows (called a hash bucket, or just bucket) is allocated. All rows for which the hash function resolves to 1 are stored in bucket number 1; rows for which the hash function is 2 go to bucket 2, etcetera. Rows with the same values in the input of the hash function will always land in the same bucket. Rows with a different input for the hash function will usually land in different buckets, but may still end in the same bucket if there is a hash collision.
When SQL Server uses hash tables, they are stored in memory. If the amount of data exceeds the allocated memory, tempdb will be used as a temporary holding place for overflow data; this is called spilling and it generates “hash warnings” that are shown on the execution plan and can be caught in Extended Events or Profiler Trace sessions. A detailed discussion of hash table spills is beyond the scope of this post.
I normally like to take all my examples from the AdventureWorks sample database. But that database is too small to get parallel execution plans without setting the parallelism threshold to some weird non-representative low value. So I decided that for this post I am going to use one of Microsoft’s other standard demo databases, the ConsosoRetailDW sample database. This will also allow me to add in the bitmap operators in the next episode with only a small change in the query, something I could probably not achieve in AdventureWorks.
SELECT ds.StoreManager, dp.BrandName, SUM(fos.TotalCost) FROM dbo.FactOnlineSales AS fos LEFT JOIN dbo.DimStore AS ds ON ds.StoreKey = fos.StoreKey INNER JOIN dbo.DimProduct AS dp ON dp.ProductKey = fos.ProductKey GROUP BY ds.StoreManager, dp.BrandName;
If you run this query in ContosoRetailDW, you will see a rather large execution plan. In this post I will focus on the right-hand side, where the three tables are read and joined. (The left-hand side is where the aggregation happens, using a local-global aggregation pattern that I already blogged about in 2016).
Below is what the relevant part of the execution plan looks like on my laptop. (I added the small numbers in red, so I can easily reference each individual operator; they correspond to the Node ID property of each operator).
Since this query is processing over 12 million rows from the FactOnlineSales table, joining them to the two dimension tables and then doing aggregation on them, you will be happy to see this run as a parallel plan. But are you also happy with the four “Parallelism (Repartition Streams)” operators in this plan segment? Do you know what they do and why they are there?
Here is how Books Online describes what Parallelism (Repartition Stream) does:
The Repartition Streams operator consumes multiple streams and produces multiple streams of records. The record contents and format are not changed.
Ummm, yeah. So basically, this operator swaps the rows around between threads in a parallel plan. Rows are in one thread and are now kicked onto another thread. But this does not really explain why this has to be done four times in between just five other operators that are doing the actual work. And considering that the estimated costs indicate that 20% of the total query effort is expended here, you probably want to know if this is really needed.
Step by step
Let’s dive in, try to understand how all the operators in this execution plan cooperate, and see if that explains the Parallelism operators. I’ll start at the top left of the plan section shown above: Hash Match #4. As you probably know, the Hash Match operator runs in multiple phases. The first phase, called “build phase”, runs when the operator is initialized. During the build phase, the operator requests rows from its build input (shown on top in the graphical execution plan) and stores them in memory, in a hash table. However, before rows are even returned to this operator, a lot of things have already happened. So let’s first look at that.
Clustered Index Scan #6
To understand the logic of the build input, I’ll follow the data flow. It starts with Clustered Index Scan #6. By itself, this operator is not really that interesting. It simply reads all rows from the DimStore table and passes them to its parent operator. And, as always, this does not happen by itself, it is pull-based so it will return a row only when asked.
What’s different this time is that this operator is running in a parallel section of the execution plan. And the scan operator is “parallelism-aware”, which means that it changes its behavior when running in a parallel section. (Note that most operators are not parallelism-aware; scan operators, as well as seek operators that do range scans, are really the exception.)
To understand the behavior change, let’s first recall what a scan operator does in a serial plan. When called it returns the next row from the page it is currently working on. When it has processed all rows from a page and is requested to return another row, it reaches out to the storage engine to request the next page, from which it can then return a row. The calls to the operator are once per row, but the operator itself reaches out to the storage engine just once per page.
In a parallel plan, there are multiple copies of this same Clustered Index Scan operator, all working at the same time. Each of them channels their requests to the Storage Engine, and because this is a parallel plan these requests now go through the so-called Parallel Page Supplier. The Parallel Page Supplier simply always hands the next available page to the first thread that requests a new page. Putting this together, the calls to the scan operator are still once per row, but now on each of the individual threads. The calls to the storage engine are still once per page, coming from multiple threads in parallel. And the Parallel Page Supplier coordinates between these threads to ensure that every page is eventually handled by exactly one of the threads.
This effectively results in a demand driven page-by-page distribution of rows among threads. Because it is demand driven, it is impossible to predict what rows will go to which thread, so the distribution of rows among threads can be considered as “sort of random”. The screenshot I included shows how many rows were on each thread when I executed the query. But if I run the query again, the number of rows per thread will be different.
Parallelism #5 – shuffling the rows
This random distribution of data over threads is rarely acceptable. The optimizer usually needs to exert more control over how rows are distributed among the available threads. That’s what Parallelism operator #5 in this execution plan is for. As explained above, this operator reads rows from all the input threads and returns them on the “correct” thread.
Which thread is correct is defined in the plan properties. In this case, the Partitioning Type shows that a hash-based distribution is used, and the Partition Columns property shows that the hash function uses the StoreKey column from the DimStore table as its input. In most cases where an execution plan uses hashing it is impossible to find out any specifics about the hash function used. The Parallelism operator is an exception to that, though. Since the result of the hash function determines what thread the row goes on, and since this query runs on eight threads, this operator has to use a hash function with a range of eight values. On your system, the hash function’s range will be determined by your degree of parallelism.
This operator changes the distribution of rows from (semi-)random to deterministic. No matter how often I run the query on my laptop, the distribution of rows among the eight threads will always be the same. That by itself though is not sufficient reason to introduce this operator. After all, why would I care which of my CPUs processes each individual row, as long as the end result is the same? I have seen situations where a Parallelism (Repartition Streams) operator was added for the sole purpose of flattening out a very skewed distribution of rows between threads, but that is not the case here.
Hash Match #4
To recapture the explanation so far, I have described how Hash Match #4, during its initialization, repeatedly requests rows from its build input. This causes the Clustered Index Scan to read rows from the DimStores table. These are initially assigned to threads in a demand-driven way, but the Parallelism then forces each row to a specific thread as determined by the hashing the StoreKey value, for a reason that should become clear later.
The Hash Match operator itself stores all the rows it receives from its build input in memory, in a hash table. The Hash Keys Probe property shows that a hash function on the DimStore.StoreKey column is used for this. When the build phase is completed, all stores retrieved from the build input are stored in the hash table but no rows are returned yet.
It is important to note that, even though Hash Match #4 and Parallelism #5 both use a hash function on the same column, it is not the same hash function. The Hash Match operator prefers a hash function with a rather large range, in order to minimize the chance of hash collisions. The hash function used in Parallelism #5, with a range of just eight values, would never be chosen for a Hash Match.
The second phase, called “probe phase” then starts when the Hash Match operator receives its first GetNext call, typically immediately after the Initialize call returns. During the probe phase, the operator requests rows from its build input (the bottom input), applies then applies the same hash function as in its build phase but now to the value of FactOnlineSales.StoreKey (as seen in the Hash Keys Probe property). It then goes to the bucket identified by the hash and looks for matching rows in that bucket only. Matching rows are returned; when no more matches are found the next row from the probe input is tried. (And as always, control returns after returning a row, and resumes where it left on the next call).
The same StoreKey value, regardless of which table it is coming from, always results in the same hash value. So the algorithm only has to search for matches in that single bucket. That means it has to look at a few rows only – the “return on investment” for the (rather expensive) creation of the hash table. In most cases the Hash Match has to do an additional test on the join predicate, found in the “Probe Residual” property, to prevent bad matches being returned due to hash collisions. In this specific case there is no Probe Residual because, for integer columns such as StoreKey, the optimizer can find hash functions that are 100% collision free.
However, the description above is for a Hash Match operator in a serial plan. In this case the plan is running in parallel. Does this change how the Hash Match operator works? The answer to that question is “no”. Like most other operators, Hash Match is not “parallelism aware”. So it does not care or even know, that there are seven other copies of this operator, on the other seven threads, doing exactly the same work. Each of the eight instance of this operator simply creates its own hash table, fills it with the data it receives from the its own build input, then tries to find matches in it for the rows it receives from its own probe input thread. And it never sees any of the data handled by the other threads.
At this point you might be wondering how this can return correct results. Let’s say that a row from the FactOnlineSales table with StoreKey 185 is being processed on thread 3. The Hash Match operator that is running on thread 3 will read this row during its probe phase, but it will only find a match if the corresponding row from DimStore was also processed on thread 3. If that row happens to be processed on thread 5, then the Hash Match operator on thread 3 is blissfully unaware of its existence and would reject the probe input as not matching.
Since there are eight threads available, it appears that each match now only has a 12.5% chance of being detected. Obviously, since there are no headlines in the media about hugely incorrect results being produced by SQL Server, the optimizer has found a way to increase this chance to the 100% it should be. But we will if we look at one more operator: the Parallelism operator on the probe input.
Parallelism #7 – more shuffling
It is at this point not relevant how exactly the input for Parallelism operator #7 is produced. I’ll just start at Hash Match #8. Like most of the plan, this operator runs in parallel. And apparently the optimizer is unhappy with how the rows are distributed over the threads, so it has inserted the Parallelism operator to repartition the streams as it sees fit.
As seen in the properties, this Parallelism operator also uses hashing to determine the proper thread for each row, just as its brother #5. And this hashing function also uses a StoreKey as its input, but in this case it is the StoreKey from the FactOnlineSales table.
What is important to know (and you really need to know this, as this is not exposed in any way in the execution plan) is that this Parallelism operator uses the exact same hashing function as its brother, Parallelism #5. This fact is key to understanding what is happening here.
Going back to the example above, I used an example row from FactOnlineSales that has StoreKey 185 and that “happens to be” processed on thread 3. But it is now clear this thread assignment is not just coincidence. When Hash Match #4 asks for a row from its probe input, it is first being reassigned to a thread (in reality Parallelism operators are more complicated then this suggests, but that is out of scope for this post), and this thread is determined by hashing the StoreKey value. This means that all rows from FactOnlineSales that have StoreKey 185 are guaranteed to all go to thread 3. And, even more relevant: the row from DimStore that has StoreKey 185 was also shuffled to another thread, in Parallelism #5. But that operator used the same hash function, which we already know results in 3 when the input is 185, so this row from DimStore is on thread 3 as well.
Back to Hash Match #4
In other words: two parallelism operators (#5 and #7) both shuffle rows between threads. But they don’t shuffle at random. For rows from DimStore, the thread is determined, using some complex function, by their StoreKey. And for rows from FactOnlineSales (or rather, from the results of joining FactOnlineSales and DimProduct), the thread is determined by applying the exact same function to FactOnlineSales.StoreKey. As a result, it is 100% sure that all rows that can match based on the join condition between FactOnlineSales and DimStore are allocated to the same thread.
The Hash Match operator can just do its own thing, in its own world, without having to worry about other instances on other threads. The optimizer has guaranteed, by inserting some additional operators, that Hash Match will always find the rows it needs on its own thread.
The rest of the plan
In the explanation above I have skipped past everything that happens in Hash Match #8 and the operators it calls. But you should now see that a very similar pattern is used here. The build input of Hash Match #8 reads rows from DimProduct, then uses a hash function on ProductKey to assign them to a known thread. In that thread it is added to the hash table by Hash Match #8. The probe phase then reads FactOnlineSales and reshuffles the rows based on the ProductKey in this row, to ensure that these rows are on the same thread where a matching DimProduct row would be. The Hash Match can then find, join, and return the matching rows.
The results of Hash Match #8 will still be on the thread that is determined by hashing ProductKey (from either table, as they are the same in the joined results). There is no known relationship between ProductKey and StoreKey, so it is unlikely that this thread is the correct thread for Hash Match #4. That’s why Paralellism #7 is needed.
As should be clear now, all parallelism operators are actually needed. Parallelism operators #9 and #11 are needed because both their inputs are distributed semi-random, and Hash Match #6 needs its inputs to be distributed based on ProductKey. Parallelism operators #5 and #7 are needed because one of them has a semi-random imput distribution and the other has an input distributed based on Product Key, but Hash Match really needs both inputs to be distributed based on StoreKey. Leave out just one of these parallelism operators, and the execution plan would produce incorrect results.
The parallelism operators do contribute to the total cost of the execution plan. But that is not unusual in parallel plans. Coordinating the work always takes extra effort. The total CPU usage of a parallel execution plan is almost always higher than the total CPU cost of an equivalent serial plan. But because more threads work on it concurrently, the elapsed time does go down and the query returns quicker. The query shown in this post uses over 16.4 seconds of CPU time on my laptop but completes in less than 4 seconds. When I force it to go serial, the CPU usage goes down to 9.5 seconds, but the execution time is now also 9.5 seconds.
The overhead of parallelism is not always as explicitly visible as here. But it does always exist and you should be aware of it and make a conscious choice. Is this query important enough that having it finish in 4 instead of 9.5 seconds is worth the extra resource usage (that will hurt all other concurrent workloads)? Or do you value overall performance more, and decide to take the hit on this query? The correct action depends on who executes this query, how often, for what reason, at what time, and what else is running at that time. This is the part of our job where we really need to understand the business!
As promised in the introduction, plansplaining part five will finally look at the Bitmap operator that I already promised for this episode. However, I still like to see your input too! If you have ever run into a pattern in an execution plan that appeared to make no sense until you dug into the gory details, or even into a pattern that still makes no sense to you, let me know and I will consider it for inclusion in a future post. (When possible, please make sure that you can provide either a repro script or a copy of the actual execution plan, saved as a .sqlplan file).
Commenting on my own blog post … because I can.
Shortly after this post went live I received an email from Adam Machanic. He pointed out that I was not 100% accurate on how the Parallel Page Supplier works.
I wrote: “The Parallel Page Supplier simply always hands the next available page to the first thread that requests a new page”. That is correct for “small” tables. For large tables, though, the Parallel Page Supplier can hand multiple pages to a thread at once. How many exactly, and where the cutoff points are – well, this is where it gets messy and complicated.
Thanks, Adam, for increasing my understanding!