
Plansplaining, part 10. Just passing through

Welcome to part ten of the plansplaining series. Each of these posts takes an execution plan with an interesting pattern, and details exactly how that plan (or pattern) works.

In this post we will look at a query and execution plan that may appear perfectly normal and expected at first sight, but that has some perhaps confusing execution counts.

Sample query

The sample query below will (as almost always) run in all versions of the AdventureWorks sample database. It returns a list of all staff, and for people that have a sales-related job title it adds the total of sales they made.
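
The query must have looked roughly like the sketch below. The exact select list and the use of TotalDue (rather than, say, SubTotal) are assumptions on my part, but the overall shape follows from the plan discussed in the rest of this post:

    SELECT     p.FirstName,
               p.LastName,
               e.JobTitle,
               CASE WHEN e.JobTitle LIKE '%Sales%'
                    THEN (SELECT SUM(soh.TotalDue)
                          FROM   Sales.SalesOrderHeader AS soh
                          WHERE  soh.SalesPersonID = e.BusinessEntityID)
               END AS TotalSold
    FROM       HumanResources.Employee AS e
    INNER JOIN Person.Person AS p
          ON   p.BusinessEntityID = e.BusinessEntityID;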

The execution plan for this query (captured on SQL Server 2017) is shown below; as always I added numbers (equal to the NodeID of each operator) to the screenshot, to make it easier to refer to operators within this plan.

At first sight, this execution plan is not very surprising. At the top right, we see a scan on the Employee table that drives a Nested Loops join into a seek on Person – this is a very simple and direct implementation of the join in the query. The result then goes through another Nested Loops, into a branch that reads SalesOrderHeader, aggregates it, and then does a computation – this obviously has to be the subquery within the CASE expression. Finally, Compute Scalar #0 probably does the logical evaluation of the CASE. There, without even looking at a single property I already have a fair idea what this execution plan does.

The index spool

The above explanation doesn’t mention Index Spool #8. The most common cause for Index Spool operators in a plan is when SQL Server wants an index really badly. The SalesOrderHeader table does have an index on the SalesPersonID column, but it is not covering for this query, so the optimizer was faced with the choice between either using that index and accepting the cost of a lookup, or using a scan … and then it decided to go for door number three, the Index Spool that basically builds the index on the spot and throws it away when the query is done.

That explanation is indeed correct for this case. But it’s not the whole story as you will see when you read on. But let’s first look at one other oddity in this plan.

The missing operator

When the optimizer assigns each operator a NodeID value, it always works left to right and top to bottom, always starts at zero, and always increments by one. And yet, the execution plan above appears to have a gap. There is no operator with NodeID 3. And that is always a sign that something interesting is going on. It means that an operator was removed from the plan in a final cleanup phase, after the plan had already been chosen.

In this case, the reason for the missing operator can be found in the properties of Nested Loops #2. The missing operator #3 is related to prefetching, an optimization used to improve the performance of physical I/O on the inner (bottom) input of the Nested Loops operator. SQL Server implements two types of prefetching: ordered and unordered. Neither of these results in a fixed order being imposed on the data; however, ordered prefetching does guarantee that the order of rows from the upper (top) input is preserved, whereas unordered prefetching might change the order of rows.

The WithOrderedPrefetch and WithUnorderedPrefetch properties of a Nested Loops join only affect performance if the data to be read from the inner input is not yet in the buffer pool. Instead of waiting for the Nested Loops operator to activate its inner input, then request data, and then fetch it into the buffer pool, data flowing into the outer input of the Nested Loops operator is inspected in advance to check whether it will result in physical I/O. If it does, then SQL Server will already start an asynchronous I/O request, in the hope that the read will have completed and the data will already be in the buffer pool by the time it is actually needed.

When SQL Server adds prefetching to a Nested Loops operator, it adds a special “read ahead logic” operator to the execution plan. This operator is positioned at the outer input of the Nested Loops, so in this case between Nested Loops #2 and Clustered Index Scan #4. This special operator stays there until after the plan selection has been made; at that time a final cleanup phase folds this special operator into the Nested Loops operator, leaving a gap in the otherwise consecutive NodeID numbering. So that’s why there is no operator #3 in this execution plan.

If you want to learn more about prefetching, I recommend reading Craig Freedman’s blog post, Paul White’s blog post, and Fabiano Amorim’s article. The rest of this blog post will focus on other areas of this execution plan.

How many executions?

If you look at all the various counters in the execution plan, you might see some unexpected numbers. The screenshot to the right shows the data flowing out of Nested Loops #2 into the outer input of Nested Loops #1. The Actual Number of Rows is 290. So naturally, you would expect the operators on the inner input of Nested Loops #1 to execute a total of 290 times, right?

Turns out … you’d be wrong!

If you look at the far right, at Clustered Index Scan #9, you will see that the Number of Executions is 1, and that this matches the Estimated Number of Executions. This part is not the surprise. Clustered Index Scan #9 is called by Index Spool #8, an eager spool. We know that this spool on its first execution will fetch all data from Clustered Index Scan #9, store it in a worktable with an index to support its Seek Predicate, and then on all later executions it will fetch data from that worktable without invoking Clustered Index Scan #9 again. So the single execution of this operator is no surprise.

But let’s now look at Index Spool #8 itself. As you can see in the screenshot to the right, this operator was estimated to execute 278.482 times, which matches the Estimated Number of Rows in the outer input of Nested Loops #1. But … the actual execution count of this Index Spool is not the expected 290; we only see a mere 18 executions! What happened to the other executions?

The first step is a quick reality check. Yes, I did not look in the wrong place, Nested Loops #1 really receives 290 rows on its outer input. It also returns 290 rows to its parent (Compute Scalar #0), and I can see that the final result set of the query has the full 290 rows.

The next step is to look where the number changed. The properties of Stream Aggregate #7 show the same numbers for estimated and actual number of executions; the properties of Compute Scalar #6 also show the same estimate, and include no actual count at all (not unusual for a Compute Scalar; an in-depth explanation of this curious behavior is probably better left for a future post). For now, it appears that Nested Loops #1 is the root cause. Even though it processes and returns 290 rows, it only actually executes its inner input 18 times. Let’s look in a bit more detail.

Nested Loops #1

Hovering the mouse over Nested Loops #1 shows a property that is not often seen on Nested Loops operators. In fact, many people have never seen this property and are unaware of its existence. It is the Pass Through property, a property I have so far only seen on Nested Loops operators (and only with Logical Operation Inner Join or Left Outer Join).

When the Pass Through property is present, the Nested Loops operator will, for each row it reads from the outer input, first evaluate the condition listed in this property. If the condition evaluates to true, then it will, as the name of the property implies, just pass its input row on, unchanged, to the parent operator. On the next call, it will then immediately move on to the next row from its outer input.

So simplified, you could say that Pass Through means: if this condition is true, then you can just skip all the join logic for this row and pass it through, unchanged.

For the columns in the Output List that take their values from the inner input, no value can be given if the join was skipped due to the Pass Through property. I don’t know if they are always set to NULL in that case or if they are simply left at whatever they are and the rest of the plan makes sure to never look at that data.

In this case, the Pass Through property uses a function called “IsFalseOrNull”. This function is only available internally, within execution plans (I personally would not mind having it available in T-SQL as well, though!). It is intended to help in dealing with three-valued logic. The net effect of the entire condition is to check whether the JobTitle column includes the word “Sales”. If so, then that condition is true, which means that IsFalseOrNull returns False, which in turn means that the pass through logic is not invoked and the inner input of the Nested Loops operator executes as normal. However, when JobTitle does not include the word Sales, or when it is NULL, then the LIKE condition is either False or Unknown, and then IsFalseOrNull evaluates to True, which means that the pass through condition kicks in and the inner input is skipped.
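
IsFalseOrNull itself cannot be used in T-SQL, but its effect on three-valued logic is easy to mimic with a CASE expression. The sketch below (with a hypothetical @JobTitle variable) is just an illustration of the logic, not the actual internal implementation:

    DECLARE @JobTitle nvarchar(50) = NULL;

    -- IsFalseOrNull(<condition>) is true when <condition> is False or Unknown
    SELECT CASE WHEN @JobTitle LIKE '%Sales%'
                THEN 0   -- condition True: pass through not invoked, inner input runs
                ELSE 1   -- condition False or Unknown: pass through kicks in, inner input skipped
           END AS IsFalseOrNull;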

If you look at the data produced by joining the Employee and the Person tables, you will notice that there are 290 rows total, but only 18 have a JobTitle that includes the word Sales. The Pass Through condition kicked in on the remaining 272 rows, and that is why the operators on the inner input of Nested Loops #1 executed only 18 times instead of the expected 290 executions.

We all love better performance!

The Pass Through expression in this execution plan is clearly intended to improve performance. And at first sight that is exactly what it does. If you remove this property and leave the rest of the plan unchanged, then Compute Scalar #6, Stream Aggregate #7, and Index Spool #8 will all execute 272 additional times. And each of those times they will produce a result that then subsequently is ignored – because of the CASE expression in the query, Compute Scalar #0 only actually does something with the value returned from these operators if “Sales” is included somewhere in the JobTitle, which is only the case for the 18 rows where the Pass Through property did not prevent execution of this branch.

And obviously, the reason that the optimizer even included this Pass Through property is that it, too, understood the significance of the CASE expression. The optimizer knew that the value returned by this branch will only be used in some cases, and hence needs not be computed at all in other cases. (What I do not understand, though, is why the Estimated Number of Executions for the operators in this branch do not reflect that understanding…)

But you might ask why the CASE expression is even there at all. It is of course possible that it is needed for correctness. Perhaps the application does sometimes allow a sale to be registered to an employee who does not have “Sales” as part of their job title. If we definitely do NOT want to see those sales included in the report, then the query definitely has to be written as it currently is. End of discussion.

But is it?

However, a query such as this is also very often the result of a developer looking at the query, looking at the data, looking at the business rules, and then deciding to help the optimizer a bit. In the case of the AdventureWorks sample database, all employees that have any sales to their name actually do have “Sales” in their job title. So for this database, the query could also have been written as shown below, and it would return the same result.
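
Under the same assumptions as the reconstruction of the first query, the rewritten version would simply drop the CASE expression:

    SELECT     p.FirstName,
               p.LastName,
               e.JobTitle,
               (SELECT SUM(soh.TotalDue)
                FROM   Sales.SalesOrderHeader AS soh
                WHERE  soh.SalesPersonID = e.BusinessEntityID) AS TotalSold
    FROM       HumanResources.Employee AS e
    INNER JOIN Person.Person AS p
          ON   p.BusinessEntityID = e.BusinessEntityID;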

Perhaps the query started like this. And then, perhaps, there was a developer looking at this query and deciding “You know what, SQL Server is going to tally up SalesOrderHeader for all staff members, even though most of them are not in sales and will therefore not have any sales in there. I’ll just step in and add a CASE expression to help SQL Server skip the executions for all those non-sales folk!”

Now before I go on, I am actually happy with these developers. I really prefer developers that give thought to performance over those that don’t care. However, in this case the developer was wrong. Good intentions, but bad results. And when I find the original query in my production code, I’ll find the developer, compliment them on their attention to detail, and then remind them that they should definitely also test.

This is the execution plan I got when running the version of the query without the CASE expression. The logic to join Employee and Person is unchanged, but the logic to compute the total sold for each employee in a sales role has changed. Instead of a Nested Loops with multiple passes, we now get a Merge Join with a single pass, and with early aggregation on a Hash Match. And the Index Spool has been completely removed.

I won’t go into the details of this execution plan. But if you compare the two versions of the query by running them with SET STATISTICS IO ON and SET STATISTICS TIME ON, you will see that both queries have the same amount of I/O on the base tables, but the first query also has a lot of logical reads on a worktable, due to the Index Spool. The second query does have both a worktable and a workfile – these are there in case the Sort or the Hash Match operators need to spill. But neither does spill, so there is no actual I/O on those.
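
The comparison setup itself is nothing more than this:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    GO
    -- Run both versions of the query here, then compare the logical reads
    -- and the CPU and elapsed times reported on the Messages tab.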

Mainly as a result of all the additional I/O on that worktable, the original query (with a CASE expression in the query, and with a Pass Through expression and an Index Spool operator in the execution plan) also takes more time – 114 ms vs 46 ms on my laptop.

However, do not forget that all this is based on one very important assumption: that given the way the application works, we can be sure that the two versions of the query always return the same data. Based on the query alone that is not guaranteed, so unless the application gives us additional guarantees the two queries are not equal, and you should always opt for the one that returns correct results, regardless of performance.

Conclusion

The main topic of this post is the Pass Through property. It is not a very widely known property, but it can be important to understand how an execution plan really works. After all, you would not want to invest a lot of effort to tune a branch of an execution plan only to find out later that it only executes for a fraction of the rows.

However, we also saw an example of what people might call “premature optimization”: a change to a query intended to help the optimizer but insufficiently verified as to whether it really meets that intended goal. Query optimization is complex, and changes that appear to be beneficial for performance do not always result in an execution plan that is always faster. And it is also very important to realize that in this case the two queries were only the same based on the data currently in the database; unless there are constraints in place (preferably in the database, but sometimes enforced by the application) to ensure that this is always the case, the two queries might return different results and only one of them should be considered correct.

Finally, we had a short discussion on Nested Loops prefetching and how that results in a gap in the NodeID numbers of the operators in an execution plan.

That concludes part 10 of the plansplaining series. I still have a few ideas left on my list for future episodes, but it is growing shorter – so please, help me out if you can! If you see an unusual and interesting pattern in an execution plan, let me know so I can add it to my list!

Plansplaining, part 9. Recursive CTEs

I had to skip three months, but finally it is here: part nine of the plansplaining series. Each of these posts takes an execution plan with an interesting pattern, and details exactly how that plan works.

I am pretty sure that (almost) everyone reading this blog knows that a CTE (Common Table Expression) is an independent subquery that can be named and then referenced (multiple times if needed) in the main query. This makes CTEs an invaluable tool to increase the readability of complex queries. Almost everything we can do with a CTE can equally well be done by using subqueries directly in the query but at the cost of losing readability.

However, there is also one feature that is actually unique to CTEs: recursion. This blog post is not about what this feature is, what it does, or how to write the code. A brief description is given as part of the complete explanation of CTEs in Books Online, and many bloggers have written about recursive CTEs as well.

For this post, I’ll assume the reader knows what a recursive CTE is, and jump into the execution plan to learn how this recursion is actually implemented.

Sample query

The query below is modified from one of the examples in Books Online. This query can be executed in any version of the AdventureWorks sample database. It returns the exploded Bill of Materials for the “Road-550-W Yellow, 44” bicycle (which has ProductID 800).
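
The query must have been close to the sketch below. The CTE name and the exact placement of the filters are assumptions on my part; the anchor on ProductAssemblyID 800, the Lvl column starting at 1, and the EndDate filter all follow from the plan discussed in the rest of this post:

    WITH BoM AS
    (
        SELECT b.ProductAssemblyID, b.ComponentID, b.PerAssemblyQty,
               1 AS Lvl
        FROM   Production.BillOfMaterials AS b
        WHERE  b.ProductAssemblyID = 800
        UNION ALL
        SELECT b.ProductAssemblyID, b.ComponentID, b.PerAssemblyQty,
               p.Lvl + 1
        FROM   Production.BillOfMaterials AS b
        INNER JOIN BoM AS p
              ON b.ProductAssemblyID = p.ComponentID
        WHERE  b.EndDate IS NOT NULL
    )
    SELECT ProductAssemblyID, ComponentID, PerAssemblyQty, Lvl
    FROM   BoM;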

The execution plan for this query is quite complex, as you can see below. (As in earlier episodes, I have added the NodeID numbers into the execution plan to make it easier to reference individual operators).

This execution plan does several things that are fairly easy to understand. And then there is some other, much more interesting stuff to make the magic of recursion happen. To prevent this post from becoming too long, I will describe the simpler parts very briefly. Feel free to run this query on your own system to dive deeper in the parts I skip. Make sure you understand them, and don’t hesitate to leave a question in the comments section if anything is unclear to you!

The top line

As always, execution of the query starts at the top left, with operators calling their child operator (from the first input in the case of Concatenation #1), until Clustered Index Seek #4 is invoked. The properties of this operator show that it uses an index on (ProductAssemblyID, ComponentID, StartDate) to quickly find all rows that match the predicate of the anchor query. It processes them in order of ascending ComponentID, returning the three columns from this table that are used in the query.

As always, it is important to remember that an operator never just returns all rows; it waits until it is called, does the work needed to produce and return the next row, and then saves state and halts until it is called again. In this case we will see that the operators on the top row of the execution plan keep calling each other until all matching rows are processed, so this detail is not especially relevant here, but it’s still important to always be aware of it!

Compute Scalar #3 receives rows from Clustered Index Seek #4. It adds one extra column, named Expr1001, that is always set to 1 (one). We will later see that this Expr1001 column will eventually be returned to the client as the Lvl column; the fixed value of 1 in this operator confirms what we already suspected, that the top line evaluates the anchor query.

Compute Scalar #2 then adds yet another column, Expr1012, that is 0 (zero). We will later see what this column is used for.

The rows are then returned to Concatenation #1, which implements the UNION ALL expression in the query. Like any Concatenation operator, it simply reads rows from its first input and returns them unchanged. Once the first input is exhausted it moves to the second input (which I will discuss further down). Concatenation operators can have more than two inputs, but in this execution plan it has just two.

A side effect of a Concatenation operator in an execution plan is that it renames all columns, which can make it harder to follow what is going on. I often end up writing down the column names in a file on a second monitor, or just on a simple scrap of paper. In this case, the most important thing to remember is that the “new” column Expr1015 refers to the original Expr1012 (from Compute Scalar #2) for rows from the first input, and Expr1014 for the second input.

The other four columns, Recr1008 to Recr1011, correspond to the four columns returned by the query.

Index Spool #0

The final operator on the top line of the execution plan is Index Spool #0. Looking at its properties we see that the With Stack property is True, which means that this is actually a “stack spool”. As you can see in the description of this processing type on the Index Spool page in the Execution Plan Reference, this means that it uses a worktable indexed by the first column in the Column List, which supposedly represents the recursion level. This is Expr1015, originating from Expr1012 and always zero for the top line – this bit will get more interesting later on. For now, because all rows have the same value the engine will add a uniqueifier to each row in the worktable to ensure uniqueness and to guarantee that rows can be retrieved in the correct order.

The logical operation is Lazy Spool; this means that it returns its input rows immediately, after storing them in the worktable.

The Output List property says that all five columns returned by Concatenation #1 are stored in the worktable and returned to the calling operator. However, we know from the query and its results that only the four columns Recr1008 to Recr1011 are actually returned to the client; Expr1015 is not, and is apparently filtered out in the top-left SELECT operator, though this is not exposed in the execution plan.

This construction is due to a limitation of the Index Spool operator: the Output List property defines both the structure of the worktable and the columns returned. In this case the Expr1015 column is not needed by any parent of the operator and hence doesn’t need to be in the output. But it IS needed in the worktable (we’ll see why further down), and hence needs to be in the Output List property.

Top line condensed

Taking all the operators of the top line together, we can see that each call goes all the way from SELECT to Clustered Index Seek #4, which reads and returns a row that then propagates all the way back to SELECT and to the client. This process repeats a total of 14 times, reading and returning the 14-row result set of the anchor part of the recursive CTE. If we could pause execution at this time we would see a result set as shown to the right.

The worktable created by Index Spool #0 contains these same rows at this point, but with two additional columns: one for Expr1015 (equal to zero for all these rows) and one with a four-byte uniqueifier on all rows except the first, to ensure that the clustered index of the worktable keeps these rows in place. (See this article for more details on the exact internal storage of the various spool types).

After returning the 14th row the execution once again propagates all the way through to Clustered Index Seek #4, but this time it returns an “end of data” signal because no more rows satisfy the conditions in its Seek Predicates and Predicate properties. The two Compute Scalar operators simply relay this “end of data” signal to their parents. Once Concatenation #1 receives it, it responds by moving to its next input, calling Assert #5.

The bottom part

Most operators, when called to return a row, start by calling their own child operator for a row. The operators on the second input of Concatenation #1 are no exception, so once Assert #5 is called the first time, control immediately passes through Nested Loops #6 and Compute Scalar #7 to Table Spool #8, the second spool operator in this plan.

Table Spool #8

Since this Table Spool has no inputs, it must work as a consumer. This is confirmed by the Primary Node ID property, which points to Index Spool #0 as the builder of the worktable that is read by this Table Spool. You can also see that the With Stack property is set to true (as expected – a “stack spool” always consists of a builder Index Spool combined with a consumer Table Spool).

The behavior of the Table Spool part of a stack spool combination is to remove the previously read row from the worktable and then read the row with the highest key value. Remember that the key is on the Expr1015 column which is supposed to be the recursion level, and that the system uses an internal uniqueifier to distinguish between duplicates which means that within each Expr1015 value the last added row will have the highest uniqueifier. So by reading the highest key value, the operator effectively reads the most recently added row within the highest recursion level.
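
Conceptually (and only conceptually – the worktable, its key column, and the uniqueifier are internal structures, so all names below are hypothetical), each call to this Table Spool behaves like this:

    -- 1. Remove the row returned by the previous call (nothing to do on the first call)
    DELETE FROM #worktable
    WHERE  RecursionLevel = @prev_level
    AND    Uniqueifier    = @prev_uniq;

    -- 2. Read the row with the highest key value: the highest recursion level,
    --    and within that level the most recently added row
    SELECT TOP (1) *
    FROM   #worktable
    ORDER BY RecursionLevel DESC, Uniqueifier DESC;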

You may also note that the column names in the Output List property are different from the column names in the Output List of Index Spool #0, where the worktable is created. This is normal behavior. The columns are matched by order. So for example, the third column listed in the Output List of Index Spool #0 is Recr1009 (which was the name assigned by Concatenation #1 to the ComponentID column from both the first and the second input). The third column listed in the Output List of Table Spool #8 is Recr1004, so now Recr1004 contains the ComponentID value.

For this first call, there is no previously read row so there is nothing to delete. The operator reads the row with the highest key value – all rows currently in the worktable have Expr1015 equal to zero, so the last added row will be read and returned to the calling operator. This is the row with ComponentID 994.

The join logic

Table Spool #8 passes this row to its parent, Compute Scalar #7. Its Defined Values property shows that this operator computes a new expression, Expr1014, and sets its value to Expr1013 + 1. Since Expr1013 was 0, Expr1014 will be 1.

The modified row is returned to the parent operator (Nested Loops #6). This operator uses the Outer References property to push several values from its outer (top) input into the inner (bottom) input. These values are then used by Clustered Index Seek #10 to quickly find rows with ProductAssemblyID equal to Recr1004, which we already saw to be the new name for the column holding the ComponentID. Clustered Index Seek #10 also has a residual Predicate property for the additional filtering on EndDate IS NOT NULL. Based on my knowledge of the data, I expect that for the current Recr1004 value of 994 this Clustered Index Seek #10 will find two rows; on the first call it will return the first of these two matches, a row with ProductAssemblyID 994, ComponentID 3, and PerAssemblyQty 10.00.

This row is passed through Compute Scalar #9, where Expr1007 is computed as Recr1006 + 1; if you are following along with a full mapping of all column names you will realize that this is the computation for the Lvl column in the query. Nested Loops #6 then combines this row with the current row from its outer input and returns the result (to be precise, only the columns that are listed in its Output List property) to its parent operator.

Assert #5

An Assert operator is used in execution plans to check for conditions that require the query to be aborted with an error message. Based on the query you may not have expected to see an Assert in this execution plan, since the query includes nothing that calls for a potential runtime error.

Let’s check the properties of this operator to see if we can find out why it’s there. As you see, it does a simple check on the Expr1014 column: if this column is larger than 100, the query will end with a runtime error. In the current row this column has the value 1, so processing continues. But seeing this condition should help you realize why it’s there: not because of something I included in the query, but because of something I omitted. If you are familiar with recursive CTEs, you will know that by default they support a maximum recursion level of 100. This is to protect us from ourselves – without it, a coding error that results in infinite recursion would cause the query to actually run forever, using more and more resources along the way. The maximum recursion level can be changed, or removed (at your own peril!). If you add the query hint OPTION (MAXRECURSION 0) to the query, the protection is disabled and the Assert operator is removed from the execution plan.
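
Applied to the reconstructed sample query from earlier, that hint would look as follows (MAXRECURSION accepts values from 0 through 32767, where 0 disables the check entirely):

    WITH BoM AS
    (
        -- ... same CTE body as in the sample query above ...
    )
    SELECT ProductAssemblyID, ComponentID, PerAssemblyQty, Lvl
    FROM   BoM
    OPTION (MAXRECURSION 0);  -- no recursion guard, and no Assert operator in the plan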

The place of this operator is very important. The check is not done immediately when the new recursion level is computed, but only after the rows are joined to matching rows from the BillOfMaterials table. This means that if Table Spool #8 reads a row that already has recursion level 100, Compute Scalar #7 will immediately compute the next recursion level at 101, but execution will not stop. First Nested Loops #6 will execute and look for matches. If there are no matches, then it will never return this row with Expr1014 = 101 to its parent operator, Assert #5, and execution will not end with a runtime error. However, when the Nested Loops operator does find a matching row, then the combined data is returned to Assert #5 and execution stops before this row is ever returned to the client (or even stored in the worktable).

Back to the top

The row that was just produced by the second input of Concatenation #1 is now passed through that operator to Index Spool #0, which adds it to the worktable and then passes it to the client. The fifteenth row of the result set is now added to the data in the results pane of SSMS. This row is in the worktable as well. Because Expr1015 was set to 1 for this row, it is added at the end of the worktable (based on the logical structure imposed by the clustered index – we obviously have no idea and do not care on which page within tempdb this row is stored).

Rinse and repeat

After processing the fifteenth row the operators once more call each other, and this time the call propagates to Nested Loops #6 quickly. But here the path changes. The last thing Nested Loops #6 did was request and return a row from its inner input. It will now do that again, so control is passed through Compute Scalar #9 to Clustered Index Seek #10. The second row with ProductAssemblyID 994 is now read and returned – this is a row with ComponentID equal to 525. This row is then processed exactly the same as the previous row. So at the end, this is the sixteenth row in both the results and the worktable.

If we were to pause execution at this point, the result set would look as shown to the right. The worktable would be equal, though with the addition of Expr1015 (0 for the first fourteen rows, 1 for the last two) and the hidden uniqueifier to ensure that the rows remain properly ordered.

For the 17th call, control will pass through the operators in the same way as before. But now Clustered Index Seek #10 finds no more rows to return; its “end of data” signal is relayed through Compute Scalar #9 to Nested Loops #6, which then requests the next row from its outer (top) input.

This call flows through Compute Scalar #7 to Table Spool #8. As explained above, the Table Spool of a stack spool combination first removes the row it previously read, row 14 in the picture above. After removing this row it then proceeds to fetch the row with the highest key value, which is row 16 in the screenshot above. For this row, Expr1014 is computed as 2 (because it was stored with Expr1015 equal to 1), and then the data is passed to Nested Loops #6. However, since there are no rows in the base table with ComponentID 525, the inner input of this Nested Loops operator returns nothing so Compute Scalar #7 and Table Spool #8 are invoked again.

Now row number 16 in the picture above is removed, since this is the row last processed. After that, the row with the highest key is row 15 in the screenshot, so now this row is passed to Nested Loops #6. And this time it will get matches from its inner (bottom) input, so now more rows will be produced, stored in the worktable (again at the end), and returned to the client.

The same process repeats, over and over again, until all 89 rows of the result set are produced. Over and over again, Table Spool #8 removes the row it read last from the worktable and then reads the row with the highest key value, which is always the row added last. And each time that row actually produces new results in the join, they are added, at the end of the worktable, by Index Spool #0.

Eventually, Table Spool #8 will read the last remaining row from the worktable. After that, either new matches are found and added to the worktable so the process can continue, or no new matches are found. In that last case, Nested Loops #6 will one final time try to get a row from its outer (top) input. Table Spool #8 will remove that final row that it just read, and then see no rows left for it to read and return. It will send an “end of data” signal which then propagates all the way through all operators to the top-left SELECT, and execution of the query stops.

Stack spool

Some of the readers might already be familiar with the concept of a stack. For those who are not, here is a short explanation.

If two elements within a system exchange data and the elements may sometimes produce more data than can be handled immediately, there are two basic structures for exchanging that information: queue-based or stack-based.

The queue-based system, also known as FIFO (“first in first out”) gets its name from the British concept of queueing politely, without elbowing in front of other people. The first to arrive at a bus stop will stand there, the next gets in line behind them, and when the bus arrives the passengers board in the order in which they arrived. A queue within information systems is used to ensure that data is processed in the same order it is produced.

The stack-based system, also known as LIFO (“last in first out”), gets its name from the concept of stacks of plates in a self-service restaurant. When a customer arrives, they’ll grab the top plate of the stack. When the dishwasher brings a new plate (or, more typically, a bunch of new plates), they are simply put on top of the stack. The next customer will grab the plate that was added most recently. A stack within information systems is used when the data that was added last needs to be processed and removed from the stack first.

Not quite a stack

The combination of an Index Spool and Table Spool, both with the With Stack property set to True, is called a stack spool. That is because it conceptually implements a stack that ports data produced in one part of the execution plan to another part, where it is processed. However, it is not a 100% true stack. A true stack implementation would remove a row as it is read, not after it has been processed and more rows have been added. The current behavior would be like a restaurant customer adding food to their plate while the dishwasher already stacks new plates upon it; the customer then has to get their plate (with the now squashed food) from within the stack, leaving the plates that were added later on top of it.

I am not sure why Microsoft chose to implement this the way it is. After all, it’s probably even easier to build a system that removes the row immediately. However, I do have two theories:

  1. Perhaps there are some scenarios with a very complex recursive query where the same row needs to be read multiple times from the same worktable. This is of course not possible if the row is already deleted. However, I have not been able to set up any such scenario; most of my attempts were rejected for violating the rules of recursive CTEs.
  2. The more likely explanation is that operators in execution plans are designed to be as efficient as possible, and that one such optimization is that operators that read data from a table (including worktables) do not actually copy the data from the buffer pool to working storage, but just use pointers to the location within the buffer pool. If this were done with data read from the worktable of a stack spool, then deleting that row as it is read would mark that area of the buffer pool as free for reuse – and now there is a risk of other processes overwriting the data when it is still needed.

Conclusion

Recursion may sound like a simple technique, but in reality it rarely is. In this blog post we looked at all the nuts and bolts that come together to implement recursion in the execution plan for a query with a recursive CTE.

The absolute worker bee is the combination of an Index Spool and a Table Spool, both with the With Stack property present and set to true. This combination is also known as a stack spool, because it implements a stack that allows rows to be stored in one place in the plan and then read and processed in another place in the plan, using the so-called “LIFO” (last in first out) method. The last row added will always be read and processed first; the first row added will be read last. Understanding this mechanism helps to explain things like, for instance, the order in which rows are returned from a recursive CTE. However, as always, the order of the results in a query is only ever guaranteed if an ORDER BY clause is used – so please don’t go and build logic that relies on the order described in this post. It is subject to change without notice if Microsoft decides to implement a different way to evaluate recursive queries.

But the role of the various Compute Scalar operators should also not be underestimated. In these, the recursion level is computed. It is zero for the anchor query, because no recursion is used in that part. For rows produced in the recursive part, it is set to the recursion level of the input row plus one. We saw this recursion level being used by an Assert operator, to ensure that execution stops when the maximum recursion level is exceeded. But the more important usage of the recursion level is that it defines the clustered index for the worktable. That’s why the computation of the recursion level will not be removed if you disable the maximum recursion check.

I hope to be able to continue this series. Whether the next episode will be ready within a month, I do not know. But I will work on it. However, I am starting to run out of ideas. There are still some notes on my shortlist but I do need reader input here as well! If you see an unusual and interesting pattern in an execution plan, let me know so I can perhaps use it in a future post!

Plansplaining, part 8. To join the impossible join

This is part eight of the plansplaining series. Each of these posts takes an execution plan with an interesting pattern, and details exactly how that plan works.

In this post we look at a query that, given the known limitations of the available join operators, appears to be impossible to execute. But it’s a legal query so it should run. And it does. But how?

Sample query

The query below can be executed in any version of the AdventureWorks sample database. Don’t bother understanding the logic, there is none. It is merely constructed to show how SQL Server handles what appears to be an impossible situation.
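
The query must have been something along these lines; the exact select list is an assumption, but the aliases, tables, and join predicate all follow from the discussion below:

    SELECT d1.Name,
           d2.GroupName
    FROM   HumanResources.Department AS d1
    FULL OUTER JOIN HumanResources.Department AS d2
          ON d2.DepartmentID > d1.DepartmentID;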

If you look at the descriptions of the various join operators in the Execution Plan Reference, you will see that this query presents the optimizer with what appears to be an unsolvable problem: none of the join operators can be used for this query!

The Nested Loops algorithm can only be used for inner join, left outer join, left semi join, or left anti semi join. A right outer join or right [anti] semi join in the query can still use Nested Loops if the optimizer can reverse the inputs. But this query calls for a full outer join, which is simply not supported by Nested Loops.

Both the Merge Join and the Hash Match algorithms require at least one equality predicate in the join criteria. This query has just a single join criterion, and it is an inequality. So these algorithms are not eligible for this query either. And Adaptive Join (not yet described in the Execution Plan Reference) chooses between the Nested Loops and Hash Match algorithms at runtime, so it suffers from the restrictions of both.

The query is legal syntax and should run. And it does indeed. Let’s look at the execution plan to find out how the optimizer solved this conundrum. (As usual, I have edited the NodeID numbers of each operator into the screenshot for easy reference).

Apparently, trying to make SQL Server do the impossible merely results in making SQL Server do a lot of extra work. Instead of two scans and a single join, as we would expect in most cases of a single two-table join query, we now see four scans, two joins, and some additional extra work. Time to break it down and see what’s going on!

Concatenation #0

The Concatenation operator has a quite simple operation. It has two or more inputs and a single output. All it does is request rows from its first input and pass them on unchanged; then, once the first input is exhausted, repeat the same for the second input; and so on until it has processed all of its inputs. In short, it has the same function in an execution plan that UNION ALL has in a T-SQL query, except that UNION ALL always concatenates the results of two queries, and Concatenation can concatenate as many inputs as needed.

In this case there are exactly two inputs, so we can immediately see that the execution plan consists of two independent sub-trees, and their results are concatenated and then returned to the client.

The top input

The first input to the Concatenation operator consists of three operators: Index Scan #2 which reads columns DepartmentID and Name from the table with alias d1; Clustered Index Scan #3 which reads columns DepartmentID and GroupName from the table with alias d2; and Nested Loops #1 to join them together, using logical join operation Left Outer Join.

I am not going to spend much time on this segment of the execution plan. It does exactly what it appears to do: read the two input tables and produce a left outer join result of the data. If you change FULL OUTER JOIN in the original query to LEFT OUTER JOIN, you will get an execution plan that looks exactly like this segment.

The observing reader might already have a clue at this time what is going on. Just think about the definition of what a full outer join should return – basically, the matching rows joined as normal (same as an inner join), plus the unmatched rows from the first table (with nulls for columns from the second table), plus the unmatched rows from the second table (with nulls for columns from the first table). A left outer join, as implemented here, returns exactly the first two items from this list.

The bottom input

Since the first input for Concatenation #0 already returns two of the three items from the list above, the second input should be expected to return the rest: rows from the second table (d2) that have no match in the first (d1). That sounds like a NOT EXISTS in T-SQL terminology, or a left anti semi join in execution plan terms.

Looking at the second input, you can indeed see elements of this expected left anti semi join, but there are a few extra operators as well. So let’s break this down in its parts.

Clustered Index Scan #6

This operator reads DepartmentID and GroupName from the table with alias d2. (This can be seen by reviewing its properties, specifically the Output List property – I decided not to include screenshots of all property lists in this post, merely the more interesting ones.)

No big surprises here.

Nested Loops #5

As you can see in the properties, this operator uses logical operation Left Anti Semi Join. This means that it looks for rows in the second input that match a specific condition; if it finds a match the row from the first input is discarded and the scan of the second input is aborted; if the scan completes without finding a match the row from the first input is passed. This is equivalent to using the NOT EXISTS keyword in a query.

The properties of this operator reveal that there is no Outer References property. This implies that the inner (bottom) input is “static” – it returns the same set of rows for every row from the outer (top) input. For each of these rows the Predicate property is evaluated to determine whether the row is a match or not.

This predicate reads (simplified) d2.DepartmentID > Expr1002. This is similar to the predicate in our query but d1.DepartmentID has been replaced by Expr1002. Why?

Clustered Index Scan #8

As expected, this operator reads the table with alias d1, returning only the DepartmentID column (which makes sense because for this part of the query the Name column from the d1 table is not needed).

The optimizer has set the Ordered property to false, so the storage engine is free to return the rows in any order it sees fit. (The same is the case for all other scan operators in this execution plan but here it is a bit more relevant as we will see later).

Stream Aggregate #7

The properties of this operator (no screenshot shown) reveal that there is no Group By property. As explained on the Stream Aggregate page in the Execution Plan Reference, this means that it functions as a scalar aggregate. It returns a single row with the aggregation result of its entire input.

What type of aggregation is used can be seen in the Defined Values property (see screenshot). As you can see, this is where the Expr1002 column we saw referenced in Nested Loops #5 originates from. Simplified by removing unneeded four-part naming, parentheses, and brackets, the expression reads: “Expr1002 = MIN(d1.DepartmentID)”.

Nested Loops #5 (again)

Now that we have seen both inputs and the properties of the join operator itself we can bring these elements together to understand how it works.

All rows from table d2 are read. For each of these, the bottom input returns a single row with the lowest value of d1.DepartmentID, which is then compared to d2.DepartmentID to determine whether or not the row from d2 should be returned.

Since this is a left anti semi join, with a predicate d2.DepartmentID > d1.DepartmentID, the transformation to d2.DepartmentID > MIN(d1.DepartmentID) is indeed correct; it does not change the results in any way. It only affects performance. For the query as written, the bottom input of Nested Loops #5 (which executes 16 times) reads 16 rows, finds the minimum value, and then compares that single value to d2.DepartmentID in the join operator. Without the Stream Aggregate operator, it would instead read the same 16 rows, return all of them to the join operator, and compare them there. Apparently, the optimizer still estimates that aggregating the data first is cheaper. (I personally doubt this – after all, without the aggregation the lower input would actually stop executing as soon as a match is found; in this case that would almost always be after reading the very first row. But even in other cases it is reasonable to assume that there is at least a non-zero chance of having to read fewer rows if you compare each row and stop after finding a match, as opposed to first reading and aggregating all rows and only then doing the comparison.)

Compute Scalar #4

The final operator to discuss is the Compute Scalar. Looking at its properties, it is easy to see what it does: it simply adds an extra column to each row returned from Nested Loops #5; this column is called (simplified) d1.Name and is set to null.

This makes sense. The Nested Loops operator returns only a single column, d2.GroupName. But Concatenation expects all input data streams to return the same number of columns, so the equivalent of the d1.Name column must be added before the stream can flow into the Concatenation.

Bringing it together

As already mentioned above, the Concatenation operator effectively combines the results of two independent queries. The first, shown on top in the execution plan, performs a left outer join between the two inputs. This returns two of the three elements that the query asks for: all matching rows (joined as normal), and all unmatched rows from the first table. The bottom input of the query then finds only the unmatched rows from the second table, which the Concatenation operator then adds to the results.

Effectively, the execution plan can be thought of as implementing this query, which is semantically equivalent to the original query though more complex:
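
Reconstructed from the elements described above, that equivalent query reads roughly:

    SELECT d1.Name,
           d2.GroupName
    FROM   HumanResources.Department AS d1
    LEFT OUTER JOIN HumanResources.Department AS d2
          ON d2.DepartmentID > d1.DepartmentID

    UNION ALL

    SELECT NULL,
           d2.GroupName
    FROM   HumanResources.Department AS d2
    WHERE NOT EXISTS
     (SELECT *
      FROM   HumanResources.Department AS d1
      WHERE  d2.DepartmentID > d1.DepartmentID);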

If you execute the query above and look at its execution plan, you will see that this query does indeed have the exact same execution plan as our original query.

Optimization

It may appear wasteful to do two joins and four scans (plus the extra aggregation and concatenation) instead of a single join and two scans. But as explained in the introduction, I forced the optimizer to find a workaround because I submitted a query that simply could not be executed as a single join between two tables, due to limitations in the available algorithms. So in that sense, this is really the best the optimizer could do.

That being said, there certainly is room for improvement. I already mentioned above that I doubt whether aggregating all rows to do just a single comparison in the Nested Loops operator is actually more effective than returning individual rows – especially because a Nested Loops with logical operation Left Anti Semi Join can stop calling rows from the inner input after the first match is found.

There are two other optimizations possible that in this case were probably not done because of the small table size (16 rows for both input tables), but that I have actually seen in execution plans for similar queries on much larger tables. The first is based on the observation that Nested Loops #5 uses a static inner input. This means that it will be the same for each execution – so why compute it over and over again? Why not add a Table Spool operator so that the actual computation of the MIN(d1.DepartmentID) is done only once, storing the result in a worktable which can then produce it again on every execution without having to recompute it? And also, since DepartmentID is the clustered index key, why not change the Ordered property on Clustered Index Scan #8 to true and add a Top operator, so that only the first row (which by definition stores the lowest DepartmentID value) is returned? (See the sketch below for that last idea.)
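
Expressed in T-SQL, that last idea corresponds roughly to the sketch below (my illustration, not a query from the actual plan). It relies on the fact that NOT (d2.DepartmentID > MIN(d1.DepartmentID)) is the same as d2.DepartmentID <= MIN(d1.DepartmentID), and that an ordered scan plus a Top returns that minimum without aggregating all rows:

    SELECT d2.GroupName
    FROM   HumanResources.Department AS d2
    WHERE  d2.DepartmentID <= (SELECT TOP (1) d1.DepartmentID
                               FROM   HumanResources.Department AS d1
                               ORDER BY d1.DepartmentID);  -- ordered scan + Top = MIN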

However, my biggest disappointment is that the optimizer even chose to use a static inner input at all. I really would have expected to see a dynamic inner input on Nested Loops #5. The Outer References property would have been set to d2.DepartmentID. And the entire inner input would have been replaced by a single Clustered Index Seek operator on table d1, with its Seek Predicates property set to d2.DepartmentID > d1.DepartmentID. Each execution of this inner input would then navigate the index to the first matching row and either return nothing (if no match exists) or return the first match – and then not be called again so any other matching rows are not returned.

Long story short, I would have preferred to get the execution plan that is actually produced for the (equivalent) query below:

Conclusion

All join operators have their specific limitations. Due to these, it is possible to write a query that at first glance appears impossible to run. But as long as it is a syntactically correct query, SQL Server has to be able to run it.

In this post we looked at the hoops SQL Server has to jump through in one such case, where the inequality-based join predicate forces a Nested Loops join, but the query asks for a join type that is not supported by Nested Loops. We saw that in this case the query effectively gets rewritten to an equivalent but more complex query that can use supported operators.

I am still open for suggestions on topics to cover in this series. If you see an unusual and interesting pattern in an execution plan, let me know so I can consider using it in a future post!

Plansplaining, part 7. The Constant Scan that returns no data

This is part seven of the plansplaining series. Each of these posts takes an execution plan with an interesting pattern, and details exactly how that plan works.

In this post we look at a deceptively simple query: a simple SELECT with an ISNULL to show either a row returned or a placeholder value. And yet there is more going on under the covers than one might expect.

Sample query

The query below can be executed in any version of the AdventureWorks sample database. It returns a code string representing the version number on the 2016 and 2017 versions; on older versions it instead returns the hardcoded text “Older version”.
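
The query was along these lines; the TOP (1), the N'13' boundary, and the 'Older version' placeholder follow from the plan discussion below, while the output alias is an assumption:

    SELECT ISNULL(
             (SELECT TOP (1) [Database Version]
              FROM   dbo.AWBuildVersion
              WHERE  [Database Version] >= N'13'),
             N'Older version') AS [Version];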

So what do we have here? A straightforward single-table query with a very simple predicate, embedded in an ISNULL function call. You would probably expect some scan or seek plus a Top operator to evaluate the subquery, plus a Compute Scalar for the ISNULL. In reality the plan is not quite as simple, as shown below. (As in earlier episodes, I have added the NodeID numbers into the execution plan to make it easier to reference individual operators).

We do indeed see a Top (#3), a scan (#4), and a Compute Scalar (#0), but there are also two extra, unexpected operators: Nested Loops (#1), and Constant Scan (#2). Why are those in the execution plan?

Constant Scan #2

In part 3 of this series I wrote: “Whenever we see a non-obvious Constant Scan operator, our first stop should always be its properties, specifically the Output List and Values properties”. Well, this Constant Scan is definitely not obvious, so let’s look at its properties.

Now there is an Output List property, but it is completely empty. And there is not even a Values property at all. That’s … odd! Does this mean that this Constant Scan returns nothing? Not exactly. A Constant Scan always returns at least one row. It normally returns one or more rows, with columns as defined in the Output List property and values as defined in the Values property. But when those properties are missing, it will still return a single row.

That single row is special. It is a type of row that we would never be able to use in our queries; it can only exist in intermediate phases in an execution plan, because it has zero columns. This can be seen in the properties if you look at the Estimated Row Size property, which is 9 bytes. That is exactly the size of the row header within execution plans, a small area where metadata of each row is described (very similar to the row header you will find on data pages for rows stored in tables). The Actual Number of Rows property does indeed confirm that a single row was actually returned by this operator.

If you look at many execution plans you will notice this usage of Constant Scan more often, usually combined with a Compute Scalar. Constant Scan cannot do computations, but Compute Scalar needs an input row to start working and to store its result in; the combination of the two works fine: Constant Scan generates an “empty” row and Compute Scalar then adds data to it.
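
An easy way to see this pattern for yourself (assuming you capture an actual execution plan in SSMS) is a SELECT without a FROM clause that uses a function that cannot be folded into a constant, for example:

    -- A SELECT without FROM still needs a row for Compute Scalar to work on;
    -- the plan typically shows a Constant Scan feeding a Compute Scalar
    SELECT GETDATE() AS CurrentDateTime;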

In this case there is no Compute Scalar to the direct left of Constant Scan though. So let’s see what happens next.

Nested Loops #1

If you read the previous parts you already know that execution doesn’t start at Constant Scan. It starts at the far left at SELECT, which calls Top; Top then calls Nested Loops, which calls Constant Scan. As soon as Constant Scan returns its first row (we already know that it’s the only row but Nested Loops does not know or care about that yet), Nested Loops calls into its inner (lower) input by requesting a row from the Top operator.

Top #3 and Clustered Index Scan #4

This part of the execution plan is pretty obvious so I’m not going to cover it in detail. When Top is called, it first calls the Clustered Index Scan operator. This operator has a pushed down search predicate for [Database Version] >= N’13’. Since the AWBuildVersion table in AdventureWorks always contains exactly one row, this scan will either return that row, or no data at all (in which case an “end of data” signal is returned instead).

If a row is returned, Top will pass that row unchanged to Nested Loops. If no row was returned from Clustered Index Scan to Top, then Top will obviously also not return a row to Nested Loops.

Nested Loops #1 (continued)

At this point we know that the Nested Loops operator was called, it then first called its outer input, Constant Scan, to receive an empty row; it then proceeded to call its inner input, Top, from where it may or may not (depending on the version of AdventureWorks) have received a row.

The next thing to do is to check the join condition (in the Predicate property). In this case that property is not present, which means that every row is considered a match. So if Top returns a row, it is considered a match for the empty row returned from Constant Scan and a row is returned to Compute Scalar; if Top returns no row then it is not a match but because the Logical Operation is a Left Outer Join it will still return a row, using Null values for columns originating from the inner input.

The Output List shows that the row returned contains just a single column: [Database Version]. On a newer version of AdventureWorks, this will indeed be the database version from the AWBuildVersion table; on older versions the inner input returns no row so the Nested Loops will provide a Null value in this column.

The Compute Scalar operator (#0) uses a simple isnull expression to replace Null values with the ‘Older version’ text from the query while letting the actual database version on newer builds pass unchanged. (Because of this simplicity I decided not to add a screenshot for this). This row is then returned to the client. Because the client does not know that this will always be a single-row result, it then calls Compute Scalar again, which calls Nested Loops. At this point there are two options.

On newer versions, the inner input of Nested Loops returned a row, so Nested Loops calls it again to check if there are more matching rows. The first operator in that branch is Top, and its Top Expression property is (1). Since it already returned a row it will not bother to call the Clustered Index Scan operator again but immediately return “end of data”. At that point Nested Loops will switch back to the outer input (the Constant Scan operator). But because Constant Scan has no Values property, it will only return a single empty row, not multiple – on this second call it will return “end of data”, so now Nested Loops also returns “end of data” which then propagates through the rest of the plan.

On older versions of AdventureWorks, the inner input had already returned “end of data” after the first call, so in this case the Nested Loops operator will immediately return to the outer input. Obviously with the same end result.

But … why?

All the above explains how the execution plan for this query works. But it doesn’t really explain why these extra operators were needed. Wouldn’t SQL Server have been able to return the same result by simplifying the execution plan?

The picture above is not captured from SQL Server; I created it using copy and paste on the screenshot of the original execution plan, to illustrate the type of execution plan many people would have expected for our sample query. Why was this much simpler execution plan not chosen?

If you somehow manage to force SQL Server to run the above execution plan in an AdventureWorks2016 (or 2017) database, you will actually get the expected result. And that makes sense if you follow along: operators call each other left to right until the Clustered Index Scan starts, which finds a row to return. Top passes that row because its only task is to stop execution AFTER the first row. Compute Scalar then applies the isnull expression and your database version is returned; Top then ensures that no second row is returned even if more data would exist in the AWBuildVersion table.

But this same execution plan on AdventureWorks2014 (or older) would behave differently. The Clustered Index Scan operator would not find a row to return. This means that “end of data” flows from the scan through Top, Compute Scalar, back to SELECT and you would get an empty result set instead of the single row with the text “Older version”.

At this point you might be wondering “but what about the isnull expression in the Compute Scalar operator?” Understandable. But here is a very important thing to remember: almost all operators in execution plans operate only on the rows they receive. Scan and seek operators receive rows from the storage subsystem; Constant Scan receives rows from its own properties; all other operators receive rows from their child operators in the plan. The only exceptions I know of are a Constant Scan with no Values property (which returns a single empty row), and a Stream Aggregate operator with no Group By property (which still needs an input but returns a single row even if its input is empty).

Compute Scalar is not an exception. Compute Scalar computes its expressions for each row it receives, adds the computed values and then returns that row. If it never receives a row, then it will not compute its expressions, because there is no input data to use in those expressions and no row to add these values to. That’s why the faked execution plan above would not return the expected result.
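
This is easy to verify for yourself (a small sketch using a temporary table; any empty table works):

    -- A scalar Stream Aggregate returns one row even for empty input,
    -- so this query returns a single row with the value 0. A Compute
    -- Scalar on its own could never produce that row.
    CREATE TABLE #Empty (SomeValue int);
    SELECT COUNT(*) AS NumberOfRows FROM #Empty;
    DROP TABLE #Empty;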

Conclusion

A Constant Scan operator normally returns one or more rows with one or more columns of data, as defined in the Output List and Values properties. However, Constant Scan can also be used with an empty Output List and no Values property. In that case it will return a single row that has no columns.

This empty row can be used to generate a placeholder that other operators (often Compute Scalar) can then store values in. It can also simply be used to ensure that other operators actually receive a row of data, since that is for most operators the only way to get them to do any actual work.

The example in this blog post combined both: the Nested Loops operator needed a row from its outer input to make it start reading from its inner input, and then the data returned from that inner input was added to that empty row and passed to other operators to do the rest of the work.

I am still open for suggestions on topics to cover in this series. If you see an unusual and interesting pattern in an execution plan, let me know so I can consider using it in a future post.

If no suggestions come in, then episode 8 will probably focus on recursion.


Plansplaining, part 6. Aggregates with OVER.

This is the sixth post in the plansplaining series. Each of these blog posts focuses on a sample execution plan that exposes an uncommon and interesting pattern, and details exactly how that plan works.

In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in the pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you did not read part 1 already, I suggest you start there.

In this post I will take a look at a simple usage of the OVER clause with a not-so-simple execution plan. This particular usage of OVER has been available since SQL Server 2005. It allows us to mix detail and aggregated information in a single query. And I’ll have to admit that this particular execution plan pattern has had me baffled for more than a year before I finally understood exactly how it works.

Sample query

For this example I once more use the AdventureWorks sample database. I tested this on SQL Server 2017, but I have seen the same execution plan pattern in all versions since at least SQL Server 2012.
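
The general form of the query can be reconstructed from the columns and the PARTITION BY specification discussed below (a sketch; the original query may have differed in details):

    -- Detail rows plus a per-territory total in one result set.
    SELECT BusinessEntityID,
           TerritoryID,
           SalesYTD,
           SUM(SalesYTD) OVER (PARTITION BY TerritoryID) AS TerritorySalesYTD
    FROM   Sales.SalesPerson;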

The query above produces a very simple report, showing salespersons and their Sales YTD figure, compared to the total Sales YTD for all salespersons in their territory. The latter figure is computed by the OVER clause, as documented here.

When SQL Server 2005 was released, everyone was excited about the sheer simplicity of this solution, which previously required a correlated subquery. But a simple query does not necessarily equate to a simple execution plan, as you can see below. (As in the previous posts, I have added the node ID of each operator in the figure for easy reference in the rest of the text).

This appears to be an outrageously complex execution plan for such a simple query. The query reads from a single table, with no filters, no joins, and no ORDER BY. And yet the execution plan uses two join operators and a sort, plus a total of three Table Spool operators.

Time to jump in at the deep end and see what is going on in this execution plan!

Let’s start at the very beginning

… And that beginning is in this case the far right of the top branch of the execution plan. Obviously, just like every other execution plan, this plan really starts executing at the top left. The SELECT pseudo-operator calls Nested Loops #0, requesting a row. Nested Loops then requests a row from Table Spool #1 and this continues until the far end of the upper branch: Clustered Index Scan #4. But this is where the real action starts, as this operator reads a row from the SalesPerson table and returns it. The Ordered property is set to False because the optimizer does not care about order. No surprise, since the rows are immediately fed into a Sort operator (#3).

After receiving this first row, Sort #3 will store it in memory and then immediately call Clustered Index Scan #4 again. This repeats until the entire table has been read, delivered to Sort #3, and stored in memory. (I am not going into the details of spills at this time). Once all data is in memory, the operator sorts the rows.

The properties of Sort #3 show that this sorting is done by TerritoryID. It is not much of a stretch to assume that this is done to prepare for calculating the total SalesYTD for each territory, as specified in the PARTITION BY specification of the OVER clause. However, at this point this is an educated guess. I need to keep in mind while inspecting the rest of the plan that I might be wrong.

All this work, fetching all 17 rows and sorting them by territory, has to be done first. Once done, Sort #3 can finally return the first row to Segment #2. That’s why Sort is considered a blocking operator. If the input of the Sort is a huge tree of joins between large tables that takes 5 minutes to evaluate, the blocking behavior means that 5 minutes will pass before the first row is returned from the Sort. Later requests require less work: once the sorted results are in memory, each subsequent call can immediately be returned.

The sorted rows are delivered to Segment #2. I recently blogged about this operator so I will not spend a lot of words on it. Its property list shows that it is adding a segment column, Segment1002, which marks every row where the TerritoryID is not equal to that of the previous row. This, too, appears to be related to the PARTITION BY specification in the OVER clause (but the same caveat applies).

Table Spool #1

So far, I have not covered anything special or unusual. It may not yet be perfectly clear why this work is being done, but it’s not hard to see what these operators do.

The actual fun starts at Table Spool #1. We have seen a Table Spool before, but I need to provide a bit of extra context here. According to the available documentation and other resources, Table Spool has two logical operations: Lazy Spool and Eager Spool.

An Eager Spool operation is blocking, just as the Sort we saw before. It first consumes its entire input and stores it (in a worktable, unlike Sort which uses memory). After that, it returns those rows, one by one, to its parent operator. The operator can later be initialized again, with the request to produce the same set of rows (“rewind”). In that case it will produce them by reading from the worktable. It can also be initialized again as a rebind, meaning that the worktable is emptied, and it again requests rows from its child, stores them, and returns them after processing the entire input.

A Lazy Spool has the same basic functionality of storing the rows it reads into a worktable, from which it can then later return the same rows again. Unlike an Eager Spool though, a Lazy Spool is not blocking. It will not try to fill the entire worktable first. It simply requests a row, stores it, and returns it; then repeats that for the next row. The rebind and rewind functionality are exactly the same.

This Table Spool is marked as a Lazy Spool, so it is non-blocking. The properties show a single execution, which is a rebind (which is always the case for the first execution). There are no rewinds, so you might wonder why the optimizer adds the extra work of storing all rows in a worktable when they are not processed a second time. We’ll see that later.

The missing link

However, the actual catch is that the described Lazy Spool behavior does not actually explain how this execution plan works. One essential piece of the puzzle is missing. It took me well over a year (and for the record, I was not investigating on a daily basis!) before I finally realized what I missed. The key clue for this was in the rowcounts, as shown in the screenshot below (edited for readability).

A Table Spool, working in Lazy Spool mode, and executed exactly once, would normally always return exactly as many rows as it receives. That is not the case here. It reads 17 rows, which corresponds to the number of rows in the input data. However, it returns just 12 rows. If you look at the results of the query, you might notice that this number is equal to the number of distinct TerritoryID values plus one.

After looking at multiple other execution plans with this pattern, it is clear that the number of rows going into the Table Spool is always equal to the number of rows in the input data, and the number of rows it returns is always equal to the number of distinct values of the PARTITION BY column (or columns), plus one. That can’t be a coincidence!

Segment aware

At this point I realized that the only possible explanation can be that the Table Spool operator is segment aware. As explained in an earlier post, an operator that is segment aware changes behavior if it is bound to a segment column. In plans with multiple Segment operators it can be a challenge to figure out which segment column it is bound to because this is not in any way represented in the execution plan. But in this simple case there is only one option. The Table Spool operator can only be tied to the segment column produced by Segment #2.

When tied to a segment column, Table Spool works in a way that falls somewhere in between the Lazy Spool and Eager Spool behaviors. When called, it repeatedly calls its child node to get a row, similar to an Eager Spool. However, instead of processing the entire input, it only processes a single segment. After it receives (and stores in its worktable) all rows for a segment, it then returns just a single row to its parent. This single row represents the entire segment.

So when execution of this query starts and Table Spool #1 is called the first time, it calls Segment #2 to grab a row and store it, then repeats that until a new segment starts. We know from the results that the first segment represents all rows with TerritoryID NULL, of which there are three. Those three rows are read and stored, and then a single row to represent the segment for TerritoryID NULL is returned to Nested Loops #0. That operator then uses its inner (lower) input to produce the full set of three rows for the segment (as will be described below). After that Nested Loops #0 once more calls Table Spool #1, which clears the worktable and then repeats the process for TerritoryID 1.

Nested Loops #0

So far we have established that the outer (top) input of Nested Loops #0 produces one row for each segment/territory. This implies that the inner (lower) input executes once for each segment. Since this operator returns its data directly to the SELECT pseudo-operator, it has to return the final result set. With one row per segment coming from the outer input, this means that the full result set (except, perhaps, the TerritoryID column) has to come from that inner input.

Before diving into the details of that inner input, I want to point out one oddity of this Nested Loops operator. Normally, a Nested Loops operator will have either an Outer References property or a Predicate property. This one has neither.

An Outer References property lists the columns that, coming from the outer input, are pushed into the inner input. Every time these values change, the logic in the inner input ensures that a different set of rows is returned: only the rows that are correct matches for these columns. Because the inner input is custom-tailored, the operator can assume that every row returned is a match; no Predicate property is needed.

When no Outer References property is present, nothing from the outer input is pushed into the inner input. Except for concurrency issues (and one specific case that I’ll address shortly), the same rows will be returned over and over. In this case, the Predicate property is used so the Nested Loops operator can distinguish matching from non-matching rows.
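
To see both flavors for yourself, you can force a Nested Loops join and compare plans with and without a usable index on the join column (a sketch; the optimizer may of course choose differently):

    -- With an index on the join column, the join column is typically pushed
    -- down as an Outer Reference that parameterizes a seek on the inner
    -- input; without a usable index, a scan plus a Predicate property on
    -- the Nested Loops operator is more likely.
    SELECT p.FirstName, e.JobTitle
    FROM   Person.Person AS p
    INNER JOIN HumanResources.Employee AS e
           ON  e.BusinessEntityID = p.BusinessEntityID
    OPTION (LOOP JOIN);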

In this case, Outer References and Predicate are both missing. This is normally only seen for a cross join. However, I have looked at the output and I am pretty sure that it’s not eleven identical copies of any data. This execution plan is a very specific case where, without Outer References, each execution of the inner input still returns different data. This is accomplished by the segment-aware operation of Table Spool #1 in combination with the two Table Spool operators in the lower input.

Inner input of Nested Loops #0

For understanding the inner (bottom) input of Nested Loops #0, I will track the first execution in detail. This is for the first segment, TerritoryID NULL, consisting of three rows.

When the inner input for this segment starts, Nested Loops #5 is called. This operator then calls its outer input. Compute Scalar #6 calls Stream Aggregate #7, which in turn calls Table Spool #8.

Table Spool #8

In the description of the Table Spool operator above I state that it requests data from its child nodes, stores it in a worktable, and returns this data. Possibly multiple times. But Table Spool #8 does not even have a child operator. So where does this operator get its data from?

This is in fact yet another way that Table Spool can operate. There is, as far as I know, no official name for this. I call it a Table Spool operating in “consumer-only” mode, because it consumes data from a worktable that is produced by another Table Spool. You can also see this in the properties. The presence of a Primary Node ID property indicates that this is a consumer-spool. The value of this property indicates which spool’s worktable it uses. In this case the value is 1, so Table Spool #8 returns data from the worktable that is created by Table Spool #1.

We saw earlier that, for the first segment, Table Spool #1 contains three rows. The three rows in the input table that have TerritoryID NULL. When Table Spool #8 is called it returns the first of these rows. When called again it returns the second, and so on. On the fourth call it returns an end of data signal.

Stream Aggregate #7

Table Spool #8 returns its rows to Stream Aggregate #7. In this execution plan, this Stream Aggregate operator has no Group By column. This means that it produces a scalar aggregate, a single row with aggregated data for the entire set. The Defined Values and Output List properties expose that it computes and returns the number of rows (3) and the sum of the SalesYTD values in these three rows.

The function of this operator in this area of the execution plan is very similar to the Stream Aggregate I discussed in the first plansplaining post, so I will not go into details here.

Compute Scalar #6

The single row with aggregated values for the current segment is then passed to Compute Scalar #6. This operator is not very interesting either. The only interesting property is Defined Values. This shows that the only reason this operator exists is to check how many rows are in the current segment. If this number is zero, the result of the sum of all SalesYTD values is changed from whatever the Stream Aggregate operator returns for an empty set to NULL. This is defined as the correct result of a SUM function on an empty set.
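
The rule that this Compute Scalar enforces is easy to demonstrate in isolation (a quick sketch):

    -- SUM over an empty set must return NULL, not 0. Internally the Stream
    -- Aggregate computes a row count and a sum, and a Compute Scalar
    -- replaces the sum with NULL when that count is zero.
    SELECT SUM(SalesYTD) AS TotalSalesYTD  -- one row, containing NULL
    FROM   Sales.SalesPerson
    WHERE  1 = 0;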

Nested Loops #5

For this first segment (and for every other segment, for this is how Stream Aggregate works when doing scalar aggregation), the outer input of Nested Loops #5 returns just a single row. This row contains the total of all SalesYTD values for the current segment (in this case for the three rows with TerritoryID NULL). This single row then drives the inner input of Nested Loops #5, which will add back the original detail data (as I’ll show below).

There is an intriguing difference between the two Nested Loops operators in this plan. Both need to join their outer input to each of the rows from the inner input. In both cases the Table Spool magic ensures that even though there is no Outer References property, the inner input actually changes to the correct set of rows on each next execution. So both these Nested Loops operators need neither Outer References nor Predicate. And yet, they are not the same.

If you look at the properties of Nested Loops #5, you will see that it actually does have a Predicate property. However, the value of this property is a simple constant value: 1. So where Nested Loops #0 has no Predicate to ensure that each row is a match, Nested Loops #5 uses a Predicate with a constant value that converts to the logical value true to achieve the same effect.

Please do not ask me why the optimizer chooses to use these two different techniques to achieve the same effect.

Table Spool #9

The last operator in the execution plan is Table Spool #9. Looking at its properties (not included in this blog post as it would be very repetitive), you can see that this operator also leeches on the hard work done by Table Spool #1. So this operator, again, reads and returns (upon three consecutive calls) the three rows for TerritoryID NULL that were originally read from the base table and stored in a worktable by Table Spool #1.

The Output List property of this Table Spool shows that it returns the three columns we need from the base table in the final output: BusinessEntityID, TerritoryID, and SalesYTD.

We already saw that the outer input of Nested Loops #5 produces a single row with the total sales for TerritoryID NULL. The inner input then, using this consumer Table Spool, adds the three original rows to that. This in effect produces the first three rows of the result set.

Nested Loops #5 returns these three rows (as always, one at a time) to Nested Loops #0. That operator returns them to SELECT so they can be passed to the calling client.

The next iteration

So far, Nested Loops #0 requested and received a single row from its outer input. After a lot of processing that row was returned; we have seen that this row represents the segment for TerritoryID NULL. Nested Loops #0 then requests rows from the inner input. The first three calls produced rows that are part of the result set. The fourth call results in an end-of-data signal. At this point Nested Loops returns to the outer input to request the next row.

I already described how the entire outer input works. The request bubbles down to the Sort operator. (The Clustered Index Scan operator is not used this time. Sort already has all its data in memory so it simply returns the next row when called). All rows for the next segment, for TerritoryID 1, will be stored in the worktable (after clearing out the previous contents) and a single row to represent this segment is returned to Nested Loops #0.

When that operator then once more requests rows from its inner input, the same operators will do the same thing as before. However, because the content of the spooled worktable has been changed in between, those operators now produce different results. This is how, even without an Outer References property, the Nested Loops operators in this execution plan manage to receive a different set of data for every execution of the inner input.

Loose ends

When it comes to execution plans, I am a sucker for details. (Cue sounds of faked surprise from the readers).

The description above sounds very logical. But does it really match with all the details you can glean from the execution plan? Here is the bad news: it doesn’t. Upon very critical inspection, two issues stand out. One is related to the Output List property of the Table Spool operators. The other is related to the Actual Number of Rows returned by Table Spool #1.

Let’s investigate these loose ends and see if there is a logical explanation.

Output List

The screenshot on the left shows the Output List property of Table Spool #1. The TerritoryID column is expected: the row returned represents a segment, which equates to a single TerritoryID value. The other columns are surprising. Given that there are for example three rows for the NULL segment, which of the three rows are these values taken from? And why are the other rows not used?

Look at the rest of the plan, and you’ll see that Nested Loops #0 receives identically-named columns from its other input. It then returns columns of these names to its parent. It is in my experience very rare to see duplicated column names in execution plans that do not actually refer to the exact same data, but this is an example where this does happen. I must admit that I do not know how Nested Loops #0 picks which of the two BusinessEntityID, TerritoryID, and SalesYTD columns it returns. But I do know that, at least in this case, it always picks the inner input for at least the BusinessEntityID and SalesYTD columns. That is the only way this execution plan makes sense.

This implies that the columns that Table Spool #1 returns to Nested Loops #0 are effectively ignored. Actually returning them appears to be a waste. Why is this done?

Looking at the other Table Spool operators, they all have the exact same Output List. Totally expected for Table Spool #9: this is the spool that adds back the original rows, so all these columns are in fact needed. For Table Spool #8, though, only SalesYTD would be needed; none of the other columns are used by or returned from its parent operator (Stream Aggregate #7). Again, an (admittedly small) waste of work to return more columns than needed.

My guess is that this is just a technical limitation of the Table Spool operator. There is no property to define which columns to store in the worktable. It makes sense to assume that the Output List property does double duty for this operator: it defines the columns returned as well as the columns stored in the worktable. That would explain why Table Spool #1 has no other option but to return all three columns, even though they are not used. For Table Spool #8, which consumes the worktable produced elsewhere, a further speculation is needed. My guess is that this is a further limitation, or rather the reverse of the previous limitation: The Output List always has to be equal to the set of columns stored in the worktable.

The extra row

The other loose end is in the number of rows returned by Table Spool #1. In the above explanation, I describe the segment-aware operation of Table Spool as storing the entire segment in its worktable and then returning just a single row to represent that segment. However, there are eleven distinct values of TerritoryID in the data (each of the numbers 1 to 10, plus NULL), and the actual number of rows returned by Table Spool #1 is twelve. Where does that extra row come from?

I must admit that I have not (yet?) figured this out completely. At this point I have two possible theories.

The first theory is that this is just a bug. Table Spool #1 detects that a segment is complete when it reads a row with the segment column set. However, this is actually already the first row of the next segment. This row cannot go into the worktable yet (that would cause incorrect results from the rest of the query), so it has to temporarily store that row in memory. It then first returns a row for the previous segment. When called again, it empties the worktable, stores the row from memory in it, and then proceeds to request the next row.

However, the very first row it reads also has the segment column set. In this case there is no previous segment. Properly coded, this first segment needs to be handled as a special case. But what if this special case was forgotten? It would return a row to represent the (non-existing) previous segment – and no, I do not know what values would be in the columns it returns. That row would then drive an execution of the inner input of Nested Loops #0. Since the worktable is empty, Table Spool #8 returns nothing; this then results in Stream Aggregate #7 and Compute Scalar #6 also returning nothing, so Nested Loops #5 would not even bother to call Table Spool #9 anymore.

All the numbers in the execution plan (both the Actual Number of Rows and the Number of Executions properties of each operator) match up with this explanation. There is but one oddity, though. A scalar Stream Aggregate that reads no rows would normally still produce a single row as its output (just do a SELECT COUNT(*) on an empty table and look at the execution plan to verify). This one doesn’t. The Number of Executions is 12, but the Actual Number of Rows is 11, so it returns nothing in this case. There is nothing in the execution plan, not even in the full XML, that tells the operator to behave differently from a “normal” scalar Stream Aggregate. This is the part I do not understand. If anyone does know why this happens, please post a comment and share your knowledge (or even your theory).

The second possible explanation for the 12th row coming from Table Spool #1 is that this is not a bug, but by design. If that is the case, then a Table Spool running on segmented input apparently is designed to always return one extra row, before returning the rows that represent the segments. That design would only make sense if there are specific use cases where this row is needed and used to perform some specific work. I have never seen this in any execution plan I studied, so I have no way to even speculate on the reason for this. If you, reader, have any speculation here, or if you have ever seen an execution plan that gives you reason to believe that extra row is used, please let me know in the comments section!

The rest of the above explanation still applies. The extra row is returned to Nested Loops #0, which invokes its inner input; that inner input ends up returning nothing because Table Spool #8 cannot produce rows from an empty worktable.

Seeing it in action

The video below (double-click to enlarge) visualizes all of the above. Don’t pay attention to the two extra Compute Scalar operators, these are used for the computations I added in order to slow down the processing sufficiently to create this animation.

If you watch the animation, you can see most of what I describe above in action. The first phase is when Clustered Index Scan #4 reads all its rows and returns them to Sort #3. Once done, the sorting happens (not slowed down and not visible in the animation), and then the second phase starts.

Due to where I placed the delay, you now miss some of the action as some numbers jump up immediately: 4 rows returned from Sort #3 (all rows for the first segment plus the fourth row, the start of the second segment that tells Table Spool #1 that the first segment is complete), 2 rows returned by Table Spool #1 (the first for the “bug or unknown feature” empty segment, and the first real segment), and you see stuff happening in the lower part of the execution plan. This is in fact already the second execution. Because the first execution processes no rows, it is not affected by the delay and you cannot see it happening. We can infer that it has happened by looking at all the numbers.

After that you see the same process repeat a few more times: a few rows (for a single segment) flowing from Sort #3 through Segment #2 to Table Spool #1, one row returned from Table Spool #1 to Nested Loops #0, and then the inner input of that Nested Loops reading the rows for that segment from Table Spool #8, and then again from Table Spool #9.

Conclusion

This was an extremely long post, even by my standards. If you are here, you either skipped parts (Boo! Hiss!); or you actually read it all (Applause!). Well done!

The main takeaway of this post is that Table Spool is a segment aware operator. When bound to a segment column, it eagerly builds a worktable for a single segment, returns one row to represent that segment, then clears out the worktable and builds it again for the next segment.

In this specific case, other Table Spool operators on the inner side of a Nested Loops join were reading data from that same worktable. Even though the Nested Loops join did not have an Outer References property, so no values were pushed down to change the results returned by the inner input, the ever-changing data in the worktable resulted in the same effect.

Next month will be shorter, I promise. In episode 7, I will look at an execution plan that uses what perhaps is one of the simplest operators in existence, Constant Scan, in a rather creative pattern.

But let me repeat my standard closing paragraph: I do not want this series to be about only my interesting plans. Reader input is greatly encouraged! If you have ever run into a pattern in an execution plan that appeared to make no sense until you dug into the gory details, or even into a pattern that still makes no sense to you, let me know and I will consider it for inclusion in a future post.


Plansplaining, part 5. Bitmaps

This is the fifth post in the plansplaining series. Each of these blog posts focuses on a sample execution plan that exposes an uncommon and interesting pattern, and details exactly how that plan works.

In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in the pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you did not read part 1 already, I suggest you start there.

In this post I will take a look at bitmaps. Bitmaps are most commonly seen in execution plans for so-called star join queries.

Sample query

The query for this post is very similar to the query we used last month. In fact, the only thing I did was to add two filters to the query; other than that the query is unchanged.
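
In general shape, the query was something like the sketch below. The join columns follow from the plan; the literal filter values are assumptions, loosely based on the examples later in this post:

    -- Star join; the two added filters are on the dimension tables only.
    SELECT   p.BrandName,
             SUM(fos.SalesAmount) AS TotalSales
    FROM     dbo.FactOnlineSales AS fos
    INNER JOIN dbo.DimStore      AS st ON st.StoreKey  = fos.StoreKey
    INNER JOIN dbo.DimProduct    AS p  ON p.ProductKey = fos.ProductKey
    WHERE    st.EmployeeCount < 30      -- assumed value
    AND      p.ColorName = N'Black'     -- "black products"
    GROUP BY p.BrandName;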

If you run this query in ContosoRetailDW and inspect the execution plan, you will see that the small change in the query has resulted in a small change to the execution plan. The picture below represents the right-hand side of the execution plan, where the data is retrieved, joined, and filtered. (The left-hand side is responsible for grouping and aggregating; a detailed discussion of that process is out of scope for this post).

If you compare this picture to the one in the previous post, you will see more similarities than differences. Once you start digging into the properties you will see a few more differences but most of the execution plan is actually the same. Which means I can save a lot of time by not explaining the things that are unchanged.

Where are my filters?

Based on the additional WHERE clause in the query, you might have expected to see one or more Filter operators added to the plan. Or the use of Index Seek operators with appropriate indexes on the EmployeeCount and ColorName columns. But there are no such indexes so the optimizer has no better option than the Clustered Index Scan, and the reason that you see no Filter operator is because the optimizer loves to push predicates as deep into the tree as possible.

In this case it was able to push the predicates completely into the Clustered Index Scan (the screenshot shown here is for Clustered Index Scan #7, but #12 is very similar). It is still a scan. The storage engine still has to read all rows from the table – 306 in this case, as evidenced by the Number of Rows Read property. But the storage engine itself then checks the predicate, and only the 187 rows that match this predicate are then returned to the Clustered Index Scan operator (which then, as usual, returns them to its parent operator).

If you look carefully, you may notice that the widths of the arrows into and out of the two Parallelism operators (#7 and #11) are not the same. In this case you need to look hard to see it, but I have seen execution plans where the difference in width was far more obvious. It appears as if the Parallelism operators return fewer rows than they receive. That is not the case, though. What you see is the result of a visual bug in Management Studio: when an operator exposes both the Actual Number of Rows and the Number of Rows Read properties, the arrow width is based on the Number of Rows Read. It represents the amount of work done by the parent operator, but not the amount of data flowing between the operators. Since arrow width normally represents the number of rows, I consider this to be highly confusing and I have reported it as a bug.

The bitmaps

If you look back at the execution plan I showed in the previous post, you might realize that just adding these two pushed-down predicates into the two Clustered Index Scan operators would have been sufficient. No other change is needed, and the joining and filtering part of the query would have been very straightforward: read only the small stores and the black products, then read all FactOnlineSales data and join everything. A lot of the rows read from FactOnlineSales would not be for black products or for small stores, so those rows would be removed from the intermediate results after the joins (which are all defined to do the Inner Join logical operation, which removes non-matching data from the results). The end result would be totally correct.

And yet, the optimizer has decided to do something else. It introduces two extra operators, called Bitmap, with “Bitmap Create” as the logical operation. That smells like extra work, and whenever I see something in an execution plan that smells like extra work I want to know exactly what it does so I can understand why the optimizer introduces it.

Bitmap #5 – creating a bitmap

As implied by its logical operation, “Bitmap Create”, the Bitmap operator creates a bitmap. (I assume that this is the only logical operation that the Bitmap operator supports, since I have never seen a Bitmap operator with a different logical operation). A bitmap is a structure that stores Boolean values for a consecutive range of values in a small amount of memory. E.g. the range from 1 to 8000 has 8000 possible values. These can be represented as 8000 bits in just 1000 bytes. For each value, the location of the corresponding bit can be computed by dividing the value by 8; the quotient is the location of the byte, and the remainder determines which of the eight bits within that byte to use. The value of that specific bit can then be tested, or it can be set to false (zero) or true (one).
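
For example, for the value 6141 the arithmetic works out as follows (a quick sketch):

    -- Integer division locates the byte; the remainder selects the bit.
    DECLARE @Value int = 6141;
    SELECT @Value / 8 AS ByteNumber,  -- 767
           @Value % 8 AS BitNumber;   -- 5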

The bitmap is named Opt_Bitmap1005 in this case. This name is not exposed in the quick property popup, but you can find it in the full property sheet as shown here, in the Defined Values property. You will also note that this bitmap is not included in the Output List property. That’s because this bitmap is not created for each individual row; there is a single bitmap that accumulates information from all rows. Other operators in the execution plan can reference this bitmap by name, even though it is not passed to them in their input rows.

You will also notice this operator uses hashing (sorry to those who are still recovering from last month…). It has a Hash Keys property, which is set to StoreKey in this case. The hash function here is not the same as the one used in Parallelism operator #6 (see previous post), as the hash function for a bitmap operator needs a much larger range than the eight-value range used to direct rows to threads. It may or may not be the same as the hash function used in Hash Match operator #4; these details are not published by Microsoft and not exposed in the execution plan in any way.

So here is what this operator does. When it is initialized, it creates a bitmap that is large enough for the range of possible results of the hash function, and it sets all bits to false (zero). Then, every time GetNext() is called it reads a row from its input, hashes the StoreKey to find a location in the bitmap, and sets the corresponding bit to true (one). It then passes the row, unchanged, to its parent operator. This repeats until all rows are processed.

Simple example: Let’s say that only three stores in our database have less than 30 employees. These stores have StoreKey 17, 38, and 82. The first row the Bitmap operator receives is for StoreKey 17; it hashes the number 17 and the result is 29, so bit 29 in the bitmap is set to one. It then reads the next row, for StoreKey 38; hashes the number 38 which results in 15; and sets bit 15 to one. The third row it reads is for StoreKey 82; due to a hash collision this number also hashes to 29 so it will once more set bit 29 to one (effectively not changing anything). There are no more rows so the end result will be a bitmap with bits 15 and 29 set to one and all other bits set to zero.

The bitmap is not used here. At this point it appears that SQL Server is doing lots of extra stuff for no good reason at all. Trust me, the reason for this extra work will become clear later.

One final note. I described the above as if this is a single-threaded operation. However, the execution plan runs in parallel. This means that the actual implementation is a bit more complex than suggested here. In fact, I do not even know exactly how a bitmap in a parallel execution plan works. One possibility is that there is a separate copy of the bitmap for each thread; each instance of the Bitmap operator then writes to its own copy without interference from the other threads, but operators that read the bitmap must then combine the bitmaps from all threads. Another possibility is that there is a single copy of the bitmap that is used by all threads at the same time; this necessitates additional complexity to ensure that the threads do not run into race conditions as they update the bitmap, but makes reading from the bitmap a lot simpler.

Clustered Index Scan #15 – using the bitmaps

I will now skip a lot of the steps in the execution plan. The two Hash Match operators, the Clustered Index Scan on the DimProduct table, and all the Parallelism operators, are all already explained in the previous post. And Bitmap operator #10 is doing effectively the same as Bitmap operator #5, except that in this case the hash is computed off the ProductKey column and the resulting bitmap is called Opt_Bitmap1004.

To understand why these two Bitmap operators are included, I will focus on the Clustered Index Scan operator (#15) that actually uses these bitmaps. If you look at its properties, you see that this operator now also has a Predicate property. You would have expected this if the query contained any direct filter on any columns from this table, but that is not the case. And the predicate looks different from what you normally see. There are two conditions here. After removing some extra parentheses and brackets and simplifying the tablename references, the second one reads “PROBE(Opt_Bitmap1005, fos.StoreKey, N'[IN ROW]’)”. But what exactly does this mean?

A PROBE function call in a predicate means that the operator has to verify a bitmap for each row. The first parameter specifies the bitmap. In this case it is Opt_Bitmap1005, which was created by Bitmap operator #5. The second parameter indicates which column should be hashed to find a bitmap location. If the third parameter is “IN ROW”, this means that the actual processing of this PROBE check is done within the storage engine, before returning the row to the operator. Without this IN ROW indication, all rows would be returned from the storage engine to the operator and the operator would then check the bitmap to determine whether or not the row is passed.

Earlier in this post I walked through the bitmap creation process with a simplified example using just three stores with less than 30 employees: 17, 38, and 82. This resulted in a bitmap with bits 15 and 29 set to true and all others set to false. Let’s now see how this bitmap is used as some of the rows from the FactOnlineSales table are read. The first row has StoreKey value 29. This value is hashed and the result is 46; the storage engine checks bit 46 in the bitmap and sees that it is false. This row is rejected and not returned to the scan operator; the storage engine immediately moves to the next row. This second row has StoreKey 38. We already know that 38 hashes to 15, and since bit 15 is set this row qualifies. Assuming it also qualifies the second PROBE predicate, it will be returned to the Clustered Index Scan, which then returns it to its parent. This row will now traverse through all the join operators and end up in the aggregation part of the execution plan, as it should.

A PROBE predicate can also produce false positives. Let’s continue our example with the third row, which is for StoreKey 65. There happens to be a hash collision between 38 and 65, so the value 65 also has 15 as its hash result. This row, too, will qualify and be returned (again assuming the second PROBE predicate allows it to pass as well). But this will not result in incorrect results. The row will be joined to some Product data in Hash Match #9, but in Hash Match #4 there will be no row from DimStore that matches StoreKey 65, so the row will still not be returned.

False negatives are not possible. If a FactOnlineSales row has a StoreKey value of 17, 38, or 82, the result of the hash function will always be 15 or 29; these bits are set, so these rows will all qualify. And similarly, if they are for black products then they will also qualify for the other PROBE predicate, so these rows will certainly be returned.

So what the two PROBE predicates in this query effectively do is: use a hash on both the StoreKey and the ProductKey to find a location in the two bitmaps, then check to see if that location is a hash value with at least one qualifying store or product. All qualifying rows will always match this filter. Some non-qualifying rows might also match, due to hash collisions, so the rest of the query still has to ensure that these are removed. (In this case that happens automatically due to the inner joins; sometimes an explicit Filter operator is used for this). But most non-qualifying rows will be removed in the scan.

As you see in the screenshot above, only 1.9 million rows were returned from the Clustered Index Scan, instead of the 12.6 million that would have been returned if the optimizer had chosen not to use bitmaps in this case. (And in case you wonder about false positives: I checked and there were exactly 9 of them for this query – that is how infrequent hash collisions are in the actual hash functions used).

In the screenshot above you may also notice a confusing Number of Rows Read property. The operator is a scan; it really reads all rows from the table. If you run the query with SET STATISTICS IO ON, you will see a number of logical IOs that matches the number of pages in the table. Trust me on this, the Clustered Index Scan operator #15 really reads all 12.6 million rows that are in this table and the reported number is wrong. I have filed a bug report for this.

Wrong estimate?

The wary reader might have noticed in the last screenshot that there is a huge difference between the estimated and the actual number of rows. The estimate is 12.6 million – that is, it is equal to the number of rows in the table. Looking at the execution plan it appears as if the optimizer estimates that every row in the table will match the bitmaps and be passed on to the parent operators. But based on that estimate, why would the optimizer even bother? If that estimate were correct, the bitmaps would not provide any cost saving, while still introducing extra overhead in the execution plan.

But there is more. If you look at Parallelism operator #13, you will see that this operator has a much lower estimated number of rows: just 1.5 million. But there is no predicate on this operator, so how can the optimizer estimate that this operator, which without a predicate would always pass along each row unchanged (just on a different thread), would reduce the number of rows from 12.6 million to 1.5 million?

The answer to these questions has to do with the exact sequence of steps the optimizer takes when creating an execution plan. Most of the decisions are taken during the main optimization phases. But once a plan has been chosen, a final “cleanup” phase runs. And one of the things this cleanup phase does is push PROBE filters down into a scan where this is possible.

So here is what the execution plan would have looked like if the cleanup step had not pushed this probe filter into the clustered index scan (sorry for the switch to the old execution plan artwork, but I spent two hours doctoring up a fake screenshot a few years back and I don’t feel like repeating it).

The Clustered Index Scan reads the entire table and returns all rows, because no filter is pushed down (yet). The estimate matches that (to six digits accuracy). The filter then applies the bitmaps to reduce the rowcount to 1.9 million, with an estimate of 1.5 million, which is within the normal margins of error. The Parallelism operator does not affect the estimated or actual row count at all, as expected.

But the cleanup step does execute. It realizes that the probe in the Filter operator can be pushed into the Clustered Index Scan. And now the Filter itself does nothing so it is removed. And the Microsoft developers who built this didn’t take the extra steps to change the original estimate of 12.6 million rows for the Clustered Index Scan to the 1.5 million rows that were actually estimated to be remaining after the filter. Probably because they figured that the estimated number of rows is only relevant for plan selection, which by this time has already concluded. Maybe someone should tell them that consultants actually look at estimated and actual row count as part of their tuning?

Credit where credit is due: I did not know the above, and I probably never would have found out, until a few years ago Paul White (b|t) explained the process. I do not know if this was in a blog post, in an email, or in any other way. I have tried to find the source (and link to it if possible) but I was unable to. That’s why I decided to reproduce it. If Paul’s explanation of this process is available somewhere, then please indicate this in a comment and I will update this post with a link.

Conclusion

In this post I looked at the execution plan of a query that joins two dimension tables to a fact table, with filters on the dimension tables. Because the filters are not directly on the fact table, normal methods of pushing down predicates work only for the dimension tables. But the fact table is the large table, with millions of rows – this is where you WANT pushdown to happen!

In order to make that possible, SQL Server uses bitmaps. The Bitmap operators that create the bitmaps use hashing to map the values that are needed to a hash value, then set a bit in the bitmap to mark that value as “required”. Now the Clustered Index Scan on the large fact table can use the same hash function on the corresponding columns, check the bit in the bitmap, and thus filter out a huge amount of data very early. There is still the risk of “false positives” due to hash collisions, so the rest of the execution plan still has to implement any filters that the query requires.

There is a cost associated with this pattern. There is some additional startup cost, which is the reason why you will never see this pattern when querying small tables. The Bitmap operators take time, and the bitmap itself uses up memory. Plus, the probe action in the clustered index scan also uses CPU cycles, as the hash function has to be executed for each row. But if sufficient rows are filtered out early, then the cost saving in the rest of the plan, where now far fewer rows are processed, greatly outweighs that cost.

This post illustrates the most common usage of bitmaps (and hence the Bitmap operator) in SQL Server execution plans. There are other situations where bitmaps are used. The PROBE function is often used in the predicate of a scan, but I have seen it on other operators as well, so make sure to find exactly where each bitmap is created and where it is then used if you want to understand what a bitmap is used for in each specific plan.

This post was a small diversion from the normal style of my plansplaining posts, because it focuses mostly on a single operator. Next month I will return to the normal style. I will look at a query that uses the OVER clause to get aggregated data and detailed data in a single result set. Each of the operators used in its execution plan by itself is relatively easy to understand, but their interaction is very complicated.

But let me repeat this: I do not want this series to be about only my interesting plans. Reader input is greatly encouraged! If you have ever run into a pattern in an execution plan that appeared to make no sense until you dug into the gory details, or even into a pattern that still makes no sense to you, let me know and I will consider it for inclusion in a future post.


Plansplaining, part 4. Let’s repartition the streams.

This is the fourth post in the plansplaining series. Each of these blog posts focuses on a sample execution plan that exposes an uncommon and interesting pattern, and details exactly how that plan works.

In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in the pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you did not read part 1 already, I suggest you start there.

In episode #3, I promised that today’s post would be about bitmaps. However, bitmaps work only in parallel plans, and as I was working on that post I noticed that I needed to include a lot of explaining of the parallelism itself. So much, in fact, that just covering the parallelism issues by itself already makes for a lengthy post. So I decided to slightly change the example and focus on only parallelism today. Next month I will really cover those bitmaps!

Hashing explained

A lot of the operators I touch upon in this post use some form of hashing. So here, for those who do not know this already, is a very short (and mostly simplified) explanation of what hash functions and hash tables are.

A hash function is a function that takes any input and converts it to a number in a predefined range, in such a way that the same input always results in the same number. Different inputs will sometimes also result in the same number; this is called a hash collision. Hash functions with a large range are typically optimized to minimize the chance of hash collisions. For hash functions with a small range, hash collisions are unavoidable. These hash functions are typically optimized to result in a balanced distribution.

A hash table is a table where rows are stored, organized by the result of applying a hash function to one or more columns. For each possible result in the range of the hash function, a container for rows (called a hash bucket, or just bucket) is allocated. All rows for which the hash function resolves to 1 are stored in bucket number 1; rows for which the hash function is 2 go to bucket 2, etcetera. Rows with the same values in the input of the hash function will always land in the same bucket. Rows with a different input for the hash function will usually land in different buckets, but may still end up in the same bucket if there is a hash collision.
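
As a toy illustration, the modulo operator can stand in for a hash function with a range of eight buckets (for illustration only; real hash functions are far more sophisticated):

    -- 17 and 65 land in the same bucket: a hash collision.
    SELECT Value,
           Value % 8 AS HashBucket
    FROM   (VALUES (17), (38), (65), (82)) AS v(Value);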

When SQL Server uses hash tables, they are stored in memory. If the amount of data exceeds the allocated memory, tempdb will be used as a temporary holding place for overflow data; this is called spilling and it generates “hash warnings” that are shown on the execution plan and can be caught in Extended Events or Profiler Trace sessions. A detailed discussion of hash table spills is beyond the scope of this post.

Sample query

I normally like to take all my examples from the AdventureWorks sample database. But that database is too small to get parallel execution plans without setting the parallelism threshold to some weird, non-representative low value. So I decided that for this post I am going to use one of Microsoft’s other standard demo databases, the ContosoRetailDW sample database. This will also allow me to add in the bitmap operators in the next episode with only a small change in the query, something I could probably not achieve in AdventureWorks.
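
In general shape, the query was something like this sketch: the same star join that part 5 above adds two filters to (a reconstruction; the actual query may have differed):

    -- Read and join the three tables, then aggregate.
    SELECT   p.BrandName,
             SUM(fos.SalesAmount) AS TotalSales
    FROM     dbo.FactOnlineSales AS fos
    INNER JOIN dbo.DimStore      AS st ON st.StoreKey  = fos.StoreKey
    INNER JOIN dbo.DimProduct    AS p  ON p.ProductKey = fos.ProductKey
    GROUP BY p.BrandName;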

If you run this query in ContosoRetailDW, you will see a rather large execution plan. In this post I will focus on the right-hand side, where the three tables are read and joined. (The left-hand side is where the aggregation happens, using a local-global aggregation pattern that I already blogged about in 2016).

Below is what the relevant part of the execution plan looks like on my laptop. (I added the small numbers in red, so I can easily reference each individual operator; they correspond to the Node ID property of each operator).

Since this query is processing over 12 million rows from the FactOnlineSales table, joining them to the two dimension tables and then doing aggregation on them, you will be happy to see this run as a parallel plan. But are you also happy with the four “Parallelism (Repartition Streams)” operators in this plan segment? Do you know what they do and why they are there?

Here is how Books Online describes what Parallelism (Repartition Streams) does:

The Repartition Streams operator consumes multiple streams and produces multiple streams of records. The record contents and format are not changed.

Ummm, yeah. So basically, this operator swaps rows around between the threads of a parallel plan: rows arrive on one thread, and are kicked onto another. But this does not really explain why this has to be done four times in between just five other operators that are doing the actual work. And considering that the estimated costs indicate that 20% of the total query effort is expended here, you probably want to know if this is really needed.

Step by step

Let’s dive in, try to understand how all the operators in this execution plan cooperate, and see if that explains the Parallelism operators. I’ll start at the top left of the plan section shown above: Hash Match #4. As you probably know, the Hash Match operator runs in multiple phases. The first phase, called “build phase”, runs when the operator is initialized. During the build phase, the operator requests rows from its build input (shown on top in the graphical execution plan) and stores them in memory, in a hash table. However, before rows are even returned to this operator, a lot of things have already happened. So let’s first look at that.

Clustered Index Scan #6

To understand the logic of the build input, I’ll follow the data flow. It starts with Clustered Index Scan #6. By itself, this operator is not really that interesting. It simply reads all rows from the DimStore table and passes them to its parent operator. And, as always, this does not happen spontaneously: execution is pull-based, so the operator only returns a row when asked for one.

What’s different this time is that this operator is running in a parallel section of the execution plan. And the scan operator is “parallelism-aware”, which means that it changes its behavior when running in a parallel section. (Note that most operators are not parallelism-aware; scan operators, as well as seek operators that do range scans, are really the exception.)

To understand the behavior change, let’s first recall what a scan operator does in a serial plan. When called it returns the next row from the page it is currently working on. When it has processed all rows from a page and is requested to return another row, it reaches out to the storage engine to request the next page, from which it can then return a row. The calls to the operator are once per row, but the operator itself reaches out to the storage engine just once per page.

In a parallel plan, there are multiple copies of this same Clustered Index Scan operator, all working at the same time. Each of them channels their requests to the Storage Engine, and because this is a parallel plan these requests now go through the so-called Parallel Page Supplier. The Parallel Page Supplier simply always hands the next available page to the first thread that requests a new page. Putting this together, the calls to the scan operator are still once per row, but now on each of the individual threads. The calls to the storage engine are still once per page, coming from multiple threads in parallel. And the Parallel Page Supplier coordinates between these threads to ensure that every page is eventually handled by exactly one of the threads.

This effectively results in a demand driven page-by-page distribution of rows among threads. Because it is demand driven, it is impossible to predict what rows will go to which thread, so the distribution of rows among threads can be considered as “sort of random”. The screenshot I included shows how many rows were on each thread when I executed the query. But if I run the query again, the number of rows per thread will be different.

Parallelism #5 – shuffling the rows

This random distribution of data over threads is rarely acceptable. The optimizer usually needs to exert more control over how rows are distributed among the available threads. That’s what Parallelism operator #5 in this execution plan is for. As explained above, this operator reads rows from all the input threads and returns them on the “correct” thread.

Which thread is correct is defined in the plan properties. In this case, the Partitioning Type shows that a hash-based distribution is used, and the Partition Columns property shows that the hash function uses the StoreKey column from the DimStore table as its input. In most cases where an execution plan uses hashing it is impossible to find out any specifics about the hash function used. The Parallelism operator is an exception to that, though. Since the result of the hash function determines what thread the row goes on, and since this query runs on eight threads, this operator has to use a hash function with a range of eight values. On your system, the hash function’s range will be determined by your degree of parallelism.

This operator changes the distribution of rows from (semi-)random to deterministic. No matter how often I run the query on my laptop, the distribution of rows among the eight threads will always be the same. That by itself though is not sufficient reason to introduce this operator. After all, why would I care which of my CPUs processes each individual row, as long as the end result is the same? I have seen situations where a Parallelism (Repartition Streams) operator was added for the sole purpose of flattening out a very skewed distribution of rows between threads, but that is not the case here.

Hash Match #4

To recap the explanation so far: Hash Match #4, during its initialization, repeatedly requests rows from its build input. This causes the Clustered Index Scan to read rows from the DimStore table. These rows are initially assigned to threads in a demand-driven way, but Parallelism #5 then forces each row to a specific thread, determined by hashing the StoreKey value, for a reason that should become clear later.

The Hash Match operator itself stores all the rows it receives from its build input in memory, in a hash table. The Hash Keys Build property shows that a hash function on the DimStore.StoreKey column is used for this. When the build phase is completed, all stores retrieved from the build input are stored in the hash table, but no rows have been returned yet.

It is important to note that, even though Hash Match #4 and Parallelism #5 both use a hash function on the same column, it is not the same hash function. The Hash Match operator prefers a hash function with a rather large range, in order to minimize the chance of hash collisions. The hash function used in Parallelism #5, with a range of just eight values, would never be chosen for a Hash Match.

The second phase, called the “probe phase”, starts when the Hash Match operator receives its first GetNext call, typically immediately after the Initialize call returns. During the probe phase, the operator requests rows from its probe input (the bottom input), then applies the same hash function as in its build phase, but now to the value of FactOnlineSales.StoreKey (as seen in the Hash Keys Probe property). It then goes to the bucket identified by the hash and looks for matching rows in that bucket only. Matching rows are returned; when no more matches are found, the next row from the probe input is tried. (And as always, control returns after returning a row, and resumes where it left off on the next call.)

The same StoreKey value, regardless of which table it is coming from, always results in the same hash value. So the algorithm only has to search for matches in that single bucket. That means it has to look at a few rows only – the “return on investment” for the (rather expensive) creation of the hash table. In most cases the Hash Match has to do an additional test on the join predicate, found in the “Probe Residual” property, to prevent bad matches being returned due to hash collisions. In this specific case there is no Probe Residual because, for integer columns such as StoreKey, the optimizer can find hash functions that are 100% collision free.

However, the description above is for a Hash Match operator in a serial plan. In this case the plan is running in parallel. Does this change how the Hash Match operator works? The answer to that question is “no”. Like most other operators, Hash Match is not “parallelism aware”. So it does not care, or even know, that there are seven other copies of this operator, on the other seven threads, doing exactly the same work. Each of the eight instances of this operator simply creates its own hash table, fills it with the data it receives from its own build input, then tries to find matches in it for the rows it receives from its own probe input. And it never sees any of the data handled by the other threads.

At this point you might be wondering how this can return correct results. Let’s say that a row from the FactOnlineSales table with StoreKey 185 is being processed on thread 3. The Hash Match operator that is running on thread 3 will read this row during its probe phase, but it will only find a match if the corresponding row from DimStore was also processed on thread 3. If that row happens to be processed on thread 5, then the Hash Match operator on thread 3 is blissfully unaware of its existence and would reject the probe input as not matching.

Since there are eight threads available, it appears that each match now only has a 12.5% chance of being detected. Obviously, since there are no headlines in the media about hugely incorrect results being produced by SQL Server, the optimizer has found a way to increase this chance to the 100% it should be. We will see how when we look at one more operator: the Parallelism operator on the probe input.

Parallelism #7 – more shuffling

It is at this point not relevant how exactly the input for Parallelism operator #7 is produced. I’ll just start at Hash Match #8. Like most of the plan, this operator runs in parallel. And apparently the optimizer is unhappy with how the rows are distributed over the threads, so it has inserted the Parallelism operator to repartition the streams as it sees fit.

As seen in the properties, this Parallelism operator also uses hashing to determine the proper thread for each row, just as its brother #5. And this hashing function also uses a StoreKey as its input, but in this case it is the StoreKey from the FactOnlineSales table.

What is important to know (and you really need to know this, as this is not exposed in any way in the execution plan) is that this Parallelism operator uses the exact same hashing function as its brother, Parallelism #5. This fact is key to understanding what is happening here.

Going back to the example above, I used an example row from FactOnlineSales that has StoreKey 185 and that “happens to be” processed on thread 3. But it is now clear that this thread assignment is not just coincidence. When Hash Match #4 asks for a row from its probe input, that row is first reassigned to a thread (in reality Parallelism operators are more complicated than this suggests, but that is out of scope for this post), and this thread is determined by hashing the StoreKey value. This means that all rows from FactOnlineSales that have StoreKey 185 are guaranteed to go to thread 3. And, even more relevant: the row from DimStore that has StoreKey 185 was also shuffled to another thread, in Parallelism #5. But that operator used the same hash function, which we already know results in 3 when the input is 185, so this row from DimStore is on thread 3 as well.

Back to Hash Match #4

In other words: two parallelism operators (#5 and #7) both shuffle rows between threads. But they don’t shuffle at random. For rows from DimStore, the thread is determined, using some complex function, by their StoreKey. And for rows from FactOnlineSales (or rather, from the results of joining FactOnlineSales and DimProduct), the thread is determined by applying the exact same function to FactOnlineSales.StoreKey. As a result, it is 100% sure that all rows that can match based on the join condition between FactOnlineSales and DimStore are allocated to the same thread.

The Hash Match operator can just do its own thing, in its own world, without having to worry about other instances on other threads. The optimizer has guaranteed, by inserting some additional operators, that Hash Match will always find the rows it needs on its own thread.

The rest of the plan

In the explanation above I have skipped past everything that happens in Hash Match #8 and the operators it calls. But you should now see that a very similar pattern is used here. The build input of Hash Match #8 reads rows from DimProduct, then uses a hash function on ProductKey to assign them to a known thread. On that thread, they are added to the hash table by Hash Match #8. The probe input then reads FactOnlineSales and reshuffles those rows based on their ProductKey, to ensure that each row is on the same thread where a matching DimProduct row would be. The Hash Match can then find, join, and return the matching rows.

The results of Hash Match #8 will still be on the thread that is determined by hashing ProductKey (from either table, as they are the same in the joined results). There is no known relationship between ProductKey and StoreKey, so it is unlikely that this thread is the correct thread for Hash Match #4. That’s why Parallelism #7 is needed.

Conclusion

As should be clear now, all parallelism operators are actually needed. Parallelism operators #9 and #11 are needed because both their inputs are distributed semi-randomly, and Hash Match #8 needs its inputs to be distributed based on ProductKey. Parallelism operators #5 and #7 are needed because one of them has a semi-random input distribution and the other has an input distributed based on ProductKey, but Hash Match #4 really needs both inputs to be distributed based on StoreKey. Leave out just one of these parallelism operators, and the execution plan would produce incorrect results.

The parallelism operators do contribute to the total cost of the execution plan. But that is not unusual in parallel plans. Coordinating the work always takes extra effort. The total CPU usage of a parallel execution plan is almost always higher than the total CPU cost of an equivalent serial plan. But because more threads work on it concurrently, the elapsed time goes down and the query returns sooner. The query shown in this post uses over 16.4 seconds of CPU time on my laptop but completes in less than 4 seconds. When I force it to go serial, the CPU usage goes down to 9.5 seconds, but the execution time is now also 9.5 seconds.
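If you want to reproduce this comparison on your own system, a simple way (a sketch, reusing the hypothetical query from earlier in this post) is to run the statement twice with SET STATISTICS TIME enabled, once unhinted and once forced serial:

    SET STATISTICS TIME ON;

    -- Parallel version (assuming the plan qualifies for parallelism)
    SELECT     dp.BrandName, SUM(fos.SalesAmount) AS TotalSales
    FROM       dbo.FactOnlineSales AS fos
    INNER JOIN dbo.DimStore        AS ds ON ds.StoreKey   = fos.StoreKey
    INNER JOIN dbo.DimProduct      AS dp ON dp.ProductKey = fos.ProductKey
    GROUP BY   dp.BrandName;

    -- Same query, forced to run on a single thread
    SELECT     dp.BrandName, SUM(fos.SalesAmount) AS TotalSales
    FROM       dbo.FactOnlineSales AS fos
    INNER JOIN dbo.DimStore        AS ds ON ds.StoreKey   = fos.StoreKey
    INNER JOIN dbo.DimProduct      AS dp ON dp.ProductKey = fos.ProductKey
    GROUP BY   dp.BrandName
    OPTION     (MAXDOP 1);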

The overhead of parallelism is not always as explicitly visible as here. But it does always exist and you should be aware of it and make a conscious choice. Is this query important enough that having it finish in 4 instead of 9.5 seconds is worth the extra resource usage (that will hurt all other concurrent workloads)? Or do you value overall performance more, and decide to take the hit on this query? The correct action depends on who executes this query, how often, for what reason, at what time, and what else is running at that time. This is the part of our job where we really need to understand the business!

As promised in the introduction, plansplaining part five will finally look at the Bitmap operator that I already promised for this episode. However, I would still like to see your input too! If you have ever run into a pattern in an execution plan that appeared to make no sense until you dug into the gory details, or even into a pattern that still makes no sense to you, let me know and I will consider it for inclusion in a future post. (When possible, please make sure that you can provide either a repro script or a copy of the actual execution plan, saved as a .sqlplan file).


Plansplaining, part 3. How repeating work saves time

This is the third post in the plansplaining series. Each of these blog posts focuses on a sample execution plan that exposes an uncommon and interesting pattern, and details exactly how that plan works.

In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you have not read part 1 yet, I suggest you start there.

Sample query

Today I will look at a pattern that can be seen when working with partitioned tables. I will not use a standard demo database but create my own set of sample tables to get maximum control. Here is the code to create three partitioned tables and load them with some data. Execute this code in your own “playground” database if you want to follow along.
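The original script is not reproduced here. As a minimal sketch that matches the description in this post (three tables of 20,000 rows, four partitions of roughly 5,000 rows each, two partition functions with identical boundary values but opposite RANGE directions; all names, data types, and boundary values below are my assumptions), the setup could look like this:

    -- Two partition functions: same boundaries, different RANGE direction
    CREATE PARTITION FUNCTION PF_Left  (int) AS RANGE LEFT  FOR VALUES (5000, 10000, 15000);
    CREATE PARTITION FUNCTION PF_Right (int) AS RANGE RIGHT FOR VALUES (5000, 10000, 15000);
    CREATE PARTITION SCHEME PS_Left  AS PARTITION PF_Left  ALL TO ([PRIMARY]);
    CREATE PARTITION SCHEME PS_Right AS PARTITION PF_Right ALL TO ([PRIMARY]);

    -- Part_1a and Part_1b share a partition function; Part_2 uses the other
    CREATE TABLE dbo.Part_1a (a int NOT NULL, b int NOT NULL) ON PS_Left  (a);
    CREATE TABLE dbo.Part_1b (a int NOT NULL, b int NOT NULL) ON PS_Left  (a);
    CREATE TABLE dbo.Part_2  (a int NOT NULL, b int NOT NULL) ON PS_Right (a);
    CREATE CLUSTERED INDEX cix_Part_1a ON dbo.Part_1a (a);
    CREATE CLUSTERED INDEX cix_Part_1b ON dbo.Part_1b (a);
    CREATE CLUSTERED INDEX cix_Part_2  ON dbo.Part_2  (a);

    -- Load 20,000 rows per table, one row at a time
    -- (three inserts times 20,000 iterations: the 60,000 plans mentioned below)
    SET NOCOUNT ON;
    DECLARE @i int = 1;
    WHILE @i <= 20000
    BEGIN
        INSERT dbo.Part_1a (a, b) VALUES (@i, @i % 17);
        INSERT dbo.Part_1b (a, b) VALUES (@i, @i % 23);
        INSERT dbo.Part_2  (a, b) VALUES (@i, @i % 31);
        SET @i += 1;
    END;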

This may take some time to run (about 2:45 minutes on my laptop). Since it’s a one-time deal, I did not bother to optimize this. Do make sure to run this with the Include Actual Execution Plan option disabled, or you will generate 60,000 execution plans (which will probably crash your SSMS instance).

Once the tables are created and the data is loaded, let’s run a simple query that joins two of the demo tables:
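The query itself is not included here. Based on the description below (an equi-join on column a, and an aggregation that asks for a MAX only), it was shaped roughly like this sketch; the inequality on b is purely my assumption, to give the join predicate a non-equality part in addition to the equality on a:

    SELECT     MAX(t2.b)
    FROM       dbo.Part_1a AS t1
    INNER JOIN dbo.Part_2  AS t2
            ON t2.a = t1.a
           AND t2.b > t1.b;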

The execution plan for this query is very simple and holds no surprises. There is no WHERE clause and the tables are fairly large, so a Nested Loops join would be a bad idea. A Merge Join would be an option, but the equality part of the predicate is on column a only, which is not unique in either table; so this would be a many-to-many Merge Join, with all the associated overhead in tempdb. A Hash Match join is a better choice in this case, and that is indeed what the optimizer picks:

Now let’s see what happens if we make one small change to the query. Instead of joining tables Part_1a and Part_2 (both partitioned but on different partition functions), we’ll now join Part_1a and Part_1b (both partitioned and using the same partition function). Here is the changed query:

 

With basically the exact same query you would probably expect to see the exact same execution plan. But that would not make for a good Plansplaining post, so fortunately we do actually get a different execution plan this time:

Comparing this execution plan to the one above, we see that the exact same pattern of a Hash Match join processing two Clustered Index Scans is used here. But in this case that pattern is on the inner side of a Nested Loops join, which means it executes once for each row in the outer input (which is a Constant Scan operator that we’ll examine in more detail shortly). The properties of the Hash Match operator show us exactly how often this section of the plan was executed:

Upon seeing this, you would be forgiven for thinking that the optimizer has gone absolutely bonkers. Why is the optimizer creating a plan that, by the looks of it, appears to do the same work four times? Granted, since the aggregation asks for a MAX only there will be no concerns over correctness of results in this case. But repeating the same task four times? Really? Why?

The good news is that there actually is a good reason for this. And that the execution plan is not actually doing the same work four times. However, this is not visible by looking at a picture of the graphical execution plan; we need to dig deep into the details to understand what’s going on.

The devil is in the details

Before diving into the apparently repeated section of the plan, let’s first examine the Nested Loops join operator and the Constant Scan on its outer input. The Constant Scan operator is by itself actually already a mystery here. Looking at the query, there is no obvious reason at all to have a Constant Scan in this plan, and yet here it is.

Constant Scan

Whenever we see a non-obvious Constant Scan operator, our first stop should always be its properties, specifically the Output List and Values properties. These expose the data that the Constant Scan returns. Though these properties are exposed in the property popup, the formatting of the Values property makes it hard to parse. The full properties window has the same data in an easier to understand format:

The Output List shows that this Constant Scan will return rows with just a single column, called Expr1005. Note this column name, because we need to find which operators use this column in order to understand the role of this Constant Scan. The Values property shows that four rows will be returned (not surprising, since we already saw that the inner input of the Nested Loops operator executes four times), and that the value of Expr1005 will be 1 for the first row, 2 for the second, then 3, and finally 4. None of these values appear in the query, so they make no sense yet. We really need to find where these values are used!

Nested Loops

Looking at the Nested Loops operator, we see a few interesting things. We know that it receives four single-column rows from its outer input, and then executes its inner input for each of these rows. Let’s look at its properties, more specifically at the Output List and Outer References properties:

Let’s start at the bottom, with the Outer References property. We have already seen this property in previous plansplaining posts, but as a quick reminder: Outer References means that a value from the outer input is pushed into the inner input; the inner input then ensures that only matching rows are returned which is why there is no Predicate property on this join operator. In this case Expr1005, the column that is set 1, 2, 3, and 4 for each of the rows from the Constant Scan, is pushed into the inner input.

The Output List does not include Expr1005. Apparently, the Nested Loops operator in this case doesn’t actually join its two input sources; it merely uses the outer input to drive four executions of the inner input, and then returns data from the inner input only. The values returned from the Constant Scan are not returned by the Nested Loops operator, which also means that we now know that the Stream Aggregate doesn’t do anything different than in the first plan – it doesn’t receive Expr1005, so it cannot act upon it.

Clustered Index Scan

The next part of the execution plan to look at is the section with a Hash Match and two Clustered Index Scan operators. I am not discussing the Hash Match here, because there is nothing of interest on this operator. However, both of the Clustered Index Scan operators have a property that most people would not expect to see, ever, on a Clustered Index Scan (or, in fact, any Index Scan) operator: a Seek Predicates property! (I show a screenshot of the second Clustered Index Scan here, but they both have a very similar Seek Predicates property).

Normally, a Seek Predicates property is only found on (Clustered) Index Seek operators, and not on a (Clustered) Index Scan. An Index Scan is not designed to seek; it is designed to read a whole table. So how can this scan suddenly apply a Seek Predicates property? And what exactly is the predicate here? A column called PtnId1002, which we never created in any of our tables, is compared to the Expr1005 column, the data coming from the Constant Scan and pushed into the inner input of the Nested Loops.

Luckily the optimizer loves to use mnemonic codes when introducing its own columns. The name PtnId1002 is a so-called “partition id”. Remember, all the tables used in this example are partitioned tables. And partitioned tables happen to be the only (as far as I know) context where you can actually see a Seek Predicates property on a scan operator. It is used to limit the scan to one or more selected partitions only. In this case a single partition. Which partition? Well, that is determined by Expr1005.

Remember, Expr1005 is set to 1 for the first row (and hence the first execution of this section of the execution plan), then to 2 for the second, and so on. So for the first execution, the Clustered Index Scan operators will read data from partition 1 only. This applies to both Clustered Index Scan operators. The Hash Match then combines this data, returning the joined results for partition 1, which Nested Loops then passes to Stream Aggregate. After that, the second row from Constant Scan is read, and the inner loop restarts, this time reading and joining data from the second partition of both tables. Once all four partitions are processed in this way, Stream Aggregate returns the maximum value it found and execution stops.

Helicopter view

Moving back to the helicopter view, we now know that the section consisting of one Hash Match and two Clustered Index Scans only appears to be the same in the two plans. In reality, the first plan actually processed the entire tables in the Clustered Index Scan operators, causing a massive join between 20,000 rows from each input. The second plan used a Constant Scan to enumerate the partitions, a Nested Loops to repeat the section for each partition, and a surprising Seek Predicates property on each of the Clustered Index Scan operators to process only a single partition in each execution. So while the logic is indeed executed four times, each execution now only had to join 5,000 rows from each input. And though the scans execute four times, they do not scan the full table four times. Each execution scans only the requested partition. In the end, the amount of IO for the Clustered Index Scan operators in this plan is exactly the same as in the first plan.

This is an optimization pattern known as “join collocation”. The most obvious place where you can see the effect of this optimization is in the memory grant (visible as a property of the SELECT operator in each execution plan). The first query requests 6112 KB, needed for the Hash Match operator to store 20,000 rows in the in-memory hash table. Because the second query processes the same 20,000 rows in four independent chunks of 5,000 rows each, its memory grant is a mere 1536 KB. This memory will be reused for each of the four executions.

On my laptop, this is not very relevant. But imagine a busy system, with lots of users, plus a buffer pool, a plan cache, and other resources, all competing over the same precious memory. Now a 75% reduction in the memory footprint of a query suddenly becomes very valuable!

Why not both?

At this point you may be wondering why SQL Server doesn’t always apply this pattern. Why did we not get a similar “join collocation” plan for the first query? Or for every join ever done between two large tables?

The answer is simple. Join collocation needs a few conditions to be met before it can be used. The conditions are:

  1. Both tables in the join need to be partitioned.
  2. The join condition has to include equality of the partition columns in the two tables.
  3. Both tables in the join need to use the same partitioning function. Or rather, the partitions need to align exactly (if you use two different partition functions with the exact same definition, you will still get join collocation).

The combination of these three requirements enables the join collocation pattern. Because the tables are partitioned, the (Clustered) Index Scan operators can use a Seek Predicates property to return only a subset of the data without overhead. Because the partitions align and the join is on equality of the partition columns, we know that rows that need to be joined from both sources are processed in the same execution of the inner input.

In the case of the first query, the partitions did not align completely: they use the same boundary values, but one uses RANGE LEFT and the other RANGE RIGHT. That makes it possible for a row in partition 1 of the first table to need a match with a row in partition 2 of the second table; those two rows would not be processed during the same execution, so the results could be incorrect.
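You can see this misalignment for yourself with the $PARTITION function. Using the hypothetical partition function names and boundary values from the setup sketch earlier, a boundary value lands in a different partition depending on the RANGE direction:

    -- Both functions use boundaries (5000, 10000, 15000)
    SELECT $PARTITION.PF_Left(5000)  AS LeftPartition,   -- returns 1: RANGE LEFT keeps the boundary value in the lower partition
           $PARTITION.PF_Right(5000) AS RightPartition;  -- returns 2: RANGE RIGHT moves it to the upper partition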

Conclusion

Details are always important. Just because two execution plan fragments look the same in the graphical execution plan, does not mean they are the same. You always need to look at all the details that are available in the property sheet.

In this post, it appeared at first sight as if the same logic was being executed multiple times. But by looking at the properties we were able to ascertain that these four executions each worked on a quarter of the data. We were also able, by looking at the properties of a Constant Scan and following the generated column through the execution plan, to make sense of an operator that at first appeared to make no sense at all.

In plansplaining part four, planned for April, we will look at an execution plan that uses Bitmap operators to optimize a star join. However, I do not want this series to be about only my interesting plans. Reader input is greatly encouraged! If you have ever run into a pattern in an execution plan that appeared to make no sense until you dug into the gory details, or even into a pattern that still makes no sense to you, let me know and I will consider it for inclusion in a future post. (Note that I WILL ask you for a copy of the actual execution plan, saved as a .sqlplan file, or for a repro script).


Plansplaining, part 2. Why scan and spool instead of seek?

This is the second post in the plansplaining series. Each of these blog posts focuses on a sample execution plan and details exactly how that plan works.

In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you have not read part 1 yet, I suggest you start there.

Sample query

The query in this post is again based on the standard AdventureWorks sample database. But it needs an additional index, so let’s create that first:
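The definition is not reproduced here, but from the plans that follow (an ordered scan on TerritoryID that also covers TaxAmt), it must have looked roughly like this sketch; the index name is my invention:

    CREATE NONCLUSTERED INDEX ix_SalesOrderHeader_TerritoryID
    ON Sales.SalesOrderHeader (TerritoryID)
    INCLUDE (TaxAmt);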

With the index in place, let’s now look at the sample query:
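The query is also not reproduced; based on the plan discussed below (every salesperson, plus the total TaxAmt of all orders in their sales territory), it must have been close to this sketch:

    SELECT sp.BusinessEntityID,
           (SELECT SUM(soh.TaxAmt)
            FROM   Sales.SalesOrderHeader AS soh
            WHERE  soh.TerritoryID = sp.TerritoryID) AS TotalTax
    FROM   Sales.SalesPerson AS sp;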

The index appears to be custom tailored for this query. Nobody would ever blame you if you expected an execution plan that reads the SalesPerson table, then uses a Nested Loops join into an Index Seek on the new index, plus of course some aggregation somewhere. However, you would be wrong. In reality, the execution plan is as shown below:

The plan does indeed read all rows from SalesPerson, using a Clustered Index Scan. It also does feed those rows into a Nested Loops join. But the inner (lower) input of the Nested Loops uses an Index Scan instead of an Index Seek, and adds a Table Spool. Why did the optimizer choose this plan? And perhaps even more important, how does it even give us the data we need?

Let’s dive in

Let’s dive in, and let’s focus directly on the part that matters: the inner input of the Nested Loops join. After Nested Loops receives the first row from Clustered Index Scan, it initializes the lower branch and requests the first row.

The first iteration

The Index Scan on the right-hand side of that input has its Ordered property set to True, so in this case we know for sure that data will be returned ordered by TerritoryID: first all rows with TerritoryID 1, then those with 2, etc. But of course, like all operators, it only returns data when requested to do so. In this case it is a Stream Aggregate that calls the Index Scan. The Stream Aggregate will continue to call Index Scan until the value in its Group By column (TerritoryID) changes. So in this case, it will read all rows with TerritoryID 1, then receive the first row with TerritoryID 2, and then stop calling Index Scan and return a single row with TerritoryID 1 and a second column (Expr1002) that holds the sum of all TaxAmt values. (This can be seen from the Defined Values property.)

Table Spool is working as a Lazy Spool. Upon receiving this first row, it stores it in a worktable, and then immediately returns it to Nested Loops. This operator will verify its predicate to test if this row matches the row from the SalesPerson table. If it doesn’t, then it calls the Table Spool again, requesting the next row. If it does, it returns the combined results and waits until called again, and then requests the next row from Table Spool. So either way, Table Spool is requested for a second row.

When Table Spool regains control, it will once more pass on the request for a row to Stream Aggregate. This operator already received a single row with TerritoryID 2. It now repeatedly invokes the Index Scan to get the other rows for this territory, until it sees the next new value (3). Stream Aggregate then returns TerritoryID 2 and its SUM(TaxAmt) to Table Spool, which adds this row to the worktable and then returns it. The worktable now has two rows stored, for the first two sales territories.

The above process repeats until all rows from SalesOrderHeader are read. There are only ten distinct TerritoryID values in this table, so Table Spool returns ten rows (after storing them in the worktable). Upon the eleventh call, Table Spool returns “end of data”. Nested Loops has Left Outer Join as its logical operation, so it checks whether at least one match for the row from SalesPerson was found. If not, it now returns that row with a NULL value for Expr1002 and waits to regain control; otherwise it will immediately move on.

The second iteration

When the first iteration of the inner (lower) input is done, Nested Loops returns to the outer (upper) input and reads the second sales person. It then reinitializes the inner input and again calls the Table Spool. However, this second initialization is a rewind. When a Nested Loops operator does not have an Outer References property (as in this case), then by definition the first execution of the inner input is a rebind and all other executions are rewinds.

Table Spool is one of a select group of operators that actually care about the difference between rebinds and rewinds. The description above, for the first execution, is for a rebind. For a rewind, the Table Spool does not call its child operators. Instead, it uses the worktable created previously to return the same set of rows as the previous time.

So for the second salesperson, Table Spool is called, looks at the worktable, and returns the first row stored (the row with TerritoryID 1 and its total TaxAmt value). Then when Nested Loops calls it again it reads and returns the second row from the worktable, and so on until all rows are returned and it once more responds with “end of data”.

Bringing it together

There are a total of seventeen rows in the SalesPerson table. For each of those rows the inner (lower) input is executed. The first execution is a rebind, so the Index Scan and Stream Aggregate actually run. All of the other executions are rewinds; for these the child operators do not run, because Table Spool returns the data from the worktable.

Do not be misled by first appearances. Normally the lower input of a Nested Loops executes multiple times. Normally you do not want to see a scan there, especially not on a table as large as SalesOrderHeader. But in this case, first looks deceive. The only operator that actually executes multiple times is the Table Spool, which uses a worktable to store the output for later reuse, shielding its child operators from having to execute multiple times.

Much of this is directly visible from the execution plan. On the Table Spool operator, you can check the Actual Rebinds and Actual Rewinds to get confirmation that there was 1 rebind and 16 rewinds. The Number of Executions on the Table Spool is 17 (as expected, to match the 17 rows read into the Nested Loops operator), but Stream Aggregate and Index Scan both have their Number of Executions set to 1: they executed for the single rebind, but were not used for the rewinds.

But what if I want a seek?

The mystery is solved. The execution plan is dissected, and I now know exactly how the operators in this plan cooperate to produce the intended results. But I still have a feeling that my original idea, with an index seek in the lower input to a Nested Loops join, would also have been a valid idea. To help me understand why the optimizer disagrees, I can do something that I would not do in production: add a hint to force my ideas upon the optimizer.

I added a FORCESEEK hint to ensure that the optimizer has no other option but to give me an execution plan that uses an Index Seek on the SalesOrderHeader table. And yet, the execution plan is not what I was expecting to see. For sure, the Index Seek is there. But there still is also a spool operator. This time it’s not a Table Spool but an Index Spool. But it’s still there!
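Applied to the query sketch from the start of this post, the hint goes on the table reference inside the subquery:

    SELECT sp.BusinessEntityID,
           (SELECT SUM(soh.TaxAmt)
            FROM   Sales.SalesOrderHeader AS soh WITH (FORCESEEK)
            WHERE  soh.TerritoryID = sp.TerritoryID) AS TotalTax
    FROM   Sales.SalesPerson AS sp;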

So let’s once more dive into the execution plan and see what happens during execution!

The first iteration

One of the differences between the first plan and the current version is that the Nested Loops operator has replaced its Predicate property with an Outer References property. Just as in the execution plan I looked at in part 1, the Outer References column from the top input (in this case SalesTerritoryID) is used in (“pushed into”) the bottom input so that it only returns matching rows.

We can see this in the Index Seek. The new index is used to retrieve only the rows we need for the current sales territory. Because of that, Stream Aggregate has lost its Group By property; it now gets only rows for a single sales territory on each run so it can do global aggregation and just return a single row with the sum of TaxAmt in all rows it received.

The final part of the bottom input is the spool, which has changed from a Table Spool to an Index Spool. An Index Spool is very similar to a Table Spool, with one extra feature. A Table Spool just stores data in a worktable, which has a clustered index on a hidden column that represents the insertion order (used to ensure that rows are returned in the same order when they are fetched from the worktable on a rewind). An Index Spool adds an additional (nonclustered) index on top of the worktable.

Finding the key of that extra index is not trivial. The property that holds this information is not shown in the graphical execution plan. The only options are to either infer this information from the Seek Predicate property of the Index Spool, or to open the XML representation of the execution plan and look for the RangeColumns node within the SeekPredicateNew element. In most cases, the first method works just fine. In our current example, the worktable will be indexed on TerritoryID.

During the first iteration, the Index Spool receives, stores (and indexes), and returns just a single row, with the total TaxAmt for the sales territory of the first sales person.

The second iteration

Since this plan uses Outer References on the Nested Loops operator, subsequent executions of the lower input can be rebinds or rewinds. They are a rewind when the next salesperson happens to have the same sales territory as the previous one; otherwise they are a rebind.

At this point we need to look at a few more differences between Table Spool and Index Spool. For starters, a Table Spool clears out its worktable and starts building from scratch on a rebind; an Index Spool doesn’t clear the worktable but simply adds the new rows to the data already there. Second, a rewind causes a Table Spool to return all rows stored in the worktable; an Index Spool will only return rows that match its Seek Predicate property (for which it uses the additional nonclustered index), and these are the rows that were added during the last rebind. So in effect, both operators return the same rows as the previous time on a rewind; Table Spool does this by throwing out everything it had before each rebind, and Index Spool does this by using an index to read only the most recently added data.

So long story short, every execution of the lower branch will either use the Index Seek to find only matching rows, total up the tax in Stream Aggregate, and then store and return it; or use the index on the worktable to find the previously computed total tax for a sales territory.

The missing piece

One thing you should never do when working with execution plans is jumping to conclusions. The conclusion above is correct, but incomplete, and that can be seen by looking at all the available information in the execution plan.

If you look at the properties of Index Spool, you see that Actual Rebinds is 16, and Actual Rewinds is 1. Looking at the returned query results, that makes sense. The data is not ordered by sales territory; by pure coincidence there happens to be one set of two consecutive rows with the same sales territory, which causes the single rewind.

Based on 1 rewind and 16 rebinds, the expected number of executions for the Stream Aggregate and Index Seek would be 16, right? And yet, the execution plan shows the Number of Executions for these two operators to be 11. Where are the five missing executions?

The answer is in one last difference between Index Spool and Table Spool (or rather, between Index Spool and all other operators that observe rewinds and rebinds). You see, when Index Spool is started as a rebind, it does faithfully increment the rebind counter, but then it checks the data available in its indexed worktable. If the data it needs is already there, it will ignore the instruction to rebind and instead use the nonclustered index to return the correct subset of rows.

In other words, Index Spool counts a rewind when the immediately preceding execution used the same values in the Outer References columns, but acts as a rewind when any of the preceding executions already used the same values. There are eleven distinct values for TerritoryID in the SalesPerson table (1 to 10, plus NULL). For each of these, the Stream Aggregate and Index Seek only execute the first time the value is seen; when the same value is found again, no matter how many rows were in between, the worktable is used to pull the total tax amount.

Do we really need that spool?

It is now clear what the function of the Index Spool is. Without it, the Index Seek would execute 17 times, and 6 of those executions would be used to retrieve data that was already retrieved before. By storing only the single aggregated row in a worktable, a lot of I/O on the SalesOrderHeader can be avoided.

If you don’t believe me, you can easily check this. In SQL Server 2016, a new query hint was added: NO_PERFORMANCE_SPOOL. This prevents the optimizer from adding a spool operator in queries to help performance. (It can still use spools for other reasons).

With this hint added, we finally get the execution plan that I described at the start of this post: a simple Nested Loops into an Index Seek with aggregation:

Because the spool is removed, Stream Aggregate and Index Seek now execute once for each of the 17 rows coming from SalesPerson. Some of the data from SalesOrderHeader is read more than once. If you execute SET STATISTICS IO ON; before running this query, you will see 179 logical reads on this table, as opposed to 115 for the previous query. On the other hand, the 107 logical reads on the worktable for the previous query are now gone.

Faster or slower?

The total number of logical reads for this third query is lower than for the second. But not all logical IOs are created equal. The optimizer appears to consider I/O on a worktable less expensive than I/O on a normal table. Probably because it expects worktable I/O in tempdb to stay in the buffer pool, whereas I/O on a permanent table has a higher chance of being physical I/O. Also, I have a strong suspicion that the optimizer focuses on the IO cost of reading data but underestimates the cost of writing to a worktable. (This suspicion is based on other plans I looked at, beyond the scope of this post.)

The actual performance difference does not depend on assumptions in the optimizer’s costing model, but on differences in your setup. Is the data currently in the buffer pool or not? How much memory pressure is there on your server? How many disk files are used for tempdb and for the user database, and on what type of storage? Etc.

One way I like to use to measure performance of short-running queries like this one is to enable the “Discard results after execution” option in the SSMS Query Options, make sure that the statistics io and statistics time settings as well as the Include Actual Execution Plan option are disabled, enable “Include Client Statistics”, and type GO 1000 (or another suitably high number) after the query to make it run a thousand times. I can then look at the Total execution time in the Client Statistics pane to see how much time it took. (Obviously, this method measures only elapsed time. If your server has severe bottlenecks for e.g. IO or CPU usage, then elapsed time is not the metric you should tune for).
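As a sketch of that measurement setup, with the hinted variant from above as the test subject:

    -- In Query Options: enable "Discard results after execution" and
    -- "Include Client Statistics"; disable actual plans and statistics io/time.
    SELECT sp.BusinessEntityID,
           (SELECT SUM(soh.TaxAmt)
            FROM   Sales.SalesOrderHeader AS soh WITH (FORCESEEK)
            WHERE  soh.TerritoryID = sp.TerritoryID) AS TotalTax
    FROM   Sales.SalesPerson AS sp;
    GO 1000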

On my laptop, the query with Index Seek and Index Spool takes 3.2 milliseconds on average; the version without Index Spool takes 5.0 milliseconds.

Why scan instead of seek?

We now know why the optimizer adds an Index Spool when we force it to seek the index. But one question remains: why do we even have to use this hint? Why does the optimizer prefer to use an Index Scan?

A part of the answer is in the numbers. In the first plan, the Index Scan executes once, reading the entire table. In the second plan, the Index Seek reads only a subset of the table in each execution, but it executes 11 times and all executions combined still read all the rows. This is in fact a bit more expensive, as each seek starts by traversing the root and intermediate pages. You can check this if you run both queries with SET STATISTICS IO ON enabled. The first query needs 88 logical reads on SalesOrderHeader; the second query requires 115 logical reads.

The Table Spool in the first query builds a single table with 10 rows. The Index Spool also builds a single table (in this case with 11 rows), but with an additional index; that should add some cost. This, too, is visible in the statistics io: 53 logical reads on the worktable for the first query versus 107 on the second.

Finally, the Index Spool returns just a single row to Nested Loops for each iteration, whereas Table Spool stubbornly returns the full set of 10 rows every time. With this size of table, that should not affect the logical IOs (the rows are all on a single page). There will be a bit of extra CPU usage but that will be minimal in comparison to the IO cost. The same goes for the different CPU requirement for Stream Aggregate of 11 global aggregations versus a single aggregation with Group By.

So here we see that the second query uses more IOs on the SalesOrderHeader table, and has a more expensive spool type. There are definitely situations where an Index Spool is great. Its ability to retrieve “older” result sets can save tons of processing, and its ability to use a nonclustered index to return only matching rows can also be more efficient than a simple Table Spool that returns all rows. But not in this case.

However, when I use the Client Statistics to verify this, I find that the optimizer is using incorrect assumptions. On my laptop, the original, unhinted query takes 3.8 milliseconds on average, slightly slower than the version with the FORCESEEK hint. I do not have a full explanation for this, but I assume that, because there is no contention and no memory pressure at all on my laptop, the additional processing caused by Nested Loops receiving 10 rows for every salesperson and having to test the predicate every time actually outweighs the lower logical IOs.

On a busy server, I would not expect to see the same results. And even if I did, I would probably not trade the extra IO pressure for the small performance gain.

Cleanup

I like to keep my AdventureWorks unchanged. This ensures that demo code posted by others works the same for me, and that my future posts will work as expected for all readers.

I added an index at the start of this post. I can remove that by restoring AdventureWorks from the originally downloaded backup, but in this case it is easier to simply drop the index:
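Using the hypothetical index name from the sketch at the start of this post:

    DROP INDEX ix_SalesOrderHeader_TerritoryID ON Sales.SalesOrderHeader;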

Conclusion

We looked at a query where we expected a simple Index Seek, but instead got an execution plan that used an Index Scan, and a Table Spool to save and reuse the results. Because the results were aggregated, repeatedly reading them from the worktable causes far less IO than having an Index Seek that executes multiple times.

Using hints, we were able to force the optimizer to use a seek. Another hint allowed us to get rid of the spool operator. However, performance measurements prove that the spool is actually needed to get better performance.

The original choice for the index scan is not so clear cut. For the number of logical IOs, the original plan is clearly a winner. However, its actual runtime turns out to be slightly longer than the plan that uses an Index Seek. In my opinion, the performance gain is not enough to justify the additional logical IOs, so I think the optimizer actually made the right choice.

In the next installment in the Plansplaining series, I will look at an execution plan that uses Constant Scan and Nested Loops operators to, apparently, make the same logic run multiple times instead of just once. Want to know why? Just watch this space!


Plansplaining, part 1. The unexpected aggregation and assert

When I look at an execution plan I sometimes right away see how the operators work together. Other times I need to dive into the details and put in effort before I really understand it. And occasionally, I think I understand but then am proven wrong after I start looking at the details. However, understanding all the details of an execution plan is really important when you want to optimize performance. If I see an execution plan where I do not understand the role of each and every operator, I know I do not truly understand how the query is executed. Yet.

This post is the first in a series. In every part, I will look at a query and its execution plan, highlight possibly confusing, misleading or simply non-obvious elements in the plan, and explain why those operators are there and how they interact to return the requested results. It may be a bit long sometimes, it may be a bit dry. But if you put in the effort to read this and try to understand, you will become a better performance tuner.

A simple query

In this post I will be looking at the execution plan for a very simple query (using the standard AdventureWorks sample database):
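The query is not reproduced here; based on the plan described below, it must have been close to this sketch:

    SELECT p.FirstName,
           p.MiddleName,
           p.LastName,
           (SELECT pp.PhoneNumber
            FROM   Person.PersonPhone AS pp
            WHERE  pp.BusinessEntityID = p.BusinessEntityID) AS PhoneNumber
    FROM   Person.Person AS p;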

The execution plan for this query is shown below. It is from SQL Server 2017, but it is probably the same on all versions of SQL Server. At first sight it doesn’t look bad. I see an index scan for the Person table. Since there is no WHERE clause, that is to be expected. The subquery that grabs the phone number is implemented via a join operator. The logical join type is Left Outer Join, so that a row from the Person table that does not exist in the PersonPhone table is not dropped from the results.

But if I look a bit further, I also see things that are more surprising. An Assert operator is very normal in a plan for modification (insert, update, delete), for constraint checking. But this is a normal select, so why is that operator here? And why has SQL Server added an aggregation operator?

And there is one more thing. The Person table has 19,972 rows. Why did the optimizer choose to use a Nested Loops join? Would a Merge Join or Hash Match join not be better in this case? When I execute the query with SET STATISTICS IO ON added to the batch, I get confirmation that the Nested Loops join operator is in fact causing overhead. The 211 logical reads on the Person table are expected for this type of query; the 40,242 logical reads on PersonPhone appear to be excessive.

Step by step

It is tempting to immediately jump into tuning mode and try to find ways to change the join type, and perhaps get rid of the Assert and Stream Aggregate operator that I never asked for. But the optimizer usually has good reasons for what it does, so I exercise restraint. I first need to know why those operators are here. And that means that I need to make sure I understand their role within the execution plan. The best way to do that is to work out exactly what happens when the execution plan runs. That can take a lot of time, but you can consider that an investment. Once you know why this plan looks the way it looks, you can apply that knowledge every time you see a similar plan.

The start

Execution plans always start with the top-left operator (SELECT) calling its child operator (Compute Scalar). This is the GetNext() call, requesting a row. Most operators cannot return a row before they receive one, so most operators respond to the first GetNext() call by immediately calling their child operator. The operators in this plan are no exception. When execution starts, SELECT calls Compute Scalar, which calls Nested Loops, which in turn invokes Index Scan (the child operator on its top, or “outer”, input).

The Index Scan operator is the first one that actually does something interesting. The Output List property shows which columns it returns: BusinessEntityID, FirstName, MiddleName, and LastName. There is no Predicate property set, so it returns all rows. Looking at the original query, this is not surprising. But of course GetNext() requests just a single row, so the Index Scan will for now only navigate to the first page, grab the first row from that page, and return the values from that row to its parent (Nested Loops). It passes control back to Nested Loops, while maintaining state so that it can continue where it left off when called again.

Nested Loops though does not call Index Scan again. After it receives a row from the top input, it moves to the bottom input by calling the GetNext() entry point of the Assert operator.

The bottom input

The Assert operator responds to this GetNext() request by immediately calling the Stream Aggregate operator, which in turn requests a row from the Clustered Index Seek.

The properties of the Clustered Index Seek reveal that this operator will use the structure of the clustered index to navigate directly to the page that stores the phone numbers of the person just read from the Person table. At this point, there are two possibilities: a matching row is found, or not. Let’s for now assume that the operator does find a row. The Output List property shows only the PhoneNumber column, so this value is now returned to Stream Aggregate.

The Stream Aggregate operator in this case does not have a Group By property. This means that it produces a scalar aggregate: a single row with the aggregation results of all input rows. To do this it must first read all rows from the input. So after Clustered Index Seek returns its first row, Stream Aggregate updates some internal data structures and then immediately calls Clustered Index Seek again. However, I happen to know that there are no persons with more than one phone number in the PersonPhone table. So when Clustered Index Seek is called again, it will not find a second matching row and will instead return an “end of data” indicator.

Looking again at the properties of Stream Aggregate, I see that it returns a row with two columns, called Expr1004 and Expr1005. Expr1004 is defined as Count(*), the number of rows, so this will be 1. The Expr1005 column is defined as ANY(PhoneNumber). ANY is an aggregate function that we cannot use in our T-SQL, but the optimizer can use it in execution plans; it means “just give me one of the values from the input, I don’t care which one”. In this case, with just one row in the input, there is actually little choice.

This row, with Expr1004 set to 1 and Expr1005 set to the phone number for the person from the current row in the Person table, is then passed to the Assert operator. Assert evaluates an expression (shown in its Predicate property) and then passes the row unchanged when that expression evaluates to NULL, or stops execution with an error when it evaluates to another value. In this plan, the Predicate property of the Assert is “CASE WHEN Expr1004 > 1 THEN 0 ELSE NULL END” (edited slightly for readability). Expr1004 is equal to 1, so this expression evaluates to NULL. Assert passes the row. However, its Output List property shows that it passes only Expr1005 (the phone number). Expr1004 is not needed anymore.
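This Assert is, by the way, the same mechanism that produces the familiar error 512 when a scalar subquery does return more than one row. Here is a self-contained sketch (unrelated to AdventureWorks) that trips it:

    -- The subquery returns two rows for x = 1, so the Count(*) computed by
    -- Stream Aggregate exceeds 1 and the Assert aborts execution with
    -- Msg 512: "Subquery returned more than 1 value."
    SELECT v.x,
           (SELECT d.y
            FROM  (VALUES (1, 'a'), (1, 'b')) AS d(x, y)
            WHERE  d.x = v.x) AS y
    FROM  (VALUES (1)) AS v(x);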

Returning the first row

After Assert returns Expr1005 to the Nested Loops operator, it can combine the data from both inputs. There is no Predicate property on this Nested Loops, so whatever is returned from the bottom input is assumed to be a match. (A safe assumption because the bottom input used a value from the top input to return only matching data; this can be seen in the Outer References property of the operator). The Nested Loops operator now returns a row with the four columns from the Person table, plus Expr1005. Which, as we remember, is “any” of the phone numbers found for the person, which was just a single one in this case.

The Compute Scalar operator then receives this row, and computes a new column, Expr1003, which is set to be equal to Expr1005. This sounds like an utter waste of performance. And it would be, if SQL Server actually did this. However, the properties in an actual execution plan show no actual row count and no actual execution count for this operator. What we see here is an artefact of how plans are generated. Redundant Compute Scalar operators are sometimes left in the plan, but do not actually execute. They are removed from the executable plan in a final cleanup stage, after the plan was already selected, but this removal is not reflected in the plan we see. Seeing an operator with no values for the “Actual Number of Rows” and “Number of Executions” properties is a dead giveaway that the operator was removed because it was either redundant or its functionality was collapsed into another operator. In this case, it was redundant. So in reality, the row formed in the Nested Loops operator is returned directly to SELECT, which represents the client application in the execution plan.

An unmatched row

The SELECT operator will immediately request the next row, and as we now know it doesn’t request this of Compute Scalar but directly of Nested Loops. Nested Loops then needs to check whether the bottom input has more rows, so it requests the next row from Assert, which relays the request to Stream Aggregate. But Stream Aggregate in this case was set up as a scalar aggregate that returns a single row for the entire input, so it can never return a second row. It returns the “end of data” indicator, which Assert faithfully passes on to Nested Loops.

Since the bottom input is now exhausted, Nested Loops moves back to its top input and requests the next row from the Index Scan. This operator will continue where it left off last time, so now the second row from the indicated index is returned. After receiving this second row, Nested Loops reinitializes its bottom input and the entire process starts over.

We already know what happens when the Clustered Index Seek does find a row. For the sake of keeping this interesting, let’s assume the second row does not have a match in PersonPhone. After Nested Loops kicks off the process, the operators call each other until Clustered Index Seek searches for a matching row and doesn’t find anything. Since it can’t return a row to Stream Aggregate, it immediately returns the “end of data” indicator.

The Stream Aggregate operator usually does not return anything if it doesn’t receive any rows itself. But in this plan it is doing scalar aggregation, and that is the exception. Scalar aggregation always returns one row. In this case, Expr1004, the Count(*), will be set to 0; Expr1005 will be NULL. Assert receives this row, checks its Predicate, and since Expr1004 is not more than one, it will pass the row to the Nested Loops operator. This operator adds the NULL value in Expr1005 as the phone number, and returns this to its caller, to be sent to the client. And that’s how rows without phone numbers are handled.
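
The difference between scalar and vector aggregation is easy to demonstrate in plain T-SQL. Assuming no row in PersonPhone has BusinessEntityID = -1, the first query below still returns exactly one row (0 and NULL), while the second returns no rows at all:

-- Scalar aggregate: always exactly one row, even for empty input
SELECT COUNT(*)         AS NumRows,
       MIN(PhoneNumber) AS AnyPhone
FROM   Person.PersonPhone
WHERE  BusinessEntityID = -1;

-- Vector aggregate (Group By): empty input, empty output
SELECT COUNT(*)         AS NumRows,
       MIN(PhoneNumber) AS AnyPhone
FROM   Person.PersonPhone
WHERE  BusinessEntityID = -1
GROUP BY BusinessEntityID;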

Obviously, this process repeats over and over until all rows from the Person table have been handled and the query finishes. We now know exactly how the query handles all cases in our test data: both persons with and persons without a phone number. We know exactly what the Assert and Stream Aggregate operators do. But we do not yet understand WHY. So far, they have not affected the results. Why are they here?

What about multiple matches?

The Predicate expression of the Assert operator gives a clue as to why the operator is in the plan. Obviously, the optimizer wants to do something different when a person has multiple phone numbers. I know that this is not the case in the data in AdventureWorks, but SQL Server doesn’t know that. So it needs to handle this situation, and apparently that situation needs special handling in this specific plan. Let’s look: first in theory, and then with a test to verify.

Nobody needs more than one phone!

So now we assume that we added a row to PersonPhone, for a person that already had a row there. At some point the Index Scan will read this person. Nested Loops restarts its bottom input and requests a row; the operators call each other and Clustered Index Seek finds the first matching row and returns it to Stream Aggregate. Nothing new so far, this is the same as we saw for the first row. However, when Stream Aggregate now requests another row, it actually gets a second row returned. It updates its internal data structures and, again, requests the next row. Clustered Index Seek does not find a third phone number for this person, so “end of data” is returned.

After it has read all the input rows, Stream Aggregate now uses its internal data structures to return the correct values. Expr1004, the Count(*), is equal to 2; Expr1005, the ANY expression, returns one of the two phone numbers that were input. Which one is impossible to tell; the way ANY chooses its returned value is undocumented. But we do know that it will be one of the phone numbers for this person.

However, that phone number will never make it to the client. Since Expr1004 is now 2, the Predicate of the Assert operator evaluates to 0, not NULL. Any non-NULL result causes the Assert operator to immediately stop processing of the execution plan and throw an error. Which error that is can’t be seen from the execution plan. So perhaps this is a good time to run a test and verify what happens.

Let’s test

In order to test this, I will add a row to the PersonPhone table. I can insert it and then delete it later, or use a transaction that I roll back when done. I chose the latter:
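
A minimal version of such a test script could look like the sketch below. The specific values are hypothetical and mine; any row that gives an existing person a second phone number will do, as long as it does not violate the primary key of PersonPhone.

BEGIN TRANSACTION;

-- Give person 1 a second phone number (hypothetical sample values)
INSERT INTO Person.PersonPhone (BusinessEntityID, PhoneNumber, PhoneNumberTypeID)
VALUES (1, N'555-0123', 1);

-- ... run the sample query here, while the extra row exists ...

ROLLBACK TRANSACTION;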

When I run this query, I get partial results. The query returns its first 12,415 rows, but then an error occurs and processing stops. The messages pane shows an error message:
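
This is the standard error SQL Server throws whenever a scalar subquery produces more than one row, the well-known error 512 (exact formatting may vary by version and tool):

Msg 512, Level 16, State 1
Subquery returned more than 1 value. This is not permitted when the subquery
follows =, !=, <, <= , >, >= or when the subquery is used as an expression.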

And there we have our explanation. The SQL language does not like “random” results. When we use a subquery in a place where just a single value is expected, it gives us the benefit of the doubt. “My developers know what they are doing, they will know that this will never return multiple rows”. But true to the paradigm “trust, but verify”, it does add two operators, Stream Aggregate and Assert. These two operators ensure that an error message is thrown if it turns out the developer did not actually know what they were doing.

Can we tune this?

We now know why the Assert and Stream Aggregate operators are there. And this also “sort of” explains why a Nested Loops join algorithm was used. Merge Join and Hash Match both do a single pass over both inputs. The trick of aggregating the entire input and checking whether there is more than one row only works because the bottom input returns just the rows that match the current person; with a single pass over all of PersonPhone, a scalar aggregate would lump all persons together.

That doesn’t mean there are no other ways to achieve this result. The optimizer could have opted for a plan with a Merge Join or Hash Match by using a variation on this trick. It could do a Clustered Index Scan of all rows from PersonPhone, then feed those rows into a Stream Aggregate with a Group By property. This would return one row for each BusinessEntityID, with one of the phone numbers and a count. The same Assert can then stop execution when the count is too high. That would implement the same test for persons with multiple phones with just a single pass over the input, so now the two streams can be combined in a Merge Join or Hash Match join. Why didn’t the optimizer make that plan?

There can be multiple answers. Perhaps that plan introduces other overhead, making it more expensive. Perhaps the assumptions built into the costing rules make the optimizer think it is more expensive. Perhaps the set of rules used by the optimizer does not include a path from the actual plan to the theoretic one above. Or perhaps the optimizer stopped searching for better plans before it found this option.

Rewrite attempt

Sometimes rewriting a query can work to nudge the optimizer into a different plan. That is not guaranteed to work; I have been outstubborned by the optimizer more often than I care to admit. However, in this case it turns out that it’s fairly easy to convince the optimizer. Start with a CTE that reads the PersonPhone table and aggregates by person; join that to the Person table. In effect, I simply write a query that matches the plan I am looking for.

Now, I can’t directly use an Assert in my query, but I can use a CASE expression that forces a run-time error when a person has more than one phone. I also do not have an ANY aggregate function. But since the phone number will only actually be returned when the number of rows is 1, any aggregate function returns the same value in that case, so I can pick almost any of them. In this case I decided to use MIN.
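
The exact query can be written in several ways; the sketch below captures the shape just described, with names of my own choosing, and a divide-by-zero inside a CONVERT as one possible way to force the run-time error:

WITH PhoneData AS
   (SELECT   BusinessEntityID,
             COUNT(*)         AS NumPhones,
             MIN(PhoneNumber) AS PhoneNumber
    FROM     Person.PersonPhone
    GROUP BY BusinessEntityID)
SELECT p.BusinessEntityID,
       p.FirstName,
       p.MiddleName,
       p.LastName,
       CASE WHEN pd.NumPhones > 1
            THEN CONVERT(nvarchar(25), 1 / 0)   -- forces a run-time error
            ELSE pd.PhoneNumber
       END AS PhoneNumber
FROM   Person.Person AS p
LEFT JOIN PhoneData AS pd
  ON   pd.BusinessEntityID = p.BusinessEntityID;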

The query above returns the same data as the original query. It also generates a runtime error, just as the original query, although the error message is different. And it results in the type of plan I was hoping for:

This plan uses Hash Match to join the two data streams. The top input is PersonPhone, aggregated by BusinessEntityID. The bottom input is Person. Both inputs are scanned only once, and a lot of I/O is saved this way, as the output of SET STATISTICS IO demonstrates.
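
To reproduce that comparison, run both versions of the query with I/O statistics enabled, and compare the logical reads reported in the messages pane:

SET STATISTICS IO ON;

-- run the original query and the rewrite here,
-- then compare the logical reads for Person and PersonPhone

SET STATISTICS IO OFF;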

The estimated cost of the execution plan for this query is 0.959. For the original plan, that was 3.921. So this plan has a much lower estimated cost, and saves a huge amount of I/O. It does introduce a Hash Match operator, making the plan partially blocking and more memory hungry. I am going to guess that the optimizer would have picked this plan if it had found it while exploring alternatives. But that is just a guess.

Which one to use?

We now have two versions of the query. It is tempting to say that we should always use the one with the lowest estimated cost and the least I/O. But that is not always the right choice. The cheaper query does need more memory (in my tests it requests a memory grant of 8416KB), which can be problematic on servers that are already starved for memory. The Hash Match operator causes the plan to be partially blocking, as opposed to the fully streaming plan for the original query. And when there are persons with more than one phone, the original query stops executing as soon as it encounters one such person; the new version first finishes the entire build phase of the Hash Match and will only stop when it encounters the first problem row in the probe phase. Three reasons that might convince you, based on the requirements and limitations of your data and your server, to choose the original query.

Plus, obviously, the original query is a lot easier to understand and maintain; I would never allow the result of my rewrite attempt to go in production code without a long comment to explain what’s going on.

Conclusion

In this post I looked at a sample execution plan and dove into the details of each of the operators to understand their function relative to each other and to the plan as a whole. This is what every SQL Server professional should do when looking at an execution plan.

In this case, looking at all the details of the execution plan helped us find a way to speed up the query (with some downsides). That will not always be the case. Still, when you look at an element in an execution plan you do not understand, you always have to do the work to build that understanding. The result may not be a way to speed up the query, but you only know that for sure after you put in the work.

This post was very verbose, and very detailed. That’s because I also used this post to illustrate some basics of reading execution plans. That’s not how my mental process normally works when analyzing execution plans. I would look at the plan and skip over the parts I already understand. (I would still return to them if needed to make sense of the rest.) For this plan, that would leave the Assert, the Stream Aggregate, and the choice of physical join type. I would start with those operators. Upon seeing the Predicate of Assert, I would try to find where Expr1004 is computed, which would bring me to the Defined Values property of the Stream Aggregate operator. And at that point, I would understand the whole logic.

For future posts in the Plansplaining series, I plan to assume more basic understanding of execution plans than what I did for this post. They will be somewhere between the very long description in this post, and the very short description in the first paragraph.
