This is the second post in the plansplaining series. Each of these blog posts focuses on a sample execution plan and details exactly how that plan works.
In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in the pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you did not read part 1 already, I suggest you start there.
The query in this post is again based on the standard AdventureWorks sample database. But it needs an additional index, so let’s create that first:
CREATE INDEX IX_SalesOrderHeader_TerritoryID ON Sales.SalesOrderHeader (TerritoryID) INCLUDE (TaxAmt);
With the index in place, let’s now look at the sample query:
SELECT sp.BusinessEntityID, sp.TerritoryID, (SELECT SUM(TaxAmt) FROM Sales.SalesOrderHeader AS soh WHERE soh.TerritoryID = sp.TerritoryID) FROM Sales.SalesPerson AS sp;
The index appears to be custom tailored for this query. Nobody would ever blame you if you expect an execution plan that reads the SalesPerson table, then uses a Nested Loops join into an Index Seek on the new index, plus of course some aggregation somewhere. However, you would be wrong. In reality, the execution plan is as shown below:
The plan does indeed read all rows from SalesPerson, using a Clustered Index Scan. It also does feed those rows into a Nested Loops join. But the inner (lower) input of the Nested Loops uses an Index Scan instead of an Index Seek, and adds a Table Spool. Why did the optimizer choose this plan? And perhaps even more important, how does it even give us the data we need?
Let’s dive in
Let’s dive in, and let’s focus directly on the part that matters: the inner input of the Nested Loops join. After Nested Loops receives the first row from Clustered Index Scan, it initializes the lower branch and requests the first row.
The first iteration
The Index Scan on the right-hand side of that input has its Ordered property set to True, so in this case we know for sure that data will be returned ordered by TerritoryID: first all rows with TerritoryID 1, then those with 2, etc. But of course, like all operators, it only returns data when asked for it. In this case it is a Stream Aggregate that calls the Index Scan. The Stream Aggregate will continue to call Index Scan until the value in its Group By column (TerritoryID) changes. So in this case, it will read all rows with TerritoryID 1, then receive the first row with TerritoryID 2, and then stop calling Index Scan and return a single row with TerritoryID 1 and a second column (Expr1002) that holds the sum of all TaxAmt values. (This can be seen from the Defined Values property.)
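To make the "read until the group changes" behavior concrete, here is a minimal Python sketch of a grouping Stream Aggregate over an ordered input. It is an illustration only, not how SQL Server implements the operator; the function name `stream_aggregate` and the sample rows are made up for this example.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, float]  # (TerritoryID, TaxAmt)


def stream_aggregate(ordered_rows: Iterable[Row]) -> Iterator[Row]:
    """Yield one (TerritoryID, SUM(TaxAmt)) row per group of an ordered input."""
    rows = iter(ordered_rows)
    try:
        current_key, total = next(rows)
    except StopIteration:
        return  # empty input: a grouping aggregate returns no rows
    for key, tax in rows:
        if key != current_key:
            # First row of the next group: return the finished group and
            # keep the row just read as the start of the new group.
            yield current_key, total
            current_key, total = key, tax
        else:
            total += tax
    yield current_key, total  # the last group ends at "end of data"


# Made-up rows, already ordered by TerritoryID like the Index Scan output.
index_scan_output = [(1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5), (3, 1.0)]
result = list(stream_aggregate(index_scan_output))
print(result)
```

Because this is a generator, it only pulls rows from its input when its own caller asks for the next group, which mirrors the pull-based model described above.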
Table Spool is working as a Lazy Spool. Upon receiving this first row, it stores it in a worktable, and then immediately returns it to Nested Loops. This operator will verify its predicate to test if this row matches the row from the SalesPerson table. If it doesn’t, then it calls the Table Spool again, requesting the next row. If it does, it returns the combined results and waits until called again, and then requests the next row from Table Spool. So either way, Table Spool is requested for a second row.
When Table Spool regains control, it will once more pass on the request for a row to Stream Aggregate. This operator already received a single row with TerritoryID 2. It now repeatedly invokes Index Scan to get the other rows for this territory, until it sees the next new value (3). Stream Aggregate then returns TerritoryID 2 and its SUM(TaxAmt) to Table Spool, which adds this row to the worktable and then returns it. The worktable now has two rows stored, for the first two sales territories.
The above process repeats until all rows from SalesOrderHeader are read. There are only ten distinct TerritoryID values in this table, so Table Spool returns ten rows (after storing them in the worktable). Upon the eleventh call, Table Spool returns “end of data”. Nested Loops has Left Outer Join as its logical operation, so it checks if at least one match for the row from SalesPerson was found. If not, it now returns that row with a NULL value for Expr1002 and waits to regain control; otherwise it will immediately move on.
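The Left Outer Join logic described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the real operator: `nested_loops_left_outer`, the `open_inner` callback, and all sample values (including the BusinessEntityID numbers) are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

OuterRow = Tuple[int, Optional[int]]   # (BusinessEntityID, TerritoryID)
InnerRow = Tuple[int, float]           # (TerritoryID, Expr1002)
JoinedRow = Tuple[int, Optional[int], Optional[float]]


def nested_loops_left_outer(
    outer: Iterable[OuterRow],
    open_inner: Callable[[], Iterable[InnerRow]],
) -> Iterator[JoinedRow]:
    for entity_id, territory_id in outer:
        matched = False
        # (Re)start the inner input for every outer row.
        for inner_territory, total_tax in open_inner():
            # The join Predicate; a NULL TerritoryID never matches.
            if territory_id is not None and inner_territory == territory_id:
                matched = True
                yield entity_id, territory_id, total_tax
        if not matched:
            # Left Outer Join: keep the outer row, with NULL for Expr1002.
            yield entity_id, territory_id, None


# Made-up data for illustration only.
inner_rows: List[InnerRow] = [(1, 15.0), (2, 10.0), (3, 1.0)]
outer_rows: List[OuterRow] = [(101, 2), (102, None), (103, 1), (104, 9)]
joined = list(nested_loops_left_outer(outer_rows, lambda: inner_rows))
print(joined)
```

Note that the inner loop always runs to the end, even after a match is found, just as Nested Loops keeps calling Table Spool until it returns “end of data”.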
The second iteration
When the first iteration of the inner (lower) input is done, Nested Loops returns to the outer (upper) input and reads the second sales person. It then reinitializes the inner input and again calls the Table Spool. However, this second initialization is a rewind. When a Nested Loops operator does not have an Outer References property (as in this case), then by definition the first execution of the inner input is a rebind and all other executions are rewinds.
Table Spool is one of a select group of operators that actually care about the difference between rebinds and rewinds. The description above, for the first execution, is for a rebind. For a rewind, the Table Spool does not call its child operators. Instead, it uses the worktable created previously to return the same set of rows as the previous time.
So for the second salesperson, Table Spool is called, looks at the worktable, and returns the first row stored (the row with TerritoryID 1 and its total TaxAmt value). Then when Nested Loops calls it again it reads and returns the second row from the worktable, and so on until all rows are returned and it once more responds with “end of data”.
Bringing it together
There are a total of seventeen rows in the SalesPerson table. For each of those rows the inner (lower) input is executed. The first execution is a rebind, so the Index Scan and Stream Aggregate are executed. All of the other executions are rewinds; for these the child operators do not run because Table Spool returns the data from the worktable.
Do not be misled by first appearances. Normally the lower input of a Nested Loops executes multiple times. Normally you do not want to see a scan there, especially not on a table as large as SalesOrderHeader. But in this case, first looks deceive. The only operator that actually executes multiple times is the Table Spool, which uses a worktable to store the output for later reuse, shielding its child operators from having to execute multiple times.
Much of this is directly visible from the execution plan. On the Table Spool operator, you can check the Actual Rebinds and Actual Rewinds to get confirmation that there was 1 rebind and 16 rewinds. The Number of Executions on the Table Spool is 17 (as expected, to match the 17 rows read into the Nested Loops operator), but Stream Aggregate and Index Scan both have their Number of Executions set to 1: they executed for the single rebind, but were not used for the rewinds.
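The counts above (1 rebind, 16 rewinds, a single execution of the child operators) can be reproduced with a small Python model of a lazy Table Spool. This is a sketch under simplifying assumptions, not SQL Server's implementation: the class `LazyTableSpool`, the `counters` dictionary, and the made-up totals are all hypothetical, and the worktable is just a Python list.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[int, float]  # (TerritoryID, total TaxAmt)


class LazyTableSpool:
    def __init__(self, open_child: Callable[[], Iterable[Row]]) -> None:
        self._open_child = open_child
        self._worktable: List[Row] = []
        self.rebinds = 0
        self.rewinds = 0

    def execute(self, rebind: bool) -> Iterator[Row]:
        if rebind:
            self.rebinds += 1
            self._worktable = []  # start from scratch
            return self._fill_and_return()
        self.rewinds += 1
        return iter(self._worktable)  # child operators are not called

    def _fill_and_return(self) -> Iterator[Row]:
        for row in self._open_child():
            self._worktable.append(row)  # store the row first...
            yield row                    # ...then pass it on immediately


counters = {"child_executions": 0}


def aggregated_scan() -> Iterator[Row]:
    """Stands in for Index Scan + Stream Aggregate: ten rows, made-up totals."""
    counters["child_executions"] += 1
    for territory_id in range(1, 11):
        yield territory_id, 100.0 * territory_id


spool = LazyTableSpool(aggregated_scan)
rows_per_execution = []
for n in range(17):  # one execution per SalesPerson row
    # Without Outer References, only the first execution is a rebind.
    rows = list(spool.execute(rebind=(n == 0)))
    rows_per_execution.append(len(rows))

print(spool.rebinds, spool.rewinds, counters["child_executions"])
```

Every one of the 17 executions returns the full set of ten rows, but only the first one touches the child operators.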
But what if I want a seek?
The mystery is solved. The execution plan is dissected, and I now know exactly how the operators in this plan cooperate to produce the intended results. But I still have a feeling that my original idea, with an index seek in the lower input to a Nested Loops join, would also have been a valid idea. To help me understand why the optimizer disagrees, I can do something that I would not do in production: add a hint to force my ideas upon the optimizer.
SELECT sp.BusinessEntityID, sp.TerritoryID, (SELECT SUM(TaxAmt) FROM Sales.SalesOrderHeader AS soh WITH (FORCESEEK) WHERE soh.TerritoryID = sp.TerritoryID) FROM Sales.SalesPerson AS sp;
I added a FORCESEEK hint to ensure that the optimizer has no other option but to give me an execution plan that uses an Index Seek on the SalesOrderHeader table. And yet, the execution plan is not what I was expecting to see. For sure, the Index Seek is there. But there still is also a spool operator. This time it’s not a Table Spool but an Index Spool. But it’s still there!
So let’s once more dive into the execution plan and see what happens during execution!
The first iteration
One of the differences between the first plan and the current version is that the Nested Loops operator has replaced its Predicate property with an Outer References property. Just as in the execution plan I looked at in part 1, the Outer References column from the top input (in this case TerritoryID) is used in (“pushed into”) the bottom input so that it only returns matching rows.
We can see this in the Index Seek. The new index is used to retrieve only the rows we need for the current sales territory. Because of that, Stream Aggregate has lost its Group By property; it now gets only rows for a single sales territory on each run so it can do global aggregation and just return a single row with the sum of TaxAmt in all rows it received.
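As a rough model of this pair of operators, the Python sketch below uses a binary search on a sorted list to stand in for the Index Seek, and a plain loop for the global aggregation. The names `index_seek` and `global_sum` and the sample rows are assumptions made for this illustration; returning `None` for an empty input mirrors the SQL rule that SUM over no rows is NULL.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List, Optional, Tuple

Row = Tuple[int, float]  # (TerritoryID, TaxAmt)

# Made-up index rows, kept sorted by TerritoryID like the nonclustered index.
index_rows: List[Row] = [(1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5), (3, 1.0)]
index_keys = [territory_id for territory_id, _ in index_rows]


def index_seek(territory_id: Optional[int]) -> List[Row]:
    """Return only the rows for one territory, located by binary search."""
    if territory_id is None:
        return []  # TerritoryID = NULL never matches any row
    first = bisect_left(index_keys, territory_id)
    last = bisect_right(index_keys, territory_id)
    return index_rows[first:last]


def global_sum(rows: Iterable[Row]) -> Optional[float]:
    """Aggregate without Group By: always one result, None for no input."""
    total: Optional[float] = None
    for _, tax in rows:
        total = tax if total is None else total + tax
    return total


print(global_sum(index_seek(2)))
print(global_sum(index_seek(None)))
```

The outer reference plays the role of the `territory_id` argument: each execution of the inner input seeks for one value and returns one aggregated row.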
The final part of the bottom input is the spool, which has changed from a Table Spool to an Index Spool. An Index Spool is very similar to a Table Spool, with one extra. A Table Spool just stores data in a worktable, which has a clustered index on a hidden column that represents the insertion order (used to ensure that rows are returned in the same order when they are fetched from the worktable on a rewind). An Index Spool adds an additional (nonclustered) index on top of the worktable.
Finding the key of that extra index is not trivial. The property that holds this information is not shown in the graphical execution plan. The only options are to either infer this information from the Seek Predicate property of the Index Spool, or to open the XML representation of the execution plan and look for the RangeColumns node within the SeekPredicateNew element. In most cases, the first method works just fine. In our current example, the worktable will be indexed on TerritoryID.
During the first iteration, the Index Spool receives, stores (and indexes), and returns just a single row, with the total TaxAmt for the sales territory of the first sales person.
The second iteration
Since this plan uses Outer References on the Nested Loops operator, subsequent executions of the lower input can be rebinds or rewinds. They are a rewind when the next salesperson happens to have the same sales territory as the previous one; otherwise they are a rebind.
At this point we need to look at a few more differences between Table Spool and Index Spool. For starters, on a rebind a Table Spool clears out its worktable and starts building from scratch; an Index Spool doesn’t clear the worktable but simply adds the new rows to the data already there. Second, a rewind causes a Table Spool to return all rows stored in the worktable; an Index Spool will only return rows that match its Seek Predicate property (for which it uses the additional nonclustered index). These are the rows that were added during the last rebind. So in effect, both operators return the same rows as the previous time on a rewind; Table Spool does this by throwing out everything it had before on each rebind, and Index Spool does this by using an index to read only the most recently added data.
So long story short, every execution of the lower branch will either use the Index Seek to find only matching rows, total up the tax in Stream Aggregate, and then store and return it; or use the index on the worktable to find the previously computed total tax for a sales territory.
The missing piece
One thing you should never do when working with execution plans is jumping to conclusions. The conclusion above is correct, but incomplete, and that can be seen by looking at all the available information in the execution plan.
If you look at the properties of Index Spool, you see that Actual Rebinds is 16, and Actual Rewinds is 1. Looking at the returned query results, that makes sense. The data is not ordered by sales territory; by pure coincidence there happens to be one set of two consecutive rows with the same sales territory, which causes the single rewind.
Based on 1 rewind and 16 rebinds, the expected number of executions for the Stream Aggregate and Index Seek would be 16, right? And yet, the execution plan shows the Number of Executions for these two operators to be 11. Where are the five missing executions?
The answer is in one last difference between Index Spool and Table Spool (or rather, between Index Spool and all other operators that observe rewinds and rebinds). You see, when Index Spool is started as a rebind, it does faithfully increment the rebind counter, but then it checks the data available in its indexed worktable. If the data it needs is already there, it will ignore the instruction to rebind and instead use the nonclustered index to return the correct subset of rows.
In other words, Index Spool counts a rewind when the immediately preceding execution used the same values in the Outer References columns, but acts as a rewind when any of the preceding executions already used the same values. There are eleven different values for TerritoryID in the SalesPerson table (1 to 10, plus NULL). For each of these, the Stream Aggregate and Index Seek only execute the first time the value is seen; when the same value is found again, no matter how many rows were in between, the worktable is used to pull the total tax amount.
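The difference between what Index Spool counts and what it actually does can be modeled in Python as follows. This is a simplified sketch, not the real operator: the class `IndexSpool` is hypothetical, a dictionary stands in for the worktable plus its nonclustered index, and the list of territory values is made up. It was chosen only because it has 17 entries, 11 distinct values (1 to 10 plus NULL), and a single pair of equal neighbors, matching the counts reported above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[Optional[int], Optional[float]]  # (TerritoryID, total TaxAmt)

_NO_PREVIOUS = object()  # sentinel: there was no previous execution yet


class IndexSpool:
    def __init__(self, open_child: Callable[[Optional[int]], List[Row]]) -> None:
        self._open_child = open_child
        # The dict stands in for the worktable and its index on TerritoryID.
        self._worktable: Dict[Optional[int], List[Row]] = {}
        self._previous_key: object = _NO_PREVIOUS
        self.rebinds = 0
        self.rewinds = 0

    def execute(self, territory_id: Optional[int]) -> List[Row]:
        # The counters only compare with the immediately preceding execution.
        if territory_id == self._previous_key:
            self.rewinds += 1
        else:
            self.rebinds += 1
        self._previous_key = territory_id
        # The behavior depends on whether the value was seen at any time before.
        if territory_id not in self._worktable:
            self._worktable[territory_id] = self._open_child(territory_id)
        return self._worktable[territory_id]


counters = {"child_executions": 0}


def seek_and_aggregate(territory_id: Optional[int]) -> List[Row]:
    """Stands in for Index Seek + Stream Aggregate; totals are made up."""
    counters["child_executions"] += 1
    total = None if territory_id is None else 100.0 * territory_id
    return [(territory_id, total)]


# Hypothetical TerritoryID values for the 17 SalesPerson rows (None = NULL).
territories = [None, 2, 4, 3, 6, 5, 1, 4, 6, 1, 1, None, 9, None, 8, 10, 7]

spool = IndexSpool(seek_and_aggregate)
for territory_id in territories:
    spool.execute(territory_id)

print(spool.rebinds, spool.rewinds, counters["child_executions"])
```

With this input the model reports 16 rebinds and 1 rewind, yet the child operators run only 11 times: once for each distinct value.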
Do we really need that spool?
It is now clear what the function of the Index Spool is. Without it, the Index Seek would execute 17 times, and 6 of those executions would be used to retrieve data that was already retrieved before. By storing only the single aggregated row in a worktable, a lot of I/O on the SalesOrderHeader can be avoided.
If you don’t believe me, you can easily check this. In SQL Server 2016, a new query hint was added: NO_PERFORMANCE_SPOOL. This prevents the optimizer from adding a spool operator in queries to help performance. (It can still use spools for other reasons).
SELECT sp.BusinessEntityID, sp.TerritoryID, (SELECT SUM(TaxAmt) FROM Sales.SalesOrderHeader AS soh WITH (FORCESEEK) WHERE soh.TerritoryID = sp.TerritoryID) FROM Sales.SalesPerson AS sp OPTION (NO_PERFORMANCE_SPOOL);
With this hint added, we finally get the execution plan that I described at the start of this post: a simple Nested Loops into an Index Seek with aggregation:
Because the spool is removed, Stream Aggregate and Index Seek now execute once for each of the 17 rows coming from SalesPerson. Some of the data from SalesOrderHeader is read more than once. If you execute SET STATISTICS IO ON; before running this query, you will see 179 logical reads on this table, as opposed to 115 for the previous query. On the other hand, the 107 logical reads on the worktable for the previous query are now gone.
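Adding up the logical reads quoted above makes the trade-off easier to see. The short Python snippet below only does that arithmetic; the numbers are the ones reported in this post, and the dictionary keys are just labels chosen for this example.

```python
# Logical reads as reported by SET STATISTICS IO in this post.
reads_with_spool = {"SalesOrderHeader": 115, "Worktable": 107}
reads_without_spool = {"SalesOrderHeader": 179}

total_with_spool = sum(reads_with_spool.values())
total_without_spool = sum(reads_without_spool.values())

# Extra reads on the permanent table once the spool is gone.
extra_table_reads = (
    reads_without_spool["SalesOrderHeader"] - reads_with_spool["SalesOrderHeader"]
)

print(total_with_spool, total_without_spool, extra_table_reads)
```

So removing the spool costs 64 extra logical reads on SalesOrderHeader but saves all 107 reads on the worktable.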
Faster or slower?
The total number of logical reads for this third query is lower than for the second. But not all logical IOs are created equal. The optimizer appears to consider I/O on a worktable as less expensive than on a normal table. Probably because it expects I/O in tempdb to be very likely to be in the buffer pool only, whereas I/O on a permanent table has a higher chance to be physical I/O. Also, I have a strong suspicion that the optimizer focuses on the IO cost of reading data but under-estimates the cost of writing to a worktable. (This suspicion is based on other plans I looked at, beyond the scope of this post).
The actual performance difference does not depend on assumptions in the optimizer’s costing model, but on differences in your setup. Is the data currently in the buffer pool or not? How much memory pressure is there on your server? How many disk files are used for tempdb and for the user database, and on what type of storage? Etc.
One way I like to use to measure performance of short-running queries like this one is to enable the “Discard results after execution” option in the SSMS Query Options, make sure that the statistics io and statistics time settings as well as the Include Actual Execution Plan option are disabled, enable “Include Client Statistics”, and type GO 1000 (or another suitably high number) after the query to make it run a thousand times. I can then look at the Total execution time in the Client Statistics pane to see how much time it took. (Obviously, this method measures only elapsed time. If your server has severe bottlenecks for e.g. IO or CPU usage, then elapsed time is not the metric you should tune for).
On my laptop, the query with Index Seek and Index Spool takes 3.2 milliseconds on average; the version without Index Spool takes 5.0 milliseconds.
Why scan instead of seek?
We now know why the optimizer adds an Index Spool when we force it to seek the index. But one question remains: why do we even have to use this hint? Why does the optimizer prefer to use an Index Scan?
A part of the answer is in the numbers. In the first plan, the Index Scan executes once, reading the entire table. In the second plan, the Index Seek reads only a subset of the table in each execution, but it executes 11 times and all executions combined still read all the rows. This is in fact a bit more expensive, as each seek starts by traversing the root and intermediate pages. You can check this if you run both queries with SET STATISTICS IO ON enabled. The first query needs 88 logical reads on SalesOrderHeader; the second query requires 115 logical reads.
The Table Spool in the first query builds a single table with 10 rows. The Index Spool also builds a single table (in this case with 11 rows), but with an additional index; that should add some cost. This, too, is visible in the statistics io: 53 logical reads on the worktable for the first query versus 107 on the second.
Finally, the Index Spool returns just a single row to Nested Loops for each iteration, whereas Table Spool stubbornly returns the full set of 10 rows every time. With this size of table, that should not affect the logical IOs (the rows are all on a single page). There will be a bit of extra CPU usage but that will be minimal in comparison to the IO cost. The same goes for the different CPU requirement for Stream Aggregate of 11 global aggregations versus a single aggregation with Group By.
So here we see that the second query uses more IOs on the SalesOrderHeader table, and has a more expensive spool type. There are definitely situations where Index Spool is great. Its ability to retrieve “older” sets can save tons of processing, and its ability to use a nonclustered index to return only matching rows can also be more efficient than a simple Table Spool that returns all rows. But not in this case.
However, when I use the Client Statistics to verify this, I find that the optimizer is using incorrect assumptions. On my laptop, the original, unhinted query takes 3.8 milliseconds on average, slightly slower than the version with the FORCESEEK hint. I do not have a full explanation for this, but I assume that, because there is no contention and no memory pressure at all on my laptop, the additional processing caused by Nested Loops receiving 10 rows for every salesperson and having to test the predicate every time actually outweighs the lower logical IOs.
On a busy server, I would not expect to see the same results. And even if I did, I would probably not accept the extra IO pressure in exchange for the small performance gain.
I like to keep my AdventureWorks unchanged. This ensures that demo code posted by others works the same for me, and that my future posts will work as expected for all readers.
I added an index at the start of this post. I can remove that by restoring AdventureWorks from the originally downloaded backup, but in this case it is easier to simply drop the index:
DROP INDEX IX_SalesOrderHeader_TerritoryID ON Sales.SalesOrderHeader;
We looked at a query where we expected a simple Index Seek, but instead had an execution plan that used an Index Scan, and a Table Spool to save and reuse the results. Because the results were aggregated, repeatedly reading them from the worktable causes far less IO than having an Index Seek that executes multiple times.
Using hints, we were able to force the optimizer to use a seek. Another hint allowed us to get rid of the spool operator. However, performance measurements prove that the spool is actually needed to get better performance.
The original choice for the index scan is not so clear cut. For the number of logical IOs, the original plan is clearly a winner. However, its actual runtime turns out to be slightly longer than the plan that uses an Index Seek. In my opinion, the performance gain is not enough to justify the additional logical IOs, so I think the optimizer actually made the right choice.
In the next installment in the Plansplaining series, I will look at an execution plan that uses Constant Scan and Nested Loops operators to, apparently, make the same logic run multiple times instead of just once. Want to know why? Just watch this space!