Troubleshooting performance issues can be hard. Having as much information as possible available at our fingertips does not magically make this easier, but it does at least save us the burden of collecting that information from other places. That’s why I like how much extra information Microsoft has been adding to execution plans in recent releases.
During this year’s PASS Summit, I went to a session presented by Pedro Lopes and Joe Sack. Pedro explained in his half of the session that Microsoft has the goal of making sure that everything we need for troubleshooting will be available in the execution plans, and he showed a recent improvement that has me very excited.
How long did that take you?
But let’s first look at a slightly older improvement.
When I am tuning a query, I am not always looking at the same metric. Sometimes I want the query to finish as fast as possible, especially when the CEO is waiting for their favorite report to finish. Sometimes I want it to take as few CPU cycles as possible, especially on a server that has extreme CPU pressure. And sometimes I want to minimize logical IOs when the server is mostly restrained by IO speed and buffer pool usage.
For the first two metrics, SET STATISTICS TIME ON has always been my friend. For those who have never used this, here is what you will see in the messages pane after running a test query with this option enabled:
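As a sketch of that output (the query below is just an arbitrary AdventureWorks example and all numbers are invented for illustration; your actual values will differ):

```sql
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM   Sales.SalesOrderDetail;   -- any test query will do

SET STATISTICS TIME OFF;

-- Typical messages-pane output (numbers invented):
--   SQL Server parse and compile time:
--      CPU time = 12 ms, elapsed time = 49 ms.
--
--   SQL Server Execution Times:
--      CPU time = 31 ms,  elapsed time = 128 ms.
```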
The first line shows the effort required to compile the execution plan. This can be useful if you are, for instance, working with extremely complex queries and you suspect that the optimizer may be spending too much time on compilation. This same data has also been included in the execution plan, for as long as I can remember. You can see it by right-clicking on the top-left operator of the execution plan and then selecting “Properties”. (Or, if you are like me and leave the properties window open at all times, just click on that operator to make it the focus of the properties window.) See the green boxed areas in the screenshot below. (You may notice that the values are not exactly the same. They are measured in different ways. However, they are always close enough that I feel safe to use either option.)
The second part of the SET STATISTICS TIME output shows the elapsed time and CPU time used while the execution plan was actually executing. This is often far more useful – after all, most queries run longer than they compile, and they also execute more often, because the execution plan gets stored in the plan cache. So the elapsed and CPU time of the query execution is usually the more relevant metric.
For a long time, this metric was not available in the execution plan – but recently it was added (and even backported to earlier versions). The red boxed area in the screenshot shows the QueryTimeStats node, which can be expanded to show the CpuTime and ElapsedTime properties. Again, the values differ slightly from the values that SET STATISTICS TIME shows, but the differences are really minor.
Since these values are now included in every execution plan, I would argue that there is no need anymore to ever use SET STATISTICS TIME ON. The overhead this command causes may be very small but I still prefer no overhead at all!
Enter the new era!
To quote late night advertorial TV programs: “but wait … there’s more!”
If you upgrade SQL Server 2017 to the latest Cumulative Update (CU12 at the time I write this), then you may see even more data under the QueryTimeStats node, as shown in the next screenshot:
In this case the execution plan shows not only the CPU time and elapsed time for the entire execution plan, it also shows the CPU time and elapsed time that was spent while evaluating scalar user-defined functions (UDFs). This is really fantastic! I mean, I already knew that scalar UDFs can be bad for performance but in this case I can even tell exactly how bad they were for this query – almost all the CPU and elapsed time for this execution were spent on scalar UDFs! If this is the query that I need to tune to fix a performance issue in the finance department database, I know right away that I do not even need to look at the execution plan of the query itself. I can jump right into the scalar UDFs used by the query.
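Under the covers, this data lives in the showplan XML. The fragment below is hand-written rather than captured from a real plan, but it shows the general shape: the two UDF attributes sit next to the existing ones on the QueryTimeStats element (all values in milliseconds, invented for illustration):

```xml
<QueryTimeStats CpuTime="8312" ElapsedTime="8486"
                UdfCpuTime="8102" UdfElapsedTime="8243" />
```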
And conversely, if I see a very low number for the UdfCpuTime and UdfElapsedTime properties, then even though a scalar UDF is used by the query I know that it does not cause the issues this time and that I should focus on the main query.
Previously this information was not available in any way other than running a relatively expensive Extended Events session on the server all the time.
When I saw Pedro present this, I got very excited. Big thumbs up to Microsoft!
Another feature that is available in execution plans but not in SET STATISTICS TIME – and has been since SQL Server 2014 – is the breakdown of elapsed and CPU time per operator. See this screenshot for an example of how the properties of a single operator within a sample execution plan might look.
As you can see in the red marked area, I can see exactly how much CPU time (confusingly labeled “Actual Elapsed CPU Time”) and elapsed time this operator used. And I can see this for each thread in a parallel plan. This makes it far easier to find out which operators in a complex plan cause it to be slow, or to identify issues caused by parallelism skew.
Unfortunately, the per-operator display of CPU usage and elapsed time does not expose the time spent in scalar UDFs. Yet (?).
But what about the I/Os?
As mentioned before, I am not always interested in the same metric when tuning. What if I am primarily interested in reducing the number of logical reads? I obviously can use the SET STATISTICS IO ON statement to get output such as shown below (click to enlarge) after running the query:
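For reference, here is a sketch of that output (again, the query is an arbitrary AdventureWorks example and the numbers are invented):

```sql
SET STATISTICS IO ON;

SELECT COUNT(*)
FROM   Sales.SalesOrderDetail;   -- any test query will do

SET STATISTICS IO OFF;

-- Typical messages-pane output (numbers invented):
--   Table 'SalesOrderDetail'. Scan count 1, logical reads 1246,
--   physical reads 0, read-ahead reads 0, lob logical reads 0,
--   lob physical reads 0, lob read-ahead reads 0.
```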
But wouldn’t it be nice if this information, too, was available within the execution plan itself? Turns out, today is our lucky day!
In the screenshot above you may already have noticed the “Actual I/O Statistics” node. This node is not included in all operators, but you will see it on each operator that does (or even might do, e.g. in the case of Sort or Hash Match operators that may spill) I/O while executing. Expanding that node gives me a very nice breakdown of the I/O done by each of the operators in the execution plan. It shows the same information SET STATISTICS IO shows, but again broken down per operator and per thread.
However, where CPU time and elapsed time are shown by operator, and shown aggregated for the entire plan, that is not the case for the I/O statistics. If I want to get the total for the query, I will have to go through the entire plan and manually tally up all the numbers – or continue to use (and accept the overhead of) SET STATISTICS IO ON.
Another shortcoming that I see in this per-operator display of the I/O statistics is that it does not show how many segments were eliminated – a very handy feature that was added to the SET STATISTICS IO output a few years ago and that would be nice to have inside the execution plan as well.
However, despite all the shortcomings it is of course incredibly useful to have all this information available in the execution plan.
With so much goodness in the execution plans, you might be excused for expecting me to be perfectly happy and satisfied right now. But you would be wrong. Those who know me better know that, regardless of how happy I am with any given feature in SQL Server, I will always continue to press Microsoft to do even better.
I have decided to list all my wishes here. But I have also added all these wishes as requests on User Voice, Microsoft’s new feedback channel for SQL Server related issues. The list below is in random order, because I want all of these features equally badly. However, if you think that some are more important than others, let Microsoft know by voting and/or commenting on the items.
Show total I/O for the entire execution plan in the top-left node. The QueryTimeStats node on the top-left operator shows the total time for the entire execution plan which saves us the effort of having to tally up all the data from the Actual Time Statistics nodes of each individual operator. It would be great if there was a similar QueryIOStats node to immediately see the total logical and physical reads for all operators in the plan combined.
Show I/O from user-defined functions in execution plan. The addition of UdfCpuTime and UdfElapsedTime to the QueryTimeStats node in an actual execution plan is a huge step forward, but I/O done from a scalar UDF is often an even bigger problem, and there is currently no way, other than an Extended Events session, to get insight into that. It would be awesome if Microsoft could add this to the execution plan, either for the node where the UDF is invoked or for the plan as a whole (ideally both). It does not have to be broken down for each individual table; it can be a grand total if that is easier to implement.
Add count of logical writes to SET STATISTICS IO and to Actual I/O Statistics in execution plans. Currently neither the output of SET STATISTICS IO nor the data captured in the Actual I/O Statistics property in execution plans expose the number of logical writes. For Insert, Update, Delete, and Merge operators, this is a minor nuisance; usually the number of logical writes is well known and it is definitely seldom surprising. But for operators that use worktables and workfiles, such as Table Spool or Hash Match, there currently is no way to see how many logical writes were used. This can, for instance, cause a many-to-many Merge Join to show only a very low number of logical reads to the worktable because there were very few actual duplicates in the data, while hiding the perhaps millions of logical writes that may be an actual performance problem in the execution plan.
Add segment elimination data to Actual I/O Statistics in execution plans. Operators that have the ability to do I/O show their Actual I/O Statistics in an actual execution plan, allowing developers and DBAs to gain insight into the amount of I/O done for each individual operator. However, the information about the number of segments read and the number of segments eliminated that has been added to SET STATISTICS IO for queries that involve columnstore indexes is not included in the Actual I/O Statistics. It would be very useful to have this information there as well.
Add CPU and elapsed time for UDFs to ActualTimeStatistics. The addition of UdfCpuTime and UdfElapsedTime to the QueryTimeStats node in an actual execution plan is a huge step forward. But it would be very useful to also have this information on a per-operator level, especially in plans where multiple operators invoke scalar user-defined functions.
NOTE: Today (December 5), I updated this post. Eugene Karpovich pointed out that I had mistakenly typed SET STATISTICS IO instead of SET STATISTICS TIME multiple times in the first part of the post. This is now corrected. Thanks, Eugene!!
My very first SQL Server related job was quite unusual. I was added to a team that developed software using a code generator. We had to define business rules (such as constraints and derivation rules) on top of a conceptual information model for the company. The tool would then generate T-SQL code to support and implement those rules. The generated code used triggers. Lots of them. And it worked.
So even after I moved on to other, more regular SQL Server development work, I never forgot that triggers are very good at doing what they are supposed to do: respond to certain changes no matter where they originate from. But a lot of people in the SQL Server community have a thorough dislike, perhaps even hatred, for triggers. It took me some time to understand where this is coming from.
This post is about the two main reasons why people dislike triggers, and how to alleviate these concerns.
It doesn’t work!
An often heard complaint about triggers is that they suddenly do not work, or do not work correctly, in production. Whenever I get a chance to investigate these situations, I always see that the error is not caused by “triggers” (the concept), but by “that trigger” (or rather, the coding errors in it).
The code below is a typical coding pattern that I have seen far too often and that I hope never to see again:
DECLARE @DepartmentID int;   -- variable and column names are illustrative
-- Find the row that was inserted (note: this grabs just ONE arbitrary row!)
SELECT @DepartmentID = DepartmentID FROM inserted;
-- Do something relevant with the row
/* REST OF THE CODE GOES HERE */
A developer can write code such as the above. They will test their code using a front-end tool that sends individual single-row insert, update, and delete statements. They will perhaps, if you are lucky, also run a few quick tests from Management Studio – but again using singleton statements. They will then conclude that their code works correctly.
And then the code is deployed to production. And in production, a nightly job runs an import into a staging table, does some cleanup, and then inserts all new rows in a single statement. The trigger will run – correctly. Not correctly as intended, but correctly as written.
You see, a trigger does not run once per row (it does in some other products, but not in SQL Server). When you run an insert, update, or delete statement, a trigger will run exactly once, exposing all affected rows in the inserted and/or deleted pseudo-tables. This can be a single row, or it can be multiple rows, or no rows at all. The trigger needs to cater to all those possibilities. The developer needs to test all those scenarios. When you run a merge statement, a trigger can run up to three times (for inserted, updated, and deleted rows). You also need test cases for these scenarios.
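To sketch the merge behavior (dbo.Target and dbo.Source are hypothetical tables): a single MERGE statement that happens to insert some rows and update others runs the AFTER INSERT trigger logic once, for all inserted rows together, and the AFTER UPDATE trigger logic once, for all updated rows together.

```sql
MERGE dbo.Target AS t                       -- hypothetical tables
USING dbo.Source AS s
   ON t.ID = s.ID
WHEN MATCHED THEN
    UPDATE SET t.Val = s.Val                -- fires AFTER UPDATE triggers once
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, Val) VALUES (s.ID, s.Val);  -- fires AFTER INSERT triggers once
```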
The sample code above would fail some of these tests. The instructions are clear: look at all the rows in the inserted pseudo-table, pick one (more or less at random) to store in the variable, and then ignore the rest. The user will scream that the system malfunctions. The developer will look at their unit tests and claim that it’s SQL Server’s fault for not executing the trigger correctly. In reality, it is the developer’s fault for not writing and not testing their code correctly.
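The fix is to write the trigger set-based, so that it handles zero, one, or many rows in the inserted pseudo-table. As a sketch (all table and column names here are hypothetical, chosen only to illustrate the pattern):

```sql
CREATE TRIGGER TR_Orders_AfterInsert
ON dbo.Orders                               -- hypothetical table
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Process ALL rows in inserted with one set-based statement;
    -- correct for 0, 1, or many affected rows.
    UPDATE c
    SET    c.LastOrderDate = i.LastOrderDate
    FROM   dbo.Customers AS c               -- hypothetical table
    JOIN  (SELECT CustomerID,
                  MAX(OrderDate) AS LastOrderDate
           FROM   inserted
           GROUP BY CustomerID) AS i
      ON   i.CustomerID = c.CustomerID;
END;
```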
Unexpected side effects
The other common complaint is coming mostly from DBAs. Their problem is awareness. Or rather, lack thereof.
It is not that triggers are specifically hidden. They are visible in the Object Explorer. They can also be found by querying the DMVs (sys.triggers for DML triggers). However, it is true that they are not exactly in your face.
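If you want a quick overview rather than clicking through Object Explorer, a query along these lines lists every DML trigger and its parent table (sys.triggers is a documented catalog view; the exact column selection below is just one reasonable choice):

```sql
-- List all DML triggers and the tables they belong to
SELECT OBJECT_SCHEMA_NAME(tr.parent_id) AS SchemaName,
       OBJECT_NAME(tr.parent_id)        AS TableName,
       tr.name                          AS TriggerName,
       tr.is_instead_of_trigger,
       tr.is_disabled
FROM   sys.triggers AS tr
WHERE  tr.parent_class_desc = N'OBJECT_OR_COLUMN'   -- DML triggers only
ORDER  BY SchemaName, TableName, TriggerName;
```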
Suppose you are the DBA on call. You are disturbed in your sleep at 3AM, and you need to do a quick repair of some dirty data to ensure that the nightly batch can continue to run and finish before the maintenance window finishes and the first branches open for business. Are you sure that in your sleepy state you will always remember to click the Triggers tab in Object Explorer? Or can you imagine quickly typing a statement to delete or update a bad row and then firing it off without realizing that you just set off a trigger? And now, perhaps, that trigger has just sent a message on a Service Broker queue that will do all kinds of stuff in the remote CRM system, and send some emails to your client list. Oops….
This is an area where Microsoft can help out. Microsoft can change the Object Explorer so that tables (and views!) with triggers stand out a bit more. Would it be a huge effort to add a red lightning bolt symbol on top of the table symbol, as a warning that one or more active triggers exist for that table? If you would like to see this implemented, then please vote for my suggestion here.
Tool vendors can help as well. I am personally a huge fan of SQL Prompt. One of the nice features this tool offers is to pop up a warning symbol if I am about to execute a statement or a batch that includes an update or delete statement without WHERE clause. Would it be very hard for Red Gate to also implement a warning if I am about to run a statement that would affect a table or view that has an active trigger? Again: if you like it, vote for my suggestion!
Triggers are a great tool. They offer a functionality that is not otherwise available, and that can sometimes work wonders – just think of logging changes for audits, or INSTEAD OF triggers on views that would otherwise not be updatable.
When coded badly, they will fail. That is not a trigger problem. It is a bad code problem, which is often caused by incorrect assumptions on how triggers fire. Just remember, a trigger typically triggers once per statement, for all affected rows. And a MERGE statement is executed as combined INSERT, UPDATE, and DELETE actions, that each have their own trigger action.
It is possible to get unexpected results if you are not aware which triggers exist. That is an issue that developers and DBAs cannot fix. Microsoft and tool vendors can, and I have used their feedback channels to request exactly that.
It all started back in 2012 – yes, six years ago! That was when I was first asked if I wanted to be the technical reviewer for Grant Fritchey’s book on execution plans. I obviously said yes … and then a lot of time passed because first Grant was busy, then we decided to wait for the SQL Server 2014 changes to become clear, and then yet other things intervened.
Actual work started just over three years ago. If I had known back then that it would take this long, that I would spend so much time on this book, would I still have accepted?
It was an amazing journey. I learned more than I had ever anticipated; it was an absolute honor to be allowed to work with Grant; and it was a huge pleasure to see how much the quality of the book has improved. Partly due to my comments, but mostly obviously due to Grant. His first draft was already a major step up from the second edition and the way he took my feedback and ran with it was simply amazing.
Six years in the making. Three years of actual work. And an end product that is worth it!
The third edition of Grant’s book has something to offer for everyone. If you are just getting your feet wet with execution plans, then it will take you by the arm and guide you. Once you have done your first steps, it will accompany you along the way. But even if you already have years of experience with execution plans, I’m sure you will learn things you didn’t know already.
Oh, and did I already mention that it’s free? Yes, you read this right: it is free. Totally free of charge. Just click here and you can download it as a PDF.
Thinking about this topic, I remembered two moments in my career where I really was stuck … and then got unstuck in two very different ways.
The unconscious solution
The first event I want to share took place a long time ago. It was early in my career, my first programming job in fact. My job was to maintain and improve some programs, written in BASIC, for the administration of a laboratory.
I no longer remember what the issue was that had me (figuratively) banging my head against the wall. But I do remember the circumstances. It was a Friday, I was working on a program, and I simply did not see any solution to the issue I was facing. I was eager to get this program done before the weekend, so that I could start on a fresh assignment on Monday. But I did not see any way to solve my problem.
With hindsight, it is probably a good thing I didn’t own a car yet. If I had, I might have been sitting there, struggling to come up with something, until who knows how late. But now I had to leave the office in time to take the last bus home.
On the bus I was probably still mulling the issue over in my head. But once home there was distraction. I cooked a meal, ate, put my favorite records on the record player, and played some computer games. I went to visit my parents on Saturday and Sunday. All this time I had so much distraction that I never spent one extra second contemplating the problem I had been facing that Friday.
And then on Monday I went back to the office, asked myself “now where was I?”, and then remembered that I had been very much stuck the Friday before. Anticipating yet another day of banging my head against the wall, keyboard, and any other available object, I opened the program that I had been working on and the assignment document to remind me what it was I was about to do. I looked at the code. Looked at the assignment. Looked at the code again. And then I laughed at myself for getting stuck at this, because there was such an obvious, simple, elegant solution. I started typing, tested the code, and was done in less than half an hour.
I later realized that my mind apparently just needed to relax. When I was consciously trying to force myself to find a solution, my mind blocked. When I started relaxing and doing other things, I gave my subconscious mind the freedom to ponder the problem, without pressure or stress. I do not know whether the solution was born when I was playing games and listening to music, when I was sleeping, or when I was spending time with my parents. But I do know that it was there, ready for the taking, when I was back in the office.
This taught me an important lesson. Software developers (and probably most other professions too, but I’m not speaking for them) have their own version of writer’s block. If I run into that condition, then it doesn’t help to try to force a solution. It is far more efficient to just put the task to rest, pick up other tasks (or some time off), and let my subconscious mind do the hard work.
But taking time off until a solution presents itself is not a magic bullet that can always be used, as I experienced many years later. I was well into my 40s and I had switched to database development. At that time I was working on a system to facilitate the implementation of rule-based derivation and verification rules. All rules were entered in a database, and then the system would make sure that when data changed, derived data would automatically be updated and verifications were checked.
The problem I was facing was caused by the lack of restrictions. It was perfectly fine for users to specify rules where new information was derived from other derived information. My task was to design and write the code to optimize the order in which derivations were executed.
I had already spent multiple weeks on this, and I was getting nowhere. Whatever I tried, I kept getting incorrect results. Rules that should be evaluated were skipped, or were executed before the input data was derived, or sets of rules started running in endless loops. No matter what I did, I was unable to find the root cause. Whenever I found a problem I would issue a fix, only to get two new problems.
Taking my mind off the problem clearly didn’t work. There had been multiple weekends. I had worked on and completed various other tasks. I still ran into that same brick wall every time I returned to this issue.
Solving this was a lot of hard work. What I ended up doing was ditching the computer and the algorithm and resorting to pen and paper instead (well, actually an MS Word document). I came up with various examples of what should be valid input to the algorithm. I tried to come up with the ideal order of events myself. And then I finally realized why I was never able to find a solution to the problem – it is because there was none. Even some fairly basic scenarios already caused loops in the ordering, with no way to automatically find even a reasonably good order for processing the loop. The basic problem was in the foundation of the tool we were building. The assumption had always been that it should be possible to build this by tracking data that had changed and then processing the affected columns one at a time, in proper order. After several weeks of painstakingly working through countless examples, I was able to prove that this assumption was wrong.
I must admit that I was close to throwing in the towel at that point. But I didn’t. I now knew the problem. I knew we had built on the wrong assumption. So why not change the assumption, build a new foundation, and take it from there?
Because I already had put in so much effort, I knew exactly where the old assumption failed. In this specific system, the choice to base the processing order on “processing one column at a time” was wrong. Instead, the processing should be based on “performing one process at a time”, and now I suddenly had a much easier solution to my original problem. Finding the correct order for processes was easy. Yes, there were still cases where I ended up finding a loop but that was always because the input had a loop (did I already say that there were hardly any restrictions on the input?) – so that was okay.
But the entire foundation of how we tracked the information and how the processing was orchestrated would have to change. That was a lot of work, and I did not even want to start it before making sure that it would actually have the intended results. So I went back to my Word document, described what the architecture should look like, then stepped through the processing for all the examples I had used, to show that all of them would produce the correct result, in the most efficient processing order.
In the end, I spent several months (!!) analyzing the problem, pondering solutions, discussing options with my coworker, whiteboarding ideas, rejecting or refining as needed, prototyping, and mainly being very frustrated before I finally had the project rolling again. It is not often that one gets to invest so much time in a problem. It is luckily also not often that one needs this much time to find a solution. Let’s just say that this was a rather unique project.
This example taught me that relaxing doesn’t solve every issue. Sometimes it just takes effort. Hard work and lots of time are sometimes the only way to overcome an obstacle. And when that obstacle needs to be overcome, putting in the work is your only option.
For our profession, we need a mix of knowledge, experience, and creativity. Especially the latter can’t be forced. So if you feel that you have run into a brick wall and are making no progress, it can help to take your mind off the problem for a while.
Go fix another problem. Go home at the end of the day and spend time doing things you like. If time permits, move the issue to next week and go have a great weekend. Sometimes, just taking some time off is sufficient. While we are doing other things, some subconscious areas of our brains are still working on the problem. And these areas often operate far more efficiently without the rigid force of needing a solution now.
But not every problem has an easy solution. Sometimes, a problem appears to be incredibly hard because … well, because it IS incredibly hard. At those times, there is no substitute for hard work. We just have to roll up our sleeves, pour an extra cup of coffee and start grinding.
This is part eight of the plansplaining series. Each of these posts takes an execution plan with an interesting pattern, and details exactly how that plan works.
In this post we look at a query that, given the known limitations of the available join operators, appears to be impossible to execute. But it’s a legal query so it should run. And it does. But how?
The query below can be executed in any version of the AdventureWorks sample database. Don’t bother understanding the logic, there is none. It is merely constructed to show how SQL Server handles what appears to be an impossible situation.
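As a sketch of such a query (reconstructed from the plan details discussed below – aliases d1 and d2 on HumanResources.Department, joined on a single inequality predicate; the exact original query may have differed):

```sql
SELECT     d1.Name,
           d2.GroupName
FROM       HumanResources.Department AS d1
FULL OUTER JOIN
           HumanResources.Department AS d2
      ON   d1.DepartmentID < d2.DepartmentID;
```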
If you look at the descriptions of the various join operators in the Execution Plan Reference, you will see that this query poses what appears to be an unsolvable problem for the optimizer: none of the join operators can be used for this query!
The Nested Loops algorithm can only be used for inner join, left outer join, left semi join, or left anti semi join. A right outer join or right [anti] semi join in the query can still use Nested Loops if the optimizer can reverse the inputs. But this query calls for a full outer join, which is simply not supported by Nested Loops.
Both the Merge Join and the Hash Match algorithms require at least one equality predicate in the join criteria. This query has just a single join criterion, and it is an inequality. So these algorithms are also not eligible for this query. And Adaptive Join (not yet described in the Execution Plan Reference) chooses between the Nested Loops and Hash Match algorithms at runtime, so it suffers from the restrictions of both.
The query is legal syntax and should run. And it does indeed. Let’s look at the execution plan to find out how the optimizer solved this conundrum. (As usual, I have edited the NodeID numbers of each operator into the screenshot for easy reference).
Apparently, trying to make SQL Server do the impossible merely results in making SQL Server do a lot of extra work. Instead of two scans and a single join, as we would expect in most cases of a single two-table join query, we now see four scans, two joins, and some additional extra work. Time to break it down and see what’s going on!
The Concatenation operator has a quite simple operation. It has two or more inputs and a single output. All it does is request rows from its first input and pass them on unchanged; then, once the first input is exhausted, repeat the same for the second input; and so on until it has processed all of its inputs. In short, it has the same function in an execution plan that UNION ALL has in a T-SQL query, except that UNION ALL always concatenates the results of two queries, and Concatenation can concatenate as many inputs as needed.
In this case there are exactly two inputs, so we can immediately see that the execution plan consists of two independent sub-trees, and their results are concatenated and then returned to the client.
The top input
The first input to the Concatenation operator consists of three operators: Index Scan #2 which reads columns DepartmentID and Name from the table with alias d1; Clustered Index Scan #3 which reads columns DepartmentID and GroupName from the table with alias d2; and Nested Loops #1 to join them together, using logical join operation Left Outer Join.
I am not going to spend much time on this segment of the execution plan. It does exactly what it appears to do: read the two input tables and produce a left outer join result of the data. If you change FULL OUTER JOIN in the original query to LEFT OUTER JOIN, you will get an execution plan that looks exactly like this segment.
The observant reader might already have a clue at this point as to what is going on. Just think about the definition of what a full outer join should return – basically, the matching rows joined as normal (same as an inner join), plus the unmatched rows from the first table (with nulls for columns from the second table), plus the unmatched rows from the second table (with nulls for columns from the first table). A left outer join, as implemented here, returns exactly the first two items from this list.
The bottom input
Since the first input for Concatenation #0 already returns two of the three items from the list above, the second input should be expected to return the rest: rows from the second table (d2) that have no match in the first (d1). That sounds like a NOT EXISTS in T-SQL terminology, or a left anti semi join in execution plan terms.
Looking at the second input, you can indeed see elements of this expected left anti semi join, but there are a few extra operators as well. So let’s break this down into its parts.
Clustered Index Scan #6
This operator reads DepartmentID and GroupName from the table with alias d2. (This can be seen by reviewing its properties, specifically the Output List property – I decided not to include screenshots of all property lists in this post, merely the more interesting ones.)
No big surprises here.
Nested Loops #5
As you can see in the properties, this operator uses logical operation Left Anti Semi Join. This means that it looks for rows in the second input that match a specific condition; if it finds a match the row from the first input is discarded and the scan of the second input is aborted; if the scan completes without finding a match the row from the first input is passed. This is equivalent to using the NOT EXISTS keyword in a query.
The properties of this operator reveal that there is no Outer References property. This implies that the inner (bottom) input is “static” – it returns the same set of rows for every row from the outer (top) input. For each of these rows the Predicate property is evaluated to determine whether the row is a match or not.
This predicate reads (simplified) d2.DepartmentID > Expr1002. This is similar to the predicate in our query but d1.DepartmentID has been replaced by Expr1002. Why?
Clustered Index Scan #8
As expected, this operator reads the table with alias d1, returning only the DepartmentID column (which makes sense because for this part of the query the Name column from the d1 table is not needed).
The optimizer has set the Ordered property to false, so the storage engine is free to return the rows in any order it sees fit. (The same is the case for all other scan operators in this execution plan but here it is a bit more relevant as we will see later).
Stream Aggregate #7
The properties of this operator (no screenshot shown) reveal that there is no Group By property. As explained on the Stream Aggregate page in the Execution Plan Reference, this means that it functions as a scalar aggregate. It returns a single row with the aggregation result of its entire input.
What type of aggregation is used can be seen in the Defined Values property (see screenshot). As you can see, this is where the Expr1002 column we saw referenced in Nested Loops #5 originates from. Simplified by removing unneeded four-part naming, parentheses, and brackets, the expression reads: “Expr1002 = MIN(d1.DepartmentID)”.
Nested Loops #5 (again)
Now that we have seen both inputs and the properties of the join operator itself we can bring these elements together to understand how it works.
All rows from table d2 are read. For each of these, the bottom input returns a single row with the lowest value of d1.DepartmentID, which is then compared to d2.DepartmentID to determine whether or not the row from d2 should be returned.
Since this is a left anti semi join with a predicate d2.DepartmentID > d1.DepartmentID, the transformation to d2.DepartmentID > MIN(d1.DepartmentID) is indeed correct; it does not change the results in any way. It only affects performance. For the query as written, the bottom input of Nested Loops #5 (which executes 16 times) reads 16 rows, finds the minimum value, and then compares that single value to d2.DepartmentID in the join operator. Without the Stream Aggregate operator, it would instead read the same 16 rows, return all of them to the join operator, and compare them there. Apparently, the optimizer still estimates that aggregating the data first is cheaper. (I personally doubt this – after all, without the aggregation the lower input would stop executing as soon as a match is found; in this case that would almost always be after reading the very first row. But even in other cases it is reasonable to assume that there is at least a non-zero chance of reading fewer rows if you compare each row and stop after finding a match, as opposed to first reading and aggregating all rows and only then doing the comparison.)
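The logical equivalence behind this transformation can be written out in T-SQL. The snippet below is my own illustration, not part of the original query, and it glosses over edge cases around NULLs and empty tables (the table name is assumed to be the one used in the sample query):

```sql
-- For the anti semi join, these two predicates keep the same d2 rows:
WHERE NOT EXISTS (SELECT *
                  FROM   HumanResources.Department AS d1
                  WHERE  d2.DepartmentID > d1.DepartmentID)

-- ...because a value exceeds at least one d1.DepartmentID
-- if and only if it exceeds the smallest one:
WHERE NOT (d2.DepartmentID > (SELECT MIN(d1.DepartmentID)
                              FROM   HumanResources.Department AS d1))
```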
Compute Scalar #4
The final operator to discuss is the Compute Scalar. Looking at its properties, it is easy to see what it does: it simply adds an extra column to each row returned from Nested Loops #5; this column is called (simplified) d1.Name and is set to null.
This makes sense. The Nested Loops operator returns only a single column, d2.GroupName. But Concatenation expects all input data streams to return the same number of columns, so the equivalent of the d1.Name column must be added before the stream can flow into the Concatenation operator.
Bringing it together
As already mentioned above, the Concatenation operator effectively combines the results of two independent queries. The first, shown on top in the execution plan, performs a left outer join between the two inputs. This returns two of the three elements that the query asks for: all matching rows (joined as normal), and all unmatched rows from the first table. The bottom input of the query then finds only the unmatched rows from the second table, which the Concatenation operator then adds to the results.
Effectively, the execution plan can be thought of as implementing this query, which is semantically equivalent to the original query though more complex:
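Based on the operators discussed above, that equivalent query can be sketched as follows. The table name, aliases, and column list are my assumptions, inferred from the plan (a self-join of HumanResources.Department on an inequality predicate):

```sql
-- Top input of the Concatenation: matching rows plus unmatched rows from d1
SELECT          d1.Name,
                d2.GroupName
FROM            HumanResources.Department AS d1
LEFT OUTER JOIN HumanResources.Department AS d2
      ON        d2.DepartmentID > d1.DepartmentID

UNION ALL

-- Bottom input: unmatched rows from d2 (the left anti semi join),
-- with the NULL that Compute Scalar #4 adds for the d1.Name column
SELECT NULL,
       d2.GroupName
FROM   HumanResources.Department AS d2
WHERE NOT EXISTS (SELECT *
                  FROM   HumanResources.Department AS d1
                  WHERE  d2.DepartmentID > d1.DepartmentID);
```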
If you execute the query above and look at its execution plan, you will see that this query does indeed have the exact same execution plan as our original query.
It may appear wasteful to do two joins and four scans (plus the extra aggregation and concatenation) instead of a single join and two scans. But as explained in the introduction, I forced the optimizer to find a workaround because I submitted a query that simply could not be executed as a single join between two tables, due to limitations in the available algorithms. So in that sense, this is really the best the optimizer could do.
That being said, there certainly is room for improvement. I already mentioned above that I doubt whether aggregating all rows to do just a single comparison in the Nested Loops operator is actually more effective than returning individual rows – especially because a Nested Loops with logical operation Left Anti Semi Join can stop calling rows from the inner input after the first match is found.
There are two other optimizations possible that in this case were probably not done because of the small table size (16 rows for both input tables), but that I have actually seen in execution plans for similar queries on much larger tables. The first is based on the observation that Nested Loops #5 uses a static inner input. This means that it will be the same for each execution – so why compute it over and over again? Why not add a Table Spool operator so that the actual computation of the MIN(d1.DepartmentID) is done only once, storing the result in a worktable which can then produce it again on every execution without having to recompute it? And also, since DepartmentID is the clustered index key, why not change the Ordered property on Clustered Index Scan #8 to true and add a Top operator, so that only the first row (which by definition stores the lowest DepartmentID value) is returned?
However, my biggest disappointment is that the optimizer even chose to use a static inner input at all. I really would have expected to see a dynamic inner input on Nested Loops #5. The Outer References property would have been set to d2.DepartmentID. And the entire inner input would have been replaced by a single Clustered Index Seek operator on table d1, with its Seek Predicates property set to d2.DepartmentID > d1.DepartmentID. Each execution of this inner input would then navigate the index to the first matching row and either return nothing (if no match exists) or return the first match – and then not be called again so any other matching rows are not returned.
Long story short, I would have preferred to get the execution plan that is actually produced for the (equivalent) query below:
All join operators have their specific limitations. Due to these, it is possible to write a query that at first glance appears impossible to run. But as long as it is a syntactically correct query, SQL Server has to be able to run it.
In this post we looked at the hoops SQL Server has to jump through in one such case, where the inequality-based join predicate forces a Nested Loops join, but the query asks for a join type that is not supported by Nested Loops. We saw that in this case the query effectively gets rewritten to an equivalent but more complex query that can use supported operators.
I am still open for suggestions on topics to cover in this series. If you see an unusual and interesting pattern in an execution plan, let me know so I can consider using it in a future post!
Almost every conference I go to, either as speaker or as an attendee, indicates the level of sessions. Sometimes this is done with a description (beginner / intermediate / advanced), more often with a number (typically 100 – 200 – 300 – 400 – 500, though I have seen other scales as well).
This is obviously a good thing to do. If you are just new in the field, attending a highly advanced session would be a waste of time. And if you have many years of experience then you will probably not learn much from a beginner-level session.
But there are some issues with these level indicators. I do not pretend that I have the answers – in fact, I am sure I don’t. But I do wish to share my thoughts.
As mentioned above, session levels are typically represented with either a description or with a number. Descriptions are typically given as “beginner / intermediate / advanced”. While that sounds pretty clear, there is still a lot of room for interpretation. When exactly does one make the transition from beginner to intermediate, and from intermediate to advanced? Some people think the beginner tag only applies for a few months, others use it for a few years. Some people will never label themselves as advanced, no matter how much experience they have.
The numerical labels are even worse. I hear people utter phrases such as “that is a level 300 session”, but what exactly is “level 300”? PASS has made an attempt to clarify this here, with a one-word description, an indication of experience, and a short explanation. Here are the levels with their description and experience indication (follow the link above to read the longer explanations):
100 = novice, up to 1 year
200 = intermediate, 1 – 3 years
300 = advanced, 4 – 6 years
400 = expert, 6+ years
500 = advanced level, 8+ years
The combination of the number and the explanation makes this actually better than the single-word description some other conferences use. But unfortunately, it is limited to just the PASS Summit. Other conferences use other definitions for the same 3-digit numbers. I have seen conferences where level 100 is defined as “management-level overview, please do not use this level for technical content”. I have seen conferences that use level 400 as the highest level.
In other words, in most cases the three-digit level indicator is even worse than the one-word descriptions.
It’s all relative
Another problem is that experience is not measured in just years. There is a huge difference between having five years of experience in a single job at a single department of the same company, doing the same thing over and over, and having five years of experience in various roles within the same or different companies. There can also be a significant difference between two people in the exact same job if one of them is so eager to learn that she spends lots of her free time at conferences, reading blogs, or playing with new features on her own laptop, while the other clocks out at five and spends his spare time on things not related to his job at all. Plus, we are all different; even when doing the exact same thing for the exact same time, one person learns quicker than another.
That’s why the amount of experience that PASS (and probably other events as well) lists for each level should be taken as a guideline, not as a strict rule. Some people with only 4 years of experience will still be comfortable in a level 400 session. Others may have 9 years of experience but struggle to keep up in that same session.
Speaker and attendee may well have a different interpretation of what does and doesn’t count as experience. Let’s look at a level 200 session on log shipping. According to PASS, this assumes 1 – 3 years of experience. But experience with what? Do you start counting when you first touched a database? When you started your first job as SQL Server DBA? Or when you first started using the specific technology covered (log shipping, in this case)?
Perhaps the speaker has created a session that targets people who have never used log shipping but are interested in the feature. But they didn’t want to spend valuable session time on explaining how full recovery works and what log backups are, so they set the level of the session at 200, thinking “attendees with 1 – 3 years of experience as a DBA will know what log backups are and how to schedule them; I can use that as my starting point and build on that to show log shipping”.
And perhaps an attendee with 7 years DBA experience sees that session and thinks “hmmm, my company could benefit from this technique but I have never used this, I actually need a level 100 session on log shipping because I have no experience at all in that field”.
Content vs delivery
Perhaps my biggest gripe with session levels though is that it is unclear whether the level describes the content of the session, or the delivery. I like to believe that this is not a very important distinction at the more beginner-oriented levels. But at advanced levels it does make quite a difference.
For example, last year at SQL Saturday Holland I delivered a session titled “Hash Match, the Operator”. That session not only covers what the Hash Match operator does and how it works, but it also goes into undocumented details such as how exactly the hash table is built, how the memory grant is computed, and all the phases of a hash spill situation. You may think that this description of Hash Match is deep, but that session is even deeper. So I have no doubt at all as to the content of this session: it is definitely level 500.
But how about the delivery? In other words, what did I assume my audience already knew, what did I expect them not to know, and in what way did I try to convey that knowledge? If I had purely targeted a 500 level audience, I should have assumed 8+ years of relevant experience. So that means that I did not have to explain what Hash Match does, what a hash function and a hash table are, or how a Left Anti Semi Join differs from a Right Outer Join. And I could have kept my explanations fast and high-paced, describing complex pointer structures within the hash table architecture with just a few words.
I chose differently. I expected my audience to be aware of what Hash Match does, but not necessarily to already know what a hash function and a hash table are. I expected them to know what pointers are, but still provided visual representations of a hash table to help them better understand the architecture. I chose to take a very advanced, very complex (and very specialized) topic, yet explain it in a way that I think most of a 300-level audience should be able to follow.
Some people were probably happy with this choice. They might have struggled if I had skipped the first part of the session. Other people were probably unhappy; they might have felt that, had I not spent so much time explaining the basics or visualizing things they would have understood from a quick remark, I could have covered more ground. There is no way to please everyone.
But even more important in the context of this post: there is no way in just a single level indication to clarify which part of the audience I aim to please the most.
A single keyword- or number-based indication of session level is a good way to give the audience a quick indication of what to expect. But it is not more than just a quick indication and it has many shortcomings.
Speakers should get into the habit of giving a clear, longer explanation of what level of content the audience can expect, and what level of expertise they expect the audience to already have. Conference organizers should make sure that the forms where speakers submit an abstract allow for enough characters for the speakers to add this information. Conference organizers should also ensure that this information is easily accessible to their audience. And conference attendees then need to look beyond just the title, presenter and level indication of a session, but also check the full abstract (that then hopefully includes the information they need to check whether or not they are the intended audience).
(EDIT: Fixed three typos – thanks Kalen for pointing them out to me!)
In 2009, Adam Machanic (b|t) started the monthly blog event known as T-SQL Tuesday. One person picks a topic, and everyone is invited to blog about it on the second Tuesday of the month. The July 2018 host is Bert Wagner (b|t), and his chosen topic is: Code You Would Hate To Live Without. Talk about a broad topic! I could fill a book on that topic, and the same probably goes for many others.
People who know me or follow my writings will know, or at least suspect, that database administration is not my primary interest. I see myself as a database developer, focused on writing T-SQL code to implement new solutions, debugging existing T-SQL code, and of course optimizing the performance of slow T-SQL code.
But several of my past and present customers are not large enough for separate DBA and DB-dev positions. So they ask me to do the DBA duties “on the side”. And since these are usually fairly simple infrastructures, with just a few servers, and no complex matters such as Always On Availability Groups, replication, or SAN administrators, I can handle those responsibilities.
One of my first actions for such clients is always to review their maintenance setup. The second is usually to rip out whatever they had set up and replace it with something much better: Ola Hallengren’s database maintenance scripts.
That all went well until I got hired by a customer running BizTalk. The “problem” with BizTalk is that it comes with its own backup jobs, and not using those for backups invalidates support. So while I have my own personal opinions about these special BizTalk backup jobs, I can’t deny the relevance of support.
Unfortunately, the BizTalk backup job only runs backups for the databases that are part of the BizTalk setup. Other databases on the server are ignored; keeping them safe is suddenly on my plate again. Now if you add in that other people can create databases, either through BizTalk (which will then be maintained by BizTalk), or directly (and not included in the BizTalk backup logic), you may start to see how it can be … challenging to make sure that I have backups for exactly all the right databases on the server.
Eventually I found a good solution to this problem. A few lines of code to enhance Ola (or at least his backup script). Here is a sample of the code that I now use for the backup jobs on all BizTalk instances (this one is for the full backup of all databases; I am sure you can work out the changes in the jobs for differential and log backups):
-- Ensure that databases under BizTalk control are not backed up by this job
-- Back up
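The two comment lines above are what survives of the original script in this copy. A hedged reconstruction of the full job step, based on the description that follows, might look like this (the BizTalk column name, backup directory, and parameter values other than Ola Hallengren’s documented @Databases exclusion syntax are my assumptions):

```sql
-- Hedged sketch: exact BizTalk view/column names and backup settings
-- are assumptions; adjust to your environment.
DECLARE @Databases nvarchar(max) = N'ALL_DATABASES';

-- Exclude every database that the BizTalk backup job maintains
SELECT @Databases += N',-' + DatabaseName
FROM   BizTalkMgmtDb.dbo.admv_BackupDatabases;

-- Back up all remaining databases using Ola Hallengren's script
EXECUTE dbo.DatabaseBackup
        @Databases   = @Databases,
        @Directory   = N'X:\Backup',   -- assumption
        @BackupType  = 'FULL',
        @Verify      = 'Y',
        @CleanupTime = 48;
```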
What this script does is:
Check the table that the BizTalk backup job uses to find which database it needs to process.
Create a string variable that reads “ALL_DATABASES,-db1,-db2,…” (using all those BizTalk managed databases in the concatenation).
Pass that string to Ola’s backup script, which tells it to run a backup for all databases except those in the enumeration.
This works perfectly for me. When a new BizTalk database is created, BizTalk itself adds it to the admv_BackupDatabases table and Ola’s job will never touch it. When a new non-BizTalk database is created, Ola’s own logic will pick it up for processing. And I can sit back and relax, knowing that each database on this server is safely backed up by exactly one of the two backup procedures.
In 2009, Adam Machanic (b|t) started the monthly blog event known as T-SQL Tuesday. One person picks a topic, and everyone is invited to blog about it on the second Tuesday of the month. The July 2018 host is Bert Wagner (b|t), and his chosen topic is: Code You Would Hate To Live Without. Talk about a broad topic! I could fill a book on that topic, and the same probably goes for many others.
But since my time is limited, I have decided to cherry-pick just two of my favorite scripts. This post is about the first: sp_metasearch.
This might come as a surprise to some. There are actually companies that do not have perfect documentation of everything they have in their database! And for some reason, I always find these companies as my customers. And that means that, before I can make an innocent change such as changing a data type or dropping a few tables, I need to have a way to check which objects might be affected.
The code to search the DMOs is not very hard. But typing it over and over again quickly becomes tedious. That’s why I decided to turn this into a stored procedure.
Here is the code to create the stored procedure sp_metasearch. Most of the code is fairly simple if you know the structure of the various dynamic management objects used. I will briefly touch on some points below the code.
-- Create the stored procedure in the master database
-- Yes, I use the sp_ prefix. Yes, this is deliberate. See explanation in blog.
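The procedure body itself did not survive in this copy of the post; the sketch below is my own reconstruction based on the description that follows (the parameter name, result columns, and exact DMO queries are assumptions – only sys.all_sql_modules, msdb.dbo.sysjobsteps, and sp_MS_marksystemobject are named in the post):

```sql
-- Create the stored procedure in the master database
USE master;
GO
CREATE PROCEDURE dbo.sp_metasearch
        @SearchString nvarchar(200)
AS
BEGIN;
    SET NOCOUNT ON;

    -- Result set 1: modules (procedures, views, functions, triggers)
    -- in the current database that contain the search string
    SELECT  OBJECT_SCHEMA_NAME(m.object_id) AS SchemaName,
            OBJECT_NAME(m.object_id)        AS ObjectName,
            OBJECT_NAME(o.parent_object_id) AS ParentObjectName,
            o.type_desc                     AS ObjectType
    FROM       sys.all_sql_modules AS m
    INNER JOIN sys.all_objects     AS o
          ON   o.object_id = m.object_id
    WHERE   m.definition LIKE N'%' + @SearchString + N'%'
    ORDER BY SchemaName, ObjectName;

    -- Result set 2: SQL Agent job steps that start in the current database
    SELECT  j.name      AS JobName,
            s.step_name AS StepName
    FROM       msdb.dbo.sysjobs     AS j
    INNER JOIN msdb.dbo.sysjobsteps AS s
          ON   s.job_id = j.job_id
    WHERE   s.database_name = DB_NAME()
    AND     s.command LIKE N'%' + @SearchString + N'%';
END;
GO
-- Mark as system object so the procedure executes in the context
-- of whatever database it is called from
EXECUTE sys.sp_MS_marksystemobject 'dbo.sp_metasearch';
```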
You may notice that (1) I create the stored procedure in the master database; (2) its name starts with the “forbidden” sp_ prefix, and (3) I use the undocumented stored procedure sp_MS_marksystemobject to make SQL Server believe that this is a system object. I do that deliberately, because the exact combination of these three ingredients allows me to call the stored procedure from any database and have it execute within the context of that database, which for this specific functionality is a great feature! But yes, I am aware of all the risks I take doing this, and so should you if you decide to copy this code snippet!
Example usage and results
Once the stored procedure is created, using it is easy. Let’s say I am working on the AdventureWorks2017 database and I need to change the data type of the SubTotal column in table Purchasing.PurchaseOrderHeader. Before I do that I want to review all code that might potentially be affected so I open a query window and execute this code:
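A call along these lines would do it (the parameter shape shown here is an assumption; adjust it to match how you defined the procedure):

```sql
USE AdventureWorks2017;
GO
-- Thanks to the sp_ prefix and system-object marking, this runs
-- in the context of the current database
EXECUTE dbo.sp_metasearch @SearchString = N'SubTotal';
```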
The results on my system are as follows:
The two result sets show me all objects I need to verify. For triggers, the ParentObjectName tells me what table the trigger is defined on. The results are ordered in a way that makes it easy for me to work down the list of items to check in SSMS.
Limitations and variations
I wrote this code for personal use, and as such it has a number of limitations that you need to be aware of if you want to “steal” it. I also often write variations on this code for other customers, to make it more suited to their environment.
As you have seen in the code, the search uses a simple brute-force text comparison. A benefit of this is that I can search for everything: table and column names, but also subsets of those names, longer phrases, or whatever else I fancy (e.g. “--TODO” to find all my unfinished business!!).
But this strength is also a weakness because it increases the risk of false positives, for instance in code that was removed by commenting, or when a search phrase is part of a common word. One day a customer wanted to change the codes they use in a code table for product types and they were unsure whether there might be hard-coded references to these codes. One of the codes was “PDA”. Using sp_metasearch to find potential locations returned a huge list: each and every stored procedure that includes the keyword UPDATE. Fun times!
You may also have noticed that this code limits the search to the current database only, and that it only searches in SQL Agent jobs that start in the current database. That made sense at the customer where I first developed and used this stored procedure, where each database was completely self-contained and there were no interactions between databases.
I have also worked at customers where cross-database queries were common; there it made more sense to remove the database filter in the second query, and to use sp_foreachdb so that all databases on the instance are searched. At one of my current customers I even use a central management server to connect a single window to all relevant servers and execute (a variation on) this code on multiple instances at once.
There are also some limitations in where I do and do not search, because I built this just for my needs. I do search the DMO “all_sql_modules”, which contains the text of stored procedures, views, user-defined functions, and triggers. I also search in the commands of all SQL Agent jobs. But that’s it. I will never find matches to the search string in your personal library of helpful T-SQL and Powershell scripts. Or in the SSIS catalog. Or in the SSRS report collection. Or in synonyms. And depending on how exactly a dynamic SQL string is built, I may miss that too.
And those are just the omissions that I am aware of; there may be other locations where you could find code that I have not yet run into.
It’s not that these issues are unfixable. Most can be fixed (though searching your personal library of SQL scripts might pose a bit of an issue). I just have not yet worked for customers where I needed to do this. I know this when I use sp_metasearch so it won’t catch me by surprise. And now you know this as well!
If you like what sp_metasearch can do for you, then feel free to copy and use it. If you see errors in the code, or if you extend the code to search in other locations as well, then I would appreciate hearing from you. Please use the comment form below to share how you used or improved this code!
This is part seven of the plansplaining series. Each of these posts takes an execution plan with an interesting pattern, and details exactly how that plan works.
In this post we look at a deceptively simple query: a simple SELECT with an ISNULL to show either a row returned or a placeholder value. And yet there is more going on under the covers than one might expect.
The query below can be executed in any version of the AdventureWorks sample database. It returns a code string representing the version number on the 2016 and 2017 versions; on older versions it instead returns the hardcoded text “Older version”.
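Reconstructed from the plan described in this post (the Top (1), the pushed-down predicate [Database Version] >= N'13', and the isnull in the Compute Scalar), the query looks roughly like this; the exact literals and column alias are assumptions:

```sql
-- Returns the version string on AdventureWorks2016/2017,
-- or the placeholder text on older versions
SELECT ISNULL((SELECT TOP (1) [Database Version]
               FROM   dbo.AWBuildVersion
               WHERE  [Database Version] >= N'13'),
              'Older version') AS [Database Version];
```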
So what do we have here? A straightforward single-table query with a very simple predicate, embedded in an ISNULL function call. You would probably expect some scan or seek plus a Top operator to evaluate the subquery, plus a Compute Scalar for the ISNULL. In reality the plan is not quite as simple, as shown below. (As in earlier episodes, I have added the NodeID numbers into the execution plan to make it easier to reference individual operators).
We do indeed see a Top (#3), a scan (#4), and a Compute Scalar (#0), but there are also two extra, unexpected operators: Nested Loops (#1), and Constant Scan (#2). Why are those in the execution plan?
Constant Scan #2
In part 3 of this series I wrote: “Whenever we see a non-obvious Constant Scan operator, our first stop should always be its properties, specifically the Output List and Values properties”. Well, this Constant Scan is definitely not obvious, so let’s look at its properties.
There is an Output List property, but it is completely empty. And there is no Values property at all. That’s … odd! Does this mean that this Constant Scan returns nothing? Not exactly. A Constant Scan always returns at least one row. It normally returns one or more rows, with columns as defined in the Output List property and values as defined in the Values property. But when those properties are missing, it still returns a single row.
That single row is special. It is a type of row that we would never be able to use in our queries; it can only exist in intermediate phases in an execution plan, because it has zero columns. This can be seen in the properties if you look at the Estimated Row Size property, which is 9 bytes. That is exactly the size of the row header within execution plans, a small area where metadata of each row is described (very similar to the row header you will find on data pages for rows stored in tables). The Actual Number of Rows property does indeed confirm that a single row was actually returned by this operator.
If you look at many execution plans you will notice this usage of Constant Scan more often, usually combined with a Compute Scalar. Constant Scan cannot do computations, but Compute Scalar needs an input row to start working and to store its result in; the combination of the two works fine: Constant Scan generates an “empty” row and Compute Scalar then adds data to it.
In this case there is no Compute Scalar to the direct left of Constant Scan though. So let’s see what happens next.
Nested Loops #1
If you read the previous parts you already know that execution doesn’t start at Constant Scan. It starts at the far left at SELECT, which calls Top; Top then calls Nested Loops, which calls Constant Scan. As soon as Constant Scan returns its first row (we already know that it’s the only row but Nested Loops does not know or care about that yet), Nested Loops calls into its inner (lower) input by requesting a row from the Top operator.
Top #3 and Clustered Index Scan #4
This part of the execution plan is pretty obvious so I’m not going to cover it in detail. When Top is called, it first calls the Clustered Index Scan operator. This operator has a pushed down search predicate for [Database Version] >= N’13’. Since the AWBuildVersion table in AdventureWorks always contains exactly one row, this scan will either return that row, or no data at all (in which case an “end of data” signal is returned instead).
If a row is returned, Top will pass that row unchanged to Nested Loops. If no row was returned from Clustered Index Scan to Top, then Top will obviously also not return a row to Nested Loops.
Nested Loops #1 (continued)
At this point we know that the Nested Loops operator was called, it then first called its outer input, Constant Scan, to receive an empty row; it then proceeded to call its inner input, Top, from where it may or may not (depending on the version of AdventureWorks) have received a row.
The next thing to do is to check the join condition (in the Predicate property). In this case that property is not present, which means that every row is considered a match. So if Top returns a row, it is considered a match for the empty row returned from Constant Scan and a row is returned to Compute Scalar; if Top returns no row then it is not a match but because the Logical Operation is a Left Outer Join it will still return a row, using Null values for columns originating from the inner input.
The Output List shows that the row returned contains just a single column: [Database Version]. On a newer version of AdventureWorks, this will indeed be the database version from the AWBuildVersion table; on older versions the inner input returns no row so the Nested Loops will provide a Null value in this column.
The Compute Scalar operator (#0) uses a simple isnull expression to replace Null values with the ‘Older version’ text from the query while letting the actual database version on newer builds pass unchanged. (Because of this simplicity I decided not to add a screenshot for this). This row is then returned to the client. Because the client does not know that this will always be a single-row result, it then calls Compute Scalar again, which calls Nested Loops. At this point there are two options.
On newer versions, the inner input of Nested Loops returned a row, so Nested Loops calls it again to check if there are more matching rows. The first operator in that branch is Top, and its Top Expression property is (1). Since it already returned a row it will not bother to call the Clustered Index Scan operator again but immediately return “end of data”. At that point Nested Loops will switch back to the outer input (the Constant Scan operator). But because Constant Scan has no Values property, it will only return a single empty row, not multiple – on this second call it will return “end of data”, so now Nested Loops also returns “end of data” which then propagates through the rest of the plan.
On older versions of AdventureWorks, the inner input had already returned “end of data” after the first call, so in this case the Nested Loops operator will immediately return to the outer input. Obviously with the same end result.
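The calling pattern described above can be mimicked in a short Python sketch. This is my own illustration of the pull-based model, not SQL Server code; the column name and values are just placeholders:

```python
def nested_loops_left_outer(outer_rows, inner_side):
    # Nested Loops, Left Outer Join, no Predicate: every inner row is a
    # match; if the inner input returns nothing, the outer row is still
    # returned, with NULL (None) for columns from the inner input.
    for outer in outer_rows:
        matched = False
        for inner in inner_side():
            matched = True
            yield {**outer, **inner}
        if not matched:
            yield {**outer, "Database Version": None}

# Constant Scan with no Values property: a single empty row.
constant_scan = [{}]

# Newer AdventureWorks: Top (1) over the scan returns one row.
newer = lambda: iter([{"Database Version": "14.00.25.00"}])
print(list(nested_loops_left_outer(constant_scan, newer)))
# -> [{'Database Version': '14.00.25.00'}]

# Older AdventureWorks: the inner input returns no row at all.
older = lambda: iter([])
print(list(nested_loops_left_outer(constant_scan, older)))
# -> [{'Database Version': None}]
```

Either way, exactly one row reaches Compute Scalar #0, which is the whole point of this plan shape.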
But … why?
All the above explains how the execution plan for this query works. But it doesn’t really explain why these extra operators were needed. Wouldn’t SQL Server have been able to return the same result by simplifying the execution plan?
The picture above is not captured from SQL Server; I created it using copy and paste on the screenshot of the original execution plan, to illustrate the type of execution plan many people would have expected for our sample query. Why was this much simpler execution plan not chosen?
If you somehow manage to force SQL Server to run the above execution plan in an AdventureWorks2016 (or 2017) database, you will actually get the expected result. And that makes sense if you follow along: operators call each other left to right until the Clustered Index Scan starts, which finds a row to return. Top passes that row because its only task is to stop execution AFTER the first row. Compute Scalar then applies the isnull expression and your database version is returned; Top then ensures that no second row is returned even if more data would exist in the AWBuildVersion table.
But this same execution plan on AdventureWorks2014 (or older) would behave differently. The Clustered Index Scan operator would not find a row to return. This means that “end of data” flows from the scan through Top, Compute Scalar, back to SELECT and you would get an empty result set instead of the single row with the text “Older version”.
At this point you might be wondering “but what about the isnull expression in the Compute Scalar operator?” Understandable. But here is a very important thing to remember: almost all operators in execution plans operate only on the rows they receive. Scan and seek operators receive rows from the storage subsystem; Constant Scan receives rows from its own properties; all other operators receive rows from their child operators in the plan. The only exceptions I know of are a Constant Scan with no Values property (which returns a single empty row), and a Stream Aggregate operator with no Group By property (which still needs an input but returns a single row if its input is empty).
Compute Scalar is not an exception. Compute Scalar computes its expressions for each row it receives, adds the computed values and then returns that row. If it never receives a row, then it will not compute its expressions, because there is no input data to use in those expressions and no row to add these values to. That’s why the faked execution plan above would not return the expected result.
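A small Python sketch shows why. This is again my own model, with a made-up column name (Expr1002): with no input row, the isnull expression simply never runs.

```python
def compute_scalar(child_rows, expression):
    # Compute Scalar evaluates its expression once per row it receives
    # and adds the result to that row. No input rows: no evaluation.
    for row in child_rows:
        yield {**row, "Expr1002": expression(row)}

# The isnull logic from the query, as a Python expression.
isnull = lambda r: r["Database Version"] if r["Database Version"] is not None else "Older version"

# With an input row, isnull runs as expected.
print(list(compute_scalar(iter([{"Database Version": None}]), isnull)))
# -> [{'Database Version': None, 'Expr1002': 'Older version'}]

# With no input rows, isnull never runs and nothing is returned.
print(list(compute_scalar(iter([]), isnull)))
# -> []
```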
A Constant Scan operator normally returns one or more rows with one or more columns of data, as defined in the Output List and Values properties. However, Constant Scan can also be used with an empty Output List and no Values property. In that case it will return a single row that has no columns.
This empty row can be used to generate a placeholder that other operators (often Compute Scalar) can then store values in. It can also simply be used to ensure that other operators actually receive a row of data, since that is for most operators the only way to get them to do any actual work.
The example in this blog post combined both: the Nested Loops operator needed a row from its outer input to make it start reading from its inner input, and then the data returned from that inner input was added to that empty row and passed to other operators to do the rest of the work.
I am still open for suggestions on topics to cover in this series. If you see an unusual and interesting pattern in an execution plan, let me know so I can consider using it in a future post.
If no suggestions come in, then episode 8 will probably focus on recursion.
This is the sixth post in the plansplaining series. Each of these blog posts focuses on a sample execution plan that exposes an uncommon and interesting pattern, and details exactly how that plan works.
In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in the pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you did not read part 1 already, I suggest you start there.
In this post I will take a look at a simple usage of the OVER clause with a not-so-simple execution plan. This particular usage of OVER has been available since SQL Server 2005. It allows us to mix detail and aggregated information in a single query. And I’ll have to admit that this particular execution plan pattern had me baffled for more than a year before I finally understood exactly how it works.
For this example I once more use the AdventureWorks sample database. I tested this on SQL Server 2017, but I have seen the same execution plan pattern in all versions since at least SQL Server 2012.
The query above produces a very simple report, showing salespersons and their Sales YTD figure, compared to the total Sales YTD for all salespersons in their territory. The latter figure is computed by the OVER clause, as documented here.
When SQL Server 2005 was released, everyone was excited about the sheer simplicity of this solution, which previously required a correlated subquery. But a simple query does not necessarily equate to a simple execution plan, as you can see below. (As in the previous posts, I have added the node ID of each operator in the figure for easy reference in the rest of the text).
This appears to be an outrageously complex execution plan for such a simple query. The query reads from a single table, with no filters, no joins, and no ORDER BY. And yet the execution plan uses two join operators and a sort, plus a total of three Table Spool operators.
Time to jump in at the deep end and see what is going on in this execution plan!
Let’s start at the very beginning
… And that beginning is in this case the far right of the top branch of the execution plan. Obviously, just as every other execution plan, this plan really starts executing at the top left. The SELECT pseudo-operator calls Nested Loops #0 requesting a row. Nested Loops then requests a row from Table Spool #1 and this continues until the far end of the upper branch: Clustered Index Scan #4. But this is where the real action starts, as this operator reads a row from the SalesPerson table and returns it. The Ordered property is set to False because the optimizer does not care about order. No surprise, since the rows are immediately fed into a Sort operator (#3).
After receiving this first row, Sort #3 will store it in memory and then immediately call Clustered Index Scan #4 again. This repeats until the entire table has been read, delivered to Sort #3, and stored in memory. (I am not going into the details of spills at this time). Once all data is in memory, the operator sorts the rows.
The properties of Sort #3 show that this sorting is done by TerritoryID. It is not a far stretch to assume that this is done to prepare for calculating the total SalesYTD for each territory, as specified in the PARTITION specification of the OVER clause. However, at this point this is an educated guess. I need to keep in mind while inspecting the rest of the plan that I might be wrong.
All this work, fetching all 17 rows and sorting them by territory, has to be done first. Once done, Sort #3 can finally return the first row to Segment #2. That’s why Sort is considered a blocking operator. If the input of the Sort is a huge tree of joins between large tables that takes 5 minutes to evaluate, the blocking behavior means that 5 minutes will pass before the first row is returned from the Sort. Later requests require less work: once the sorted results are in memory, each subsequent call can immediately be returned.
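The blocking behavior is easy to demonstrate in a minimal Python sketch (my own model of the pull-based call pattern, not SQL Server internals):

```python
def sort_operator(child_rows, key):
    # Sort is blocking: the entire input must be consumed (and sorted)
    # before the first row can be returned. Every later call simply
    # returns the next row from the in-memory sorted buffer.
    buffered = sorted(child_rows, key=key)
    yield from buffered

# Track how many rows the child has produced, to expose the blocking.
consumed = []
def clustered_index_scan():
    for territory_id in [4, 1, 2]:
        consumed.append(territory_id)
        yield {"TerritoryID": territory_id}

sort = sort_operator(clustered_index_scan(), key=lambda r: r["TerritoryID"])
first = next(sort)
print(first["TerritoryID"], len(consumed))
# -> 1 3  (all three child rows were read before the first row came back)
```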
The sorted rows are delivered to Segment #2. I recently blogged about this operator so I will not spend a lot of words on it here. Its property list shows that it is adding a segment column Segment1002, which marks every row where the TerritoryID is not equal to that of the previous row. This, too, appears to be related to the PARTITION specification in the OVER clause (but the same caveat applies).
Table Spool #1
So far, I have not covered anything special or unusual. It may not yet be perfectly clear why this work is being done, but it’s not hard to see what these operators do.
The actual fun starts at Table Spool #1. We have seen a Table Spool before but I need to provide a bit extra context here. According to the available documentation and other resources, Table Spool has two logical operations: Lazy Spool and Eager Spool.
An Eager Spool operation is blocking, just as the Sort we saw before. It first consumes its entire input and stores it (in a worktable, unlike Sort which uses memory). After that, it returns those rows, one by one, to its parent operator. The operator can later be initialized again, with the request to produce the same set of rows (“rewind”). In that case it will produce them by reading from the worktable. It can also be initialized again as a rebind, meaning that the worktable is emptied, and it again requests rows from its child, stores them, and returns them after processing the entire input.
A Lazy Spool has the same basic functionality of storing the rows it reads into a worktable, from which it can then later return the same rows again. Unlike an Eager Spool though, a Lazy Spool is not blocking. It will not try to fill the entire worktable first. It simply requests a row, stores it, and returns it; then repeats that for the next row. The rebind and rewind functionality are exactly the same.
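Here is a hypothetical Python sketch of the two behaviors (my own model of these pull-based operators, not actual SQL Server internals):

```python
def eager_spool(child_rows, worktable):
    # Eager Spool: blocking. First consume the ENTIRE input and store
    # it in the worktable...
    worktable.clear()
    worktable.extend(child_rows)
    # ...then return the stored rows one by one. A rewind would replay
    # this same loop from the worktable without touching the child.
    yield from list(worktable)

def lazy_spool(child_rows, worktable):
    # Lazy Spool: non-blocking. Request a row, store it, return it;
    # then repeat for the next row.
    worktable.clear()
    for row in child_rows:
        worktable.append(row)
        yield row

wt = []
print(list(eager_spool(iter([1, 2, 3]), wt)))  # -> [1, 2, 3]
print(list(lazy_spool(iter([1, 2, 3]), wt)))   # -> [1, 2, 3]
# Same rows either way; the difference is only WHEN the child is called.
```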
This Table Spool is marked as a Lazy Spool, so it is non-blocking. The properties show a single execution, which is a rebind (which is always the case for the first execution). There are no rewinds, so you might wonder why the optimizer adds the extra work of storing all rows in a worktable when they are not processed a second time. We’ll see that later.
The missing link
However, the actual catch is that the described Lazy Spool behavior does not actually explain how this execution plan works. One essential piece of the puzzle is missing. It took me well over a year (and for the record, I was not investigating on a daily basis!) before I finally realized what I missed. The key clue for this was in the rowcounts, as shown in the screenshot below (edited for readability).
A Table Spool, working in Lazy Spool mode, and executed exactly once, would normally always return exactly as many rows as it receives. That is not the case here. It reads 17 rows, which corresponds to the number of rows in the input data. However, it returns just 12 rows. If you look at the results of the query, you might notice that this number is equal to the number of distinct TerritoryID values plus one.
After looking at multiple other execution plans with this pattern, it is clear that the number of rows going into the Table Spool is always equal to the number of rows in the input data, and the number of rows it returns is always one more than the number of distinct values of the PARTITION BY column (or columns). That can’t be a coincidence!
At this point I realized that the only possible explanation can be that the Table Spool operator is segment aware. As explained in an earlier post, an operator that is segment aware changes behavior if it is bound to a segment column. In plans with multiple Segment operators it can be a challenge to figure out which segment column it is bound to because this is not in any way represented in the execution plan. But in this simple case there is only one option. The Table Spool operator can only be tied to the segment column produced by Segment #2.
When tied to a segment column, Table Spool works in a way that falls somewhere in between the Lazy Spool and Eager Spool behaviors. When called, it repeatedly calls its child node to get a row, similar to an Eager Spool. However, instead of processing the entire input, it only processes a single segment. After it receives (and stores in its worktable) all rows for a segment, it then returns just a single row to its parent. This single row represents the entire segment.
So when execution of this query starts and Table Spool #1 is called the first time, it calls Segment #2 to grab a row and store it, then repeats that until a new segment starts. We know from the results that the first segment represents all rows with TerritoryID NULL, of which there are three. Those three rows are read and stored, and then a single row to represent the segment for TerritoryID NULL is returned to Nested Loops #0. That operator then uses its inner (lower) input to produce the full set of three rows for the segment (as will be described below). After that Nested Loops #0 once more calls Table Spool #1, which clears the worktable and then repeats the process for TerritoryID 1.
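To make this more tangible, here is a hypothetical Python sketch of the segment-aware behavior. This is my own model, not SQL Server code: the segment column produced by Segment #2 is represented as a boolean flag, the choice of which stored row represents the segment is an assumption (the plan does not reveal it), and the mysterious extra row is not modeled here.

```python
def segment_spool(segmented_rows, worktable):
    # Segment-aware Table Spool (hypothetical model). Input rows are
    # (is_new_segment, data) pairs, as produced by a Segment operator.
    # All rows of one segment are stored in the worktable; one row is
    # then returned to represent that segment. The worktable is cleared
    # before the next segment is stored.
    worktable.clear()
    for is_new_segment, data in segmented_rows:
        if is_new_segment and worktable:
            yield worktable[-1]   # represent the segment just completed
            worktable.clear()     # rebind: empty the worktable
        worktable.append(data)
    if worktable:
        yield worktable[-1]       # represent the final segment

wt = []
segmented = [(True,  {"TerritoryID": None, "SalesYTD": 50}),
             (False, {"TerritoryID": None, "SalesYTD": 60}),
             (False, {"TerritoryID": None, "SalesYTD": 70}),
             (True,  {"TerritoryID": 1,    "SalesYTD": 80})]
for segment_row in segment_spool(iter(segmented), wt):
    # While Nested Loops #0 runs its inner input, the worktable holds
    # exactly the rows of the current segment.
    print(segment_row["TerritoryID"], [r["SalesYTD"] for r in wt])
# -> None [50, 60, 70]
# -> 1 [80]
```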
Nested Loops #0
So far we have established that the outer (top) input of Nested Loops #0 produces one row for each segment/territory. This implies that the inner (lower) input executes once for each segment. Since this operator returns its data directly to the SELECT pseudo-operator, it has to return the final result set. With one row per segment coming from the outer input, this means that the full result set (except, perhaps, the TerritoryID column) has to come from that inner input.
Before diving into the details of that inner input, I want to point out one oddity of this Nested Loops operator. Normally, a Nested Loops operator will have either an Outer References property or a Predicate property. This one has neither.
An Outer References property lists the columns that, coming from the outer input, are pushed into the inner input. Every time these values change, the logic in the inner input ensures that a different set of rows is returned: only the rows that are correct matches for these columns. Because the inner input is custom-tailored, the operator can assume that every row returned is a match; no Predicate property is needed.
When no Outer References property is present, nothing from the outer input is pushed into the inner input. Except for concurrency issues (and one specific case that I’ll address shortly), the same rows will be returned over and over. In this case, the Predicate property is used so the Nested Loops operator can distinguish matching from non-matching rows.
In this case, Outer References and Predicate are both missing. This is normally only seen for a cross join. However, I have looked at the output and I am pretty sure that it’s not eleven identical copies of any data. This execution plan is a very specific case where, without Outer References, each execution of the inner input still returns different data. This is accomplished by the segment-aware operation of Table Spool #1 in combination with the two Table Spool operators in the lower input.
Inner input of Nested Loops #0
To understand the inner (bottom) input of Nested Loops #0, I will track the first execution in detail. This is for the first segment, TerritoryID NULL, consisting of three rows.
When the inner input for this segment starts, Nested Loops #5 is called. This operator then calls its outer input. Compute Scalar #6 calls Stream Aggregate #7, which in turn calls Table Spool #8.
Table Spool #8
In the description of the Table Spool operator above I state that it requests data from its child nodes, stores it in a worktable, and returns this data. Possibly multiple times. But Table Spool #8 does not even have a child operator. So where does this operator get its data from?
This is in fact yet another way that Table Spool can operate. There is, as far as I know, no official name for this. I call it a Table Spool operating in “consumer-only” mode, because it consumes data from a worktable that is produced by another Table Spool. You can also see this in the properties. The presence of a Primary Node ID property indicates that this is a consumer-spool. The value of this property indicates which spool’s worktable it uses. In this case the value is 1, so Table Spool #8 returns data from the worktable that is created by Table Spool #1.
We saw earlier that, for the first segment, Table Spool #1 contains three rows: the three rows in the input table that have TerritoryID NULL. When Table Spool #8 is called it returns the first of these rows. When called again it returns the second, and so on. On the fourth call it returns an end-of-data signal.
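A consumer-only spool is then trivial to sketch (again my own hypothetical model): it has no child, only a reference to the primary spool’s worktable.

```python
def consumer_spool(worktable):
    # Consumer-only Table Spool: no child operator. It simply returns
    # whatever rows are currently in the primary spool's worktable.
    yield from list(worktable)

# The worktable as filled by Table Spool #1 for the first segment:
primary_worktable = [{"SalesYTD": 50}, {"SalesYTD": 60}, {"SalesYTD": 70}]

# Both Table Spool #8 and Table Spool #9 can read it independently.
print([r["SalesYTD"] for r in consumer_spool(primary_worktable)])  # -> [50, 60, 70]
print([r["SalesYTD"] for r in consumer_spool(primary_worktable)])  # -> [50, 60, 70]
```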
Stream Aggregate #7
Table Spool #8 returns its rows to Stream Aggregate #7. In this execution plan, this Stream Aggregate operator has no Group By column. This means that it produces a scalar aggregate, a single row with aggregated data for the entire set. The Defined Values and Output List properties expose that it computes and returns the number of rows (3) and the sum of the SalesYTD values in these three rows.
The function of this operator in this area of the execution plan is very similar to the Stream Aggregate I discussed in the first plansplaining post, so I will not go into details here.
Compute Scalar #6
The single row with aggregated values for the current segment is then passed to Compute Scalar #6. This operator is not very interesting either. The only interesting property is Defined Values. This shows that the only reason this operator exists is to check how many rows are in the current segment. If this number is zero, the result of the sum of all SalesYTD values is changed from whatever the Stream Aggregate operator returns for an empty set to NULL. This is defined as the correct result of a SUM function on an empty set.
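The combination of Stream Aggregate #7 and Compute Scalar #6 can be sketched in Python as follows (a hypothetical model; the column and property names are my own):

```python
def scalar_aggregate(rows):
    # Scalar Stream Aggregate: no Group By, so it ALWAYS returns exactly
    # one row, holding the row count and the sum over its entire input.
    count, total = 0, 0
    for row in rows:
        count += 1
        total += row["SalesYTD"]
    yield {"rowcount": count, "sum": total}

def null_if_empty(aggregated_rows):
    # Compute Scalar #6: if the segment had zero rows, replace the sum
    # (which the aggregate computed as 0) with NULL (None), the correct
    # result of a SUM over an empty set.
    for row in aggregated_rows:
        row["sum"] = row["sum"] if row["rowcount"] > 0 else None
        yield row

print(list(null_if_empty(scalar_aggregate(iter([{"SalesYTD": 50}, {"SalesYTD": 60}])))))
# -> [{'rowcount': 2, 'sum': 110}]
print(list(null_if_empty(scalar_aggregate(iter([])))))
# -> [{'rowcount': 0, 'sum': None}]
```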
Nested Loops #5
For this first segment (and for every other segment, for this is how Stream Aggregate works when doing scalar aggregation), the outer input of Nested Loops #5 returns just a single row. This row contains the total of all SalesYTD values for the current segment (in this case for the three rows with TerritoryID NULL). This single row then drives the inner input of Nested Loops #5, which will add back the original detail data (as I’ll show below).
There is an intriguing difference between the two Nested Loops operators in this plan. Both need to join their outer input to each of the rows from the inner input. In both cases the Table Spool magic ensures that even though there is no Outer References property, the inner input actually changes to the correct set of rows on each next execution. So both these Nested Loops operators need neither Outer References nor Predicate. And yet, they are not the same.
If you look at the properties of Nested Loops #5, you will see that it actually does have a Predicate property. However, the value of this property is a simple constant value: 1. So where Nested Loops #0 has no Predicate to ensure that each row is a match, Nested Loops #5 uses a Predicate with a constant value that converts to the logical value true to achieve the same effect.
Please do not ask me why the optimizer chooses to use these two different techniques to achieve the same effect.
Table Spool #9
The last operator in the execution plan is Table Spool #9. Looking at its properties (not included in this blog post as it would be very repetitive), you can see that this operator also leeches on the hard work done by Table Spool #1. So this operator, again, reads and returns (upon three consecutive calls) the three rows for TerritoryID NULL that were originally read from the base table and stored in a worktable by Table Spool #1.
The Output List property of this Table Spool shows that it returns the three columns we need from the base table in the final output: BusinessEntityID, TerritoryID, and SalesYTD.
We already saw that the outer input of Nested Loops #5 produces a single row with the total sales for TerritoryID NULL. The inner input then, using this consumer Table Spool, adds the three original rows to that. This in effect produces the first three rows of the result set.
Nested Loops #5 returns these three rows (as always, one at a time) to Nested Loops #0. That operator returns them to SELECT so they can be passed to the calling client.
The next iteration
So far, Nested Loops #0 requested and received a single row from its outer input. After a lot of processing that row was returned; we have seen that this row represents the segment for TerritoryID NULL. Nested Loops #0 then requests rows from the inner input. The first three calls produced rows that are part of the result set. The fourth call results in an end-of-data signal. At this point Nested Loops returns to the outer input to request the next row.
I already described how the entire outer input works. The request bubbles down to the Sort operator. (The Clustered Index Scan operator is not used this time. Sort already has all its data in memory so it simply returns the next row when called). All rows for the next segment, for TerritoryID 1, will be stored in the worktable (after clearing out the previous contents) and a single row to represent this segment is returned to Nested Loops #0.
When that operator then once more requests rows from its inner input, the same operators will do the same thing as before. However, because the content of the spooled worktable has been changed in between, those operators now produce different results. This is how, even without an Outer References property, the Nested Loops operators in this execution plan manage to receive a different set of data for every execution of the inner input.
When it comes to execution plans, I am a sucker for details. (Cue sounds of faked surprise from the readers).
The description above sounds very logical. But does it really match with all the details you can glean from the execution plan? Here is the bad news: it doesn’t. Upon very critical inspection, two issues stand out. One is related to the Output List property of the Table Spool operators. The other is related to the Actual Number of Rows returned by Table Spool #1.
Let’s investigate these loose ends and see if there is a logical explanation.
The screenshot on the left shows the Output List property of Table Spool #1. The TerritoryID column is expected: the row returned represents a segment, which equates to a single TerritoryID value. The other columns are surprising. Given that there are for example three rows for the NULL segment, which of the three rows are these values taken from? And why are the other rows not used?
Look at the rest of the plan, and you’ll see that Nested Loops #0 receives identically named columns from its other input. It then returns columns of these names to its parent. It is in my experience very rare to see duplicated column names in execution plans that do not actually refer to the exact same data, but this is an example where this does happen. I must admit that I do not know how Nested Loops #0 picks which of the two BusinessEntityID, TerritoryID, and SalesYTD columns it returns. But I do know that, at least in this case, it always picks the inner input for at least the BusinessEntityID and SalesYTD columns. That is the only way this execution plan makes sense.
This implies that the columns that Table Spool #1 returns to Nested Loops #0 are effectively ignored. Actually returning them appears to be a waste. Why is this done?
Looking at the other Table Spool operators, they all have the exact same Output List. Totally expected for Table Spool #9: this is the spool that adds back the original rows, so all these columns are in fact needed. For Table Spool #8, though, only SalesYTD would be needed; none of the other columns are used by or returned from its parent operator (Stream Aggregate #7). Again, an (admittedly small) waste of work to return more columns than needed.
My guess is that this is just a technical limitation of the Table Spool operator. There is no property to define which columns to store in the worktable. It makes sense to assume that the Output List property does double duty for this operator: it defines the columns returned as well as the columns stored in the worktable. That would explain why Table Spool #1 has no other option but to return all three columns, even though they are not used. For Table Spool #8, which consumes the worktable produced elsewhere, a further speculation is needed. My guess is that this is a further limitation, or rather the reverse of the previous limitation: The Output List always has to be equal to the set of columns stored in the worktable.
The extra row
The other loose end is in the number of rows returned by Table Spool #1. In the above explanation, I describe the segment-aware operation of Table Spool as storing the entire segment in its worktable and then returning just a single row to represent that segment. However, there are eleven distinct values of TerritoryID in the data (each of the numbers 1 to 10, plus NULL), and the actual number of rows returned by Table Spool #1 is twelve. Where does that extra row come from?
I must admit that I have not (yet?) figured this out completely. At this point I have two possible theories.
The first theory is that this is just a bug. Table Spool #1 detects that a segment is complete when it reads a row with the segment column set. However, this is actually already the first row of the next segment. This row cannot go into the worktable yet (that would cause incorrect results from the rest of the query), so it has to temporarily store that row in memory. It then first returns a row for the previous segment. When called again, it empties the worktable, stores the row from memory in it, and then proceeds to request the next row.
However, the very first row it reads also has the segment column set. In this case there is no previous segment. Properly coded, this first segment needs to be handled as a special case. But what if this special case was forgotten? It would return a row to represent the (non-existing) previous segment – and no, I do not know what values would be in the columns it returns. That row would then drive an execution of the inner input of Nested Loops #0. Since the worktable is empty, Table Spool #8 returns nothing; this then results in Stream Aggregate #7 and Compute Scalar #6 also returning nothing, so Nested Loops #5 would not even bother to call Table Spool #9 anymore.
All the numbers in the execution plan (both the Actual Number of Rows and the Number of Executions properties of each operator) match up with this explanation. There is but one oddity, though. A scalar Stream Aggregate that reads no rows would normally still produce a single row as its output (just do a SELECT COUNT(*) on an empty table and look at the execution plan to verify). This one doesn’t. The Number of Executions is 12, but the Actual Number of Rows is 11, so it returns nothing in this case. There is nothing in the execution plan, not even in the full XML, that tells the operator to behave differently than a “normal” scalar Stream Aggregate. This is the part I do not understand. If anyone does know why this happens, please post a comment and share your knowledge (or even your theory).
The second possible explanation for the 12th row coming from Table Spool #1 is that this is not a bug, but by design. If that is the case, then a Table Spool running on segmented input apparently is designed to always return one extra row, before returning the rows that represent the segments. That design would only make sense if there are specific use cases where this row is needed and used to perform some specific work. I have not ever seen this in any execution plan I studied, so I have no way to even speculate on the reason for this. If you, reader, have any speculation here, or if you have ever seen an execution plan that gives you reason to believe that extra row is used, please let me know in the comments section!
The rest of the above explanation still applies. The extra row is returned to Nested Loops #0, which invokes its inner input; that inner input ends up returning nothing because Table Spool #8 cannot produce rows from an empty worktable.
Seeing it in action
The video below (double-click to enlarge) visualizes all of the above. Don’t pay attention to the two extra Compute Scalar operators; these are used for the computations I added in order to slow down the processing sufficiently to create this animation.
If you watch the animation, you can see most of what I describe above in action. The first phase is when Clustered Index Scan #4 reads all its rows and returns them to Sort #3. Once done the sorting happens (not slowed down and not visible in the animation) and then the second phase starts.
Due to where I placed the delay, you now miss some of the action as some numbers jump up immediately: 4 rows returned from Sort #3 (all rows for the first segment plus the fourth row, the start of the second segment that tells Table Spool #1 that the first segment is complete), 2 rows returned by Table Spool #1 (the first for the “bug or unknown feature” empty segment, and the first real segment), and you see stuff happening in the lower part of the execution plan. This is in fact already the second execution. Because the first execution processes no rows, it is not affected by the delay and you cannot see it happening. We can infer that it has happened by looking at all the numbers.
After that you see the same process repeat a few more times: a few rows (for a single segment) flowing from Sort #3 through Segment #2 to Table Spool #1, one row returned from Table Spool #1 to Nested Loops #0, and then the inner input of that Nested Loops reading the rows for that segment from Table Spool #8, and then again from Table Spool #9.
This was an extremely long post, even by my standards. If you are here, you either skipped parts (Boo! Hiss!); or you actually read it all (Applause!). Well done!
The main takeaway of this post is that Table Spool is a segment aware operator. When bound to a segment column, it eagerly builds a worktable for a single segment, returns one row to represent that segment, then clears out the worktable and builds it again for the next segment.
In this specific case, other Table Spool operators on the inner side of a Nested Loops join were reading data from that same worktable. Even though the Nested Loops join did not have an Outer References property, so no values were pushed down to change the results returned by the inner input, the ever-changing data in the worktable resulted in the same effect.
Next month will be shorter, I promise. In episode 7, I will look at an execution plan that uses what perhaps is one of the simplest operators in existence, Constant Scan, in a rather creative pattern.
But let me repeat my standard closing paragraph: I do not want this series to be about only my interesting plans. Reader input is greatly encouraged! If you have ever run into a pattern in an execution plan that appeared to make no sense until you dug into the gory details, or even into a pattern that still makes no sense to you, let me know and I will consider it for inclusion in a future post.