T-SQL Tuesday #99: Dealer’s Choice

Back in 2009, Adam Machanic (b|t) started an initiative that is still going strong: T-SQL Tuesday. This is a monthly blog party. Every second Tuesday of each month, lots of people blog about the same topic, selected by the host for that month. And now, in the 99th installment, I decided to finally join in!

For February 2018, our host Aaron Bertrand (b|t) has elected to give bloggers a choice: either share some of the non-tech things that we are passionate about; or go all Aaron-style and tell the world what we consider the worst T-SQL habits that we really want to kick where it hurts. I actually have a lot of passions outside of SQL Server. But there are even more T-SQL habits that I would love to kick (the habit, but also the person responsible). So many that this blog post would become a full-blown novella. So I’ll open door #1 and share some of my non-SQL passions.

My family

My first and foremost passion is my family. Last summer my wife and I celebrated our 25th anniversary in Paris, the city of love, the city where we had our first joint holiday, and the city where we both have a lot of good memories. The picture here shows one of the many museums in Paris. In our opinion, it is actually the best museum Paris has to offer. If you ever visit Paris, and unless you truly abhor art, I recommend including Musée d’Orsay in your travel plans.

I’m also lucky to be the father of two healthy children. My daughter is now 22 years old. She has left our home and now lives a 30-minute drive from our place, together with her boyfriend. She is currently working as an intern at an elementary school. One day she will be working as either an elementary school teacher or a teaching assistant. Each and every one of us has a few teachers that, no matter how old we are, we always think back on with fond memories. A teacher that helped us clear a hurdle, that gave us the confidence that we can achieve even the hardest task, or that simply made our life at school better (or less miserable). I am very sure that my daughter will one day be that memory for lots of people!

My son, about to turn 21, dreams of a career on stage, as a professional actor in musical theatre. This started when he was scouted to be part of the child cast for the Dutch production of Joseph and the Amazing Technicolor Dreamcoat. After that experience, he knew how he wanted his future to look. And he’s been working incredibly hard towards this goal ever since. His voice change came late. Until the age of 16, he had a very clear, very high boy soprano sound, and was able to reach notes that hardly any other child could. I always expected him to become a tenor, but to our collective surprise he came out of the voice change as a bass-baritone. When he can sing in his own register, his voice is one of the warmest I have ever heard. But he is not only a good singer, he is also an amazing dancer and he is getting better at acting by the day. I know that I, as a father, am probably not the most impartial person. But I think he definitely has what it takes to become a professional musical performer. However, with thousands of aspiring candidates and just a few new openings each year, he needs more than “just” a lot of talent, a strong drive, and a willingness to work incredibly hard. He will also need luck. As a father, I can only hope that he’ll one day be able to achieve the goals he is working towards…

My hobbies

When you sit at a desk as many hours per day as I do (and when you enjoy good food as much as I do), exercise is important. And yes, I know some people are now snickering. I know I do not look like someone who exercises. Yet I do. Imagine how I would look if I didn’t…

For a long time I struggled to get enough exercise. For a few years I was subscribed to a health and fitness club and vowed to myself to pay a visit every week. But doing exercise on those machines is boring, and I didn’t have the mental strength to force myself to go each week. Until I found jazz ballet. I have hesitated to mention this hobby here because I know some people will snicker even more. But then I decided that those reactions are more telling of those people than they are of me. For me, one hour of jazz ballet burns at least as many calories as two hours of exercise in the fitness club. The complex choreographies and the music are good motivators to never miss a lesson. And having to think about all the different movements of all the extremities has the beneficial side effect of totally quiescing all other activities in my brain – that one hour of dancing each week is probably the only hour when my brain is not thinking about SQL Server!

Another unusual hobby I have is playing snooker. I fell in love with this sport when I was struck with the flu. On the couch, I found nothing good on television, but was intrigued by a broadcast of a snooker match on the sports channel. Most English readers will know this sport; for the rest of the world: snooker is like pool billiards, but with a larger table, smaller pockets, smaller (and more!) balls, and because it is an English sport it comes with needlessly convoluted rules and scoring. Just from watching while I recovered, I started to appreciate the strategic decision making required for each shot. When a player cannot go for a score, they have lots of options to prevent their opponent from scoring, or even to try to force a foul. After I got better I continued to watch games when they were broadcast, and a few years back I decided to join a club and try playing myself. I soon found that it is a lot harder to play than I thought. (I should have known: one of the signs of someone who is really good at something is that they make it appear easy. And the players I watched were the best in their sport, so yeah.) But even though I know I am a way-below-mediocre player, I still enjoy playing the game, laughing about my missed shots, and rejoicing when a ball does actually end up where I want it to be!

The final pastime that I want to share with the world is a computer game: Hearthstone. It’s a card game. Or rather, a collectible card game. It is very clear that the game took its inspiration from actual physical collectible card games, such as Magic: The Gathering and Pokémon. But the creators took heavy advantage of this being a computer game: many cards have card texts that would be impractical or even impossible in a physical card game, and many card effects are RNG-based. The result is a game with a lot of strategic elements, but also a lot of random effects that can completely trash your well thought out strategy. I like the fun element this adds. (Usually.)

Hearthstone is free to play, but you start with a very limited card set. You can use money to buy card packs – just as with traditional collectible card games. However, you can also “earn” card packs by just playing. Most people do not like this because it forces them to play with mediocre or even bad decks, resulting in lower ranks. But I have found extra pleasure in making the most of the cards I do have, and so I have expressly set myself the challenge to get as far as possible without ever paying money for card packs.

For people who know the ranking system and want to get a feel of my level of play: in ranked I usually end each month between ranks 10 and 15, though I have once actually managed to hit rank 5. This is partly because I choose not to pay for more cards, but also for a large part because I do not play ranked very much. After encountering the same three decks for many games in a row, I get bored. I much more enjoy playing the arena. The drafting method used in arena means I have the same chances of getting good or bad cards as all other players, and I will never encounter lots of people all playing the same deck! Because I still am a data freak, I have tracked my arena runs since August 2015, a few months after I discovered the game. Over the course of 418 arena runs, my average result is 4.24 wins per run. Since the average result for all players across the world is about 3 wins per run (2.991577, to be very exact), I am proud to say that I am an above average arena player.

Back to work!

This has become an unusual blog post. I usually don’t share much about my personal life. But in this case I can blame Aaron and deny all responsibility!

It was fun to write in a different tone for a while. But now I really want to open up Management Studio and do some tuning!

Thanks, Aaron!

Plansplaining, part 2. Why scan and spool instead of seek?

This is the second post in the plansplaining series. Each of these blog posts focuses on a sample execution plan and details exactly how that plan works.

In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you have not read part 1 yet, I suggest you start there.

Sample query

The query in this post is again based on the standard AdventureWorks sample database. But it needs an additional index, so let’s create that first:
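
(The original index definition is not reproduced here. Based on the plans discussed in the rest of this post, an index along these lines fits: keyed on TerritoryID, with TaxAmt included. The name is mine.)

CREATE NONCLUSTERED INDEX ix_SalesOrderHeader_TerritoryID
ON Sales.SalesOrderHeader (TerritoryID)
INCLUDE (TaxAmt);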

With the index in place, let’s now look at the sample query:
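
(The original listing is not reproduced here either; the reconstruction below matches the plan shapes discussed in this post: for every salesperson, a correlated subquery totals the TaxAmt of all orders in that salesperson’s territory. The aliases and the output column name are mine.)

SELECT     sp.BusinessEntityID,
           (SELECT SUM(soh.TaxAmt)
            FROM   Sales.SalesOrderHeader AS soh
            WHERE  soh.TerritoryID = sp.TerritoryID) AS TotalTax
FROM       Sales.SalesPerson AS sp;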

The index appears to be custom tailored for this query. Nobody would blame you for expecting an execution plan that reads the SalesPerson table, then uses a Nested Loops join into an Index Seek on the new index, plus of course some aggregation somewhere. However, you would be wrong. In reality, the execution plan is as shown below:

The plan does indeed read all rows from SalesPerson, using a Clustered Index Scan, and it does feed those rows into a Nested Loops join. But the inner (lower) input of the Nested Loops uses an Index Scan instead of an Index Seek, and adds a Table Spool. Why did the optimizer choose this plan? And perhaps even more importantly, how does it even give us the data we need?

Let’s dive in

Let’s dive in, and let’s focus directly on the part that matters: the inner input of the Nested Loops join. After Nested Loops receives the first row from Clustered Index Scan, it initializes the lower branch and requests the first row.

The first iteration

The Index Scan on the right-hand side of that input has its Ordered property set to True, so in this case we know for sure that data will be returned ordered by SalesTerritoryID: first all rows with SalesTerritoryID 1, then those with 2, etc. But of course, like all operators, it only returns data when asked to. In this case it is a Stream Aggregate that calls the Index Scan. The Stream Aggregate will continue to call Index Scan until the value in its Group By column (TerritoryID) changes. So in this case, it will read all rows with SalesTerritoryID 1, then receive the first row with SalesTerritoryID 2, and then stop calling Index Scan and return a single row with SalesTerritoryID 1 and a second column (Expr1002) that holds the sum of all TaxAmt values. (This can be seen from the Defined Values property.)

Table Spool is working as a Lazy Spool. Upon receiving this first row, it stores it in a worktable, and then immediately returns it to Nested Loops. This operator will verify its predicate to test if this row matches the row from the SalesPerson table. If it doesn’t, then it calls the Table Spool again, requesting the next row. If it does, it returns the combined results and waits until called again, and then requests the next row from Table Spool. So either way, Table Spool is requested for a second row.

When Table Spool regains control, it will once more pass on the request for a row to Stream Aggregate. This operator already received a single row with SalesTerritoryID 2. It now repeatedly invokes Index Scan to get the other rows for this territory, until it sees the next new value (3). Stream Aggregate then returns SalesTerritoryID 2 and its SUM(TaxAmt) to Table Spool, which adds this row to the worktable and then returns it. The worktable now has two rows stored, for the first two sales territories.

The above process repeats until all rows from SalesOrderHeader are read. There are only ten distinct SalesTerritoryID values in this table, so Table Spool returns ten rows (after storing them in the worktable). Upon the eleventh call, Table Spool returns “end of data”. Nested Loops has Left Outer Join as its logical operation, so it checks if at least one match for the row from SalesPerson was found. If not, it now returns that row with a NULL value for Expr1002 and waits to regain control; otherwise it will immediately move on.

The second iteration

When the first iteration of the inner (lower) input is done, Nested Loops returns to the outer (upper) input and reads the second sales person. It then reinitializes the inner input and again calls the Table Spool. However, this second initialization is a rewind. When a Nested Loops operator does not have an Outer References property (as in this case), then by definition the first execution of the inner input is a rebind and all other executions are rewinds.

Table Spool is one of a select group of operators that actually care about the difference between rebinds and rewinds. The description above, for the first execution, is for a rebind. For a rewind, the Table Spool does not call its child operators. Instead, it uses the worktable created previously to return the same set of rows as the previous time.

So for the second salesperson, Table Spool is called, looks at the worktable, and returns the first row stored (the row with SalesTerritoryID 1 and its total TaxAmt value). Then when Nested Loops calls it again it reads and returns the second row from the worktable, and so on until all rows are returned and it once more responds with “end of data”.

Bringing it together

There are a total of seventeen rows in the SalesPerson table. For each of those rows the inner (lower) input is executed. The first execution is a rebind, so the Index Scan and Stream Aggregate actually execute. All of the other executions are rewinds; for these the child operators do not run because Table Spool returns the data from the worktable.

Do not be misled by first appearances. Normally the lower input of a Nested Loops executes multiple times. Normally you do not want to see a scan there, especially not on a table as large as SalesOrderHeader. But in this case, first looks deceive. The only operator that actually executes multiple times is the Table Spool, which uses a worktable to store the output for later reuse, shielding its child operators from having to execute multiple times.

Much of this is directly visible from the execution plan. On the Table Spool operator, you can check the Actual Rebinds and Actual Rewinds to get confirmation that there was 1 rebind and 16 rewinds. The Number of Executions on the Table Spool is 17 (as expected, to match the 17 rows read into the Nested Loops operator), but Stream Aggregate and Index Scan both have their Number of Executions set to 1: they executed for the single rebind, but were not used for the rewinds.

But what if I want a seek?

The mystery is solved. The execution plan is dissected, and I now know exactly how the operators in this plan cooperate to produce the intended results. But I still have a feeling that my original idea, with an index seek in the lower input to a Nested Loops join, would also have been valid. To help me understand why the optimizer disagrees, I can do something that I would not do in production: add a hint to force my ideas upon the optimizer.

I added a FORCESEEK hint to ensure that the optimizer has no other option but to give me an execution plan that uses an Index Seek on the SalesOrderHeader table. And yet, the execution plan is not what I was expecting to see. Sure enough, the Index Seek is there. But there is still a spool operator. This time it’s not a Table Spool but an Index Spool. But it’s still there!
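
(For reference, this is how such a hint could be applied to the reconstructed query, as a table hint on the table reference inside the subquery.)

SELECT     sp.BusinessEntityID,
           (SELECT SUM(soh.TaxAmt)
            FROM   Sales.SalesOrderHeader AS soh WITH (FORCESEEK)
            WHERE  soh.TerritoryID = sp.TerritoryID) AS TotalTax
FROM       Sales.SalesPerson AS sp;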

So let’s once more dive into the execution plan and see what happens during execution!

The first iteration

One of the differences between the first plan and the current version is that the Nested Loops operator has replaced its Predicate property with an Outer References property. Just as in the execution plan I looked at in part 1, the Outer References column from the top input (in this case SalesTerritoryID) is used in (“pushed into”) the bottom input so that it only returns matching rows.

We can see this in the Index Seek. The new index is used to retrieve only the rows we need for the current sales territory. Because of that, Stream Aggregate has lost its Group By property; it now gets only rows for a single sales territory on each run so it can do global aggregation and just return a single row with the sum of TaxAmt in all rows it received.

The final part of the bottom input is the spool, which has changed from a Table Spool to an Index Spool. An Index Spool is very similar to a Table Spool, with one addition. A Table Spool just stores data in a worktable, which has a clustered index on a hidden column that represents the insertion order (used to ensure that rows are returned in the same order when they are fetched from the worktable on a rewind). An Index Spool adds an additional (nonclustered) index on top of the worktable.

Finding the key of that extra index is not trivial. The property that holds this information is not shown in the graphical execution plan. The only options are to either infer this information from the Seek Predicate property of the Index Spool, or to open the XML representation of the execution plan and look for the RangeColumns node within the SeekPredicateNew element. In most cases, the first method works just fine. In our current example, the worktable will be indexed on TerritoryID.

During the first iteration, the Index Spool receives, stores (and indexes), and returns just a single row, with the total TaxAmt for the sales territory of the first sales person.

The second iteration

Since this plan uses Outer References on the Nested Loops operator, subsequent executions of the lower input can be rebinds or rewinds. They are a rewind when the next salesperson happens to have the same sales territory as the previous one; otherwise they are a rebind.

At this point we need to look at a few more differences between Table Spool and Index Spool. For starters, a Table Spool clears out its worktable and starts building from scratch on a rebind; an Index Spool doesn’t clear the worktable but simply adds the new rows to the data already there. Second, a rewind causes a Table Spool to return all rows stored in the worktable; an Index Spool will only return the rows that match its Seek Predicate property (for which it uses the additional nonclustered index), which are the rows that were added during the last rebind. So in effect, both operators return the same rows as the previous time on a rewind; Table Spool does this because everything older was already thrown out at the last rebind, and Index Spool does this by using the index to read only the most recently added data.

So long story short, every execution of the lower branch will either use the Index Seek to find only matching rows, total up the tax in Stream Aggregate, and then store and return it; or use the index on the worktable to find the previously computed total tax for a sales territory.

The missing piece

One thing you should never do when working with execution plans is jumping to conclusions. The conclusion above is correct, but incomplete, and that can be seen by looking at all the available information in the execution plan.

If you look at the properties of Index Spool, you see that Actual Rebinds is 16, and Actual Rewinds is 1. Looking at the returned query results, that makes sense. The data is not ordered by sales territory; by pure coincidence there happens to be one set of two consecutive rows with the same sales territory, which causes the single rewind.

Based on 1 rewind and 16 rebinds, the expected number of executions for the Stream Aggregate and Index Seek would be 16, right? And yet, the execution plan shows the Number of Executions for these two operators to be 11. Where are the five missing executions?

The answer is in one last difference between Index Spool and Table Spool (or rather, between Index Spool and all other operators that observe rewinds and rebinds). You see, when Index Spool is started as a rebind, it does faithfully increment the rebind counter, but then it checks the data available in its indexed worktable. If the data it needs is already there, it will ignore the instruction to rebind and instead use the nonclustered index to return the correct subset of rows.

In other words, Index Spool counts a rewind when the immediately preceding execution used the same values in the Outer References columns, but acts as a rewind when any of the preceding executions already used the same values. There are eleven different values of SalesTerritoryID in the SalesPerson table (1 to 10, plus NULL). For each of these, the Stream Aggregate and Index Seek only execute the first time the value is seen; when the same value is found again, no matter how many rows were in between, the worktable is used to pull the total tax amount.

Do we really need that spool?

It is now clear what the function of the Index Spool is. Without it, the Index Seek would execute 17 times, and 6 of those executions would be used to retrieve data that was already retrieved before. By storing the single aggregated row for each territory in a worktable, a lot of I/O on the SalesOrderHeader table can be avoided.

If you don’t believe me, you can easily check this. In SQL Server 2016, a new query hint was added: NO_PERFORMANCE_SPOOL. This prevents the optimizer from adding a spool operator in queries to help performance. (It can still use spools for other reasons).
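
(Applied to the reconstructed query, and presumably keeping the FORCESEEK table hint so that the only change from the previous plan is the missing spool, that would look like this.)

SELECT     sp.BusinessEntityID,
           (SELECT SUM(soh.TaxAmt)
            FROM   Sales.SalesOrderHeader AS soh WITH (FORCESEEK)
            WHERE  soh.TerritoryID = sp.TerritoryID) AS TotalTax
FROM       Sales.SalesPerson AS sp
OPTION (NO_PERFORMANCE_SPOOL);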

With this hint added, we finally get the execution plan that I described at the start of this post: a simple Nested Loops into an Index Seek with aggregation:

Because the spool is removed, Stream Aggregate and Index Seek now execute once for each of the 17 rows coming from SalesPerson. Some of the data from SalesOrderHeader is read more than once. If you execute SET STATISTICS IO ON; before running this query, you will see 179 logical reads on this table, as opposed to 115 for the previous query. On the other hand, the 107 logical reads on the worktable for the previous query are now gone.

Faster or slower?

The total number of logical reads for this third query is lower than for the second. But not all logical IOs are created equal. The optimizer appears to consider I/O on a worktable as less expensive than on a normal table. Probably because it expects worktable I/O in tempdb to be served from the buffer pool almost all of the time, whereas I/O on a permanent table has a higher chance of being physical I/O. Also, I have a strong suspicion that the optimizer focuses on the IO cost of reading data but underestimates the cost of writing to a worktable. (This suspicion is based on other plans I looked at, beyond the scope of this post.)

The actual performance difference does not depend on assumptions in the optimizer’s costing model, but on differences in your setup. Is the data currently in the buffer pool or not? How much memory pressure is there on your server? How many disk files are used for tempdb and for the user database, and on what type of storage? Etc.

One way I like to measure the performance of short-running queries like this one is to enable the “Discard results after execution” option in the SSMS Query Options, make sure that the statistics io and statistics time settings as well as the Include Actual Execution Plan option are disabled, enable “Include Client Statistics”, and type GO 1000 (or another suitably high number) after the query to make it run a thousand times. I can then look at the Total execution time in the Client Statistics pane to see how much time it took. (Obviously, this method measures only elapsed time. If your server has severe bottlenecks for e.g. IO or CPU usage, then elapsed time is not the metric you should tune for).
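
(As a rough sketch of that harness, wrapped around the hinted query from above; the SSMS options mentioned are set through the GUI, so they appear only as a comment.)

-- SSMS Query Options: "Discard results after execution" on, actual plan off,
-- statistics io/time off, "Include Client Statistics" on.
SELECT     sp.BusinessEntityID,
           (SELECT SUM(soh.TaxAmt)
            FROM   Sales.SalesOrderHeader AS soh WITH (FORCESEEK)
            WHERE  soh.TerritoryID = sp.TerritoryID) AS TotalTax
FROM       Sales.SalesPerson AS sp;
GO 1000    -- SSMS repeats the preceding batch 1,000 times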

On my laptop, the query with Index Seek and Index Spool takes 3.2 milliseconds on average; the version without Index Spool takes 5.0 milliseconds.

Why scan instead of seek?

We now know why the optimizer adds an Index Spool when we force it to seek the index. But one question remains: why do we even have to use this hint? Why does the optimizer prefer to use an Index Scan?

Part of the answer is in the numbers. In the first plan, the Index Scan executes once, reading the entire table. In the second plan, the Index Seek reads only a subset of the table in each execution, but it executes 11 times and all executions combined still read all the rows. This is in fact a bit more expensive, as each seek starts by traversing the root and intermediate pages. You can check this if you run both queries with SET STATISTICS IO ON enabled. The first query needs 88 logical reads on SalesOrderHeader; the second query requires 115 logical reads.

The Table Spool in the first query builds a single table with 10 rows. The Index Spool also builds a single table (in this case with 11 rows), but with an additional index; that should add some cost. This, too, is visible in the statistics io: 53 logical reads on the worktable for the first query versus 107 on the second.

Finally, the Index Spool returns just a single row to Nested Loops for each iteration, whereas Table Spool stubbornly returns the full set of 10 rows every time. With this size of table, that should not affect the logical IOs (the rows are all on a single page). There will be a bit of extra CPU usage but that will be minimal in comparison to the IO cost. The same goes for the different CPU requirement for Stream Aggregate of 11 global aggregations versus a single aggregation with Group By.

So here we see that the second query uses more IOs on the SalesOrderHeader table, and has a more expensive spool type. There are definitely situations where Index Spool is great. Its ability to retrieve “older” sets can save tons of processing, and its ability to use a nonclustered index to return only matching rows can also be more efficient than a simple Table Spool that returns all rows. But not in this case.

However, when I use the Client Statistics to verify this, I find that the optimizer is using incorrect assumptions. On my laptop, the original, unhinted query takes 3.8 milliseconds on average, slightly slower than the version with the FORCESEEK hint. I do not have a full explanation for this, but I assume that, because there is no contention and no memory pressure at all on my laptop, the additional processing caused by Nested Loops receiving 10 rows for every salesperson and having to test the predicate every time actually outweighs the lower logical IOs.

On a busy server, I would not expect to see the same results. And even if I did, I would probably not trade the extra IO pressure for the small performance gain.

Cleanup

I like to keep my AdventureWorks unchanged. This ensures that demo code posted by others works the same for me, and that my future posts will work as expected for all readers.

I added an index at the start of this post. I can remove that by restoring AdventureWorks from the originally downloaded backup, but in this case it is easier to simply drop the index:
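
(Using the name from the reconstruction at the top of this post.)

DROP INDEX ix_SalesOrderHeader_TerritoryID
ON Sales.SalesOrderHeader;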

Conclusion

We looked at a query where we expected a simple Index Seek, but instead had an execution plan that used an Index Scan, and a Table Spool to save and reuse the results. Because the results were aggregated, repeatedly reading them from the worktable causes far less IO than having an Index Seek that executes multiple times.

Using hints, we were able to force the optimizer to use a seek. Another hint allowed us to get rid of the spool operator. However, performance measurements prove that the spool is actually needed to get better performance.

The original choice of an index scan is not so clear-cut. In terms of logical IOs, the original plan is clearly the winner. However, its actual runtime turns out to be slightly longer than the plan that uses an Index Seek. In my opinion, the performance gain is not enough to justify the additional logical IOs, so I think the optimizer actually made the right choice.

In the next installment in the Plansplaining series, I will look at an execution plan that uses Constant Scan and Nested Loops operators to, apparently, make the same logic run multiple times instead of just once. Want to know why? Just watch this space!


Plansplaining, part 1. The unexpected aggregation and assert

When I look at an execution plan I sometimes right away see how the operators work together. Other times I need to dive into the details and put in effort before I really understand it. And occasionally, I think I understand but then am proven wrong after I start looking at the details. However, understanding all the details of an execution plan is really important when you want to optimize performance. If I see an execution plan where I do not understand the role of each and every operator, I know I do not truly understand how the query is executed. Yet.

This post is the first in a series. In every part, I will look at a query and its execution plan, highlight possibly confusing, misleading or simply non-obvious elements in the plan, and explain why those operators are there and how they interact to return the requested results. It may be a bit long sometimes, it may be a bit dry. But if you put in the effort to read this and try to understand, you will become a better performance tuner.

A simple query

In this post I will be looking at the execution plan for a very simple query (using the standard AdventureWorks sample database):
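
(The query itself is not reproduced here; the reconstruction below matches the plan described in the rest of this post: a correlated subquery that looks up each person’s phone number. The aliases are mine.)

SELECT     p.BusinessEntityID, p.FirstName, p.MiddleName, p.LastName,
           (SELECT pp.PhoneNumber
            FROM   Person.PersonPhone AS pp
            WHERE  pp.BusinessEntityID = p.BusinessEntityID) AS PhoneNumber
FROM       Person.Person AS p;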

The execution plan for this query is shown below. It is from SQL Server 2017, but it is probably the same on all versions of SQL Server. At first sight it doesn’t look bad. I see an index scan for the Person table. Since there is no WHERE clause, that is to be expected. The subquery that grabs the phone number is implemented via a join operator. The logical join type is Left Outer Join, so that a row from the Person table without a match in the PersonPhone table is not dropped from the results.

But if I look a bit further, I also see things that are more surprising. An Assert operator is very normal in a plan for a modification (insert, update, delete), where it is used for constraint checking. But this is a normal select, so why is that operator here? And why has SQL Server added an aggregation operator?

And there is one more thing. The Person table has 19,972 rows. Why did the optimizer choose to use a Nested Loops join? Would a Merge Join or Hash Match join not be better in this case? When I execute the query with SET STATISTICS IO ON added to the batch, I get a confirmation that the Nested Loops join operator is in fact causing overhead. The 211 logical reads on the Person table are expected for this type of query; the 40,242 logical reads on PersonPhone appear to be excessive.

Step by step

It is tempting to immediately jump into tuning mode and try to find ways to change the join type, and perhaps get rid of the Assert and Stream Aggregate operator that I never asked for. But the optimizer usually has good reasons for what it does, so I exercise restraint. I first need to know why those operators are here. And that means that I need to make sure I understand their role within the execution plan. The best way to do that is to work out exactly what happens when the execution plan runs. That can take a lot of time, but you can consider that an investment. Once you know why this plan looks the way it looks, you can apply that knowledge every time you see a similar plan.

The start

Execution plans always start with the top-left operator (SELECT) calling its child operator (Compute Scalar). This is the GetNext() call, requesting a row. Most operators cannot return a row before they receive one, so most operators respond to the first GetNext() call by immediately calling their child operator. The operators in this plan are no exception. When execution starts, SELECT calls Compute Scalar, which calls Nested Loops, which in turn invokes Index Scan (the child operator on its top, or “outer” input).

The Index Scan operator is the first one that actually does something interesting. The Output List property shows which columns it returns: BusinessEntityID, FirstName, MiddleName, and LastName. There is no Predicate property set, so it returns all rows. Looking at the original query, this is not surprising. But of course GetNext() requests just a single row, so the Index Scan will for now only navigate to the first page, grab the first row from that page, and return the values from that row to its parent (Nested Loops). It passes control back to Nested Loops, while maintaining state so that it can continue where it left off when called again.

Nested Loops, though, does not call Index Scan again. After it receives a row from the top input, it moves to the bottom input by calling the GetNext() entry point of the Assert operator.

The bottom input

The Assert operator responds to this GetNext() request by immediately calling the Stream Aggregate operator, which in turn requests a row from the Clustered Index Seek.

The properties of the Clustered Index Seek reveal that this operator will use the structure of the clustered index to navigate directly to the page that stores the phone numbers of the person just read from the Person table. At this point, there are two possibilities: a matching row is found, or not. Let’s for now assume that the operator does find a row. The Output List property shows only the PhoneNumber column, so this value is now returned to Stream Aggregate.

The Stream Aggregate operator in this case does not have the Group By property. This means that it produces a scalar aggregate: a single row with the aggregation results of all input rows. To do this it must first read all rows from the input. So after Clustered Index Seek returns its first row, Stream Aggregate updates some internal data structures and then immediately calls Clustered Index Seek again. However, I happen to know that there are no persons with more than one phone number in the PersonPhone table. So when Clustered Index Seek is called again, it will not find a second matching row and instead return an “end of data” indicator.

Looking again at the properties of Stream Aggregate, I see that it returns a row with two columns, called Expr1004 and Expr1005. Expr1004 is defined as Count(*), the number of rows, so this will be 1. The Expr1005 column is defined as ANY(PhoneNumber). ANY is an aggregate function that we cannot use in our T-SQL, but the optimizer can use it in execution plans; it means “just give me one of the values from the input, I don’t care which one”. In this case, with just one row in the input, there is actually little choice.

This row, with Expr1004 set to 1 and Expr1005 set to the phone number for the person from the current row in the Person table, is then passed to the Assert operator. Assert evaluates an expression (shown in its Predicate property) and then passes the row unchanged when that expression evaluates to NULL, or stops execution with an error when it evaluates to another value. In this plan, the Predicate property of the Assert is “CASE WHEN Expr1004 > 1 THEN 0 ELSE NULL END” (edited slightly for readability). Expr1004 is equal to 1, so this expression evaluates to NULL. Assert passes the row. However, its Output List property shows that it passes only Expr1005 (the phone number). Expr1004 is not needed anymore.

Returning the first row

After Assert returns Expr1005 to the Nested Loops operator, it can combine the data from both inputs. There is no Predicate property on this Nested Loops, so whatever is returned from the bottom input is assumed to be a match. (A safe assumption because the bottom input used a value from the top input to return only matching data; this can be seen in the Outer References property of the operator). The Nested Loops operator now returns a row with the four columns from the Person table, plus Expr1005. Which, as we remember, is “any” of the phone numbers found for the person, which was just a single one in this case.

The Compute Scalar operator then receives this row, and computes a new column, Expr1003, which is set to be equal to Expr1005. This sounds like an utter performance waste. And it would be, if SQL Server actually did this. However, the properties in an actual execution plan show no actual row count and no actual execution count for this operator. What we see here is an artefact of how plans are generated. Redundant Compute Scalar operators are sometimes left in the plan, but do not actually execute. They are removed from the plan in a final cleanup stage, after the plan was already selected, but this is not reflected in the plan we see. Seeing an operator with no values for the “Actual Number of Rows” and “Number of Executions” properties is a dead giveaway that the operator was removed because it was either redundant or its functionality was collapsed into another operator. In this case, it was redundant. So in reality, the row formed in the Nested Loops operator is returned to SELECT, which represents the client application in the execution plan.

An unmatched row

The SELECT operator will immediately request the next row, and as we now know it doesn’t request this of Compute Scalar but directly of Nested Loops. Nested Loops then needs to check to see if the bottom input has more rows, so it requests the next row from Assert, which relays the request to Stream Aggregate. But Stream Aggregate in this case was set up as a scalar aggregate that returns a single row from the entire input, so it can never return a second row. It returns the “end of data” indicator, which Assert faithfully passes on to Nested Loops.

Since the bottom input is now exhausted, Nested Loops moves back to its top input and requests the next row from the Index Scan. This operator will continue where it left off last time, so now the second row from the indicated index is returned. After receiving this second row, Nested Loops reinitializes its bottom input and the entire process starts over.

We already know what happens when the Clustered Index Seek does find a row. For the sake of keeping this interesting, let’s assume the second row does not have a match in PersonPhone. After Nested Loops kicks off the process, the operators call each other until Clustered Index Seek searches for a matching row and doesn’t find anything. So it can’t return a row to Stream Aggregate. It immediately returns the “end of data” indicator.

The Stream Aggregate operator usually does not return anything if it doesn’t receive any rows itself. But in this plan it is doing scalar aggregation, and that is the exception. Scalar aggregation always returns one row. In this case, Expr1004, the Count(*), will be set to 0; Expr1005 will be NULL. Assert receives this row, checks its Predicate, and since Expr1004 is not more than one, it will pass the row to the Nested Loops operator. This operator adds the NULL value in Expr1005 as the phone number, and returns this to its caller, to be sent to the client. And that’s how rows without phone numbers are handled.

Obviously, this process repeats over and over until all rows from the Person table have been handled and the query finishes. We now know exactly how the query handles all cases in our test data: both persons with and persons without a phone number. We know exactly what the Assert and Stream Aggregate operator do. But we do not yet understand WHY. So far, they have not affected the results. Why are they here?

What about multiple matches?

The Predicate expression of the Assert operator gives a clue as to why the operator is in the plan. Obviously, the optimizer wants to do something different when a person has multiple phone numbers. I know that this is not the case in the data in AdventureWorks, but SQL Server doesn’t know that. So it needs to handle this situation, and apparently it needs handling in this specific case. Let’s look. First in theory, and then test to verify.

Nobody needs more than one phone!

So now let’s assume we added a row to PersonPhone for a person that already had a row there. At some point the Index Scan will read this person and return it to Nested Loops, which reinitializes its bottom input and requests a row; the operators call each other and Clustered Index Seek finds the first matching row and returns it to Stream Aggregate. Nothing new so far; this is the same as we saw for the first row. However, now Stream Aggregate requests another row and actually gets a row returned. It updates its internal data structures and, again, requests the next row. Clustered Index Seek does not find a third phone number for this person, so “end of data” is returned.

After it has read all the input rows Stream Aggregate now uses its internal data structures to return the correct values. Expr1004, the Count(*), is equal to 2; Expr1005, the ANY expression, returns one of the two phone numbers that were input. Which one is impossible to tell; the way ANY chooses its returned value is undocumented. But we do know that it will be one of the phone numbers for this person.

However, that phone number will never make it to the client. Since Expr1004 is now 2, the Predicate of the Assert operator evaluates to 0, not NULL. Any non-NULL result causes the Assert operator to immediately stop processing of the execution plan and throw an error. Which error that is can’t be seen from the execution plan. So perhaps this is a good time to run a test and verify what happens.

Let’s test

In order to test this, I will add a row to the PersonPhone table. I can insert it and then delete it later, or use a transaction that I roll back when done. I chose the latter:
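
(A sketch of such a test; the BusinessEntityID, phone number, and phone number type in the INSERT are arbitrary values that merely have to satisfy the table’s constraints, and the query is the reconstruction from earlier in this post.)

BEGIN TRANSACTION;

INSERT INTO Person.PersonPhone
        (BusinessEntityID, PhoneNumber, PhoneNumberTypeID, ModifiedDate)
VALUES  (1, N'123-456-7890', 1, GETDATE());

SELECT     p.BusinessEntityID, p.FirstName, p.MiddleName, p.LastName,
           (SELECT pp.PhoneNumber
            FROM   Person.PersonPhone AS pp
            WHERE  pp.BusinessEntityID = p.BusinessEntityID) AS PhoneNumber
FROM       Person.Person AS p;

ROLLBACK TRANSACTION;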

When I run this query, I get partial results. The query returns its first 12,415 rows, but then an error occurs and processing stops. The messages pane shows an error message:
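
(The exact text is not reproduced here; what I would expect to see is SQL Server’s well-known subquery cardinality error. The wording below is from memory and may differ slightly between versions.)

Msg 512, Level 16, State 1
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.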

And there we have our explanation. The SQL language does not like “random” results. When we use a subquery in a place where just a single value is expected, it gives us the benefit of the doubt. “My developers know what they are doing, they will know that this will never return multiple rows”. But true to the paradigm “trust, but verify”, it does add two operators, Stream Aggregate and Assert. These two operators ensure that an error message is thrown if it turns out the developer did not actually know what they were doing.

Can we tune this?

We now know why the Assert and Stream Aggregate operators are there. And this also “sort of” explains why a Nested Loops join algorithm was used. Merge Join and Hash Match both do a single pass over both inputs. The trick of aggregating all rows and seeing if there is more than one would not work if we read all rows instead of only rows matching the current person.

That doesn’t mean there are no other ways to achieve this result. The optimizer could have opted for a plan with a Merge Join or Hash Match by using a variation on this trick. It could do a Clustered Index Scan of all rows from PersonPhone, then feed those rows into a Stream Aggregate with a Group By property. This would return one row for each BusinessEntityID, with one of the phone numbers and a count. The same Assert can then stop execution when the count is too high. That would implement the same test for persons with multiple phones with just a single pass over the input, so now the two streams can be combined in a Merge Join or Hash Match join. Why didn’t the optimizer make that plan?

There can be multiple answers. Perhaps that plan introduces other overhead, making it more expensive. Perhaps the assumptions built into the costing rules make the optimizer think it is more expensive. Perhaps the set of rules used by the optimizer does not include a path from the actual plan to the theoretic one above. Or perhaps the optimizer stopped searching for better plans before it found this option.

Rewrite attempt

Sometimes rewriting a query can work to nudge the optimizer into a different plan. That is not guaranteed to work; I have been outstubborned by the optimizer more often than I care to admit. However, in this case it turns out that it’s fairly easy to convince the optimizer. Start with a CTE that reads the PersonPhone table and aggregates by person; join that to the Person table. In essence, I just write a query that matches the plan I am looking for.

Now I can’t directly use an Assert in my query, but I can use a CASE expression that will force a run-time error when a person has more than one phone. I also do not have an ANY aggregation function. But since I will only actually return a phone number when the number of rows is 1, I can pick almost any aggregation function. In this case I decided to use MIN.
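
(The original rewrite is not shown here, so below is one way to express it. I use an integer division by zero to force the runtime error, which also explains why the error message differs from the original query; the CTE name, column names, and aliases are mine.)

WITH PhoneAgg AS
(
    SELECT   BusinessEntityID,
             MIN(PhoneNumber) AS PhoneNumber,
             COUNT(*)         AS NumPhones
    FROM     Person.PersonPhone
    GROUP BY BusinessEntityID
)
SELECT     p.BusinessEntityID, p.FirstName, p.MiddleName, p.LastName,
           CASE WHEN pa.NumPhones > 1
                THEN CONVERT(nvarchar(25), pa.NumPhones / 0)  -- forces a runtime error, mimicking the Assert
                ELSE pa.PhoneNumber
           END AS PhoneNumber
FROM       Person.Person AS p
LEFT JOIN  PhoneAgg      AS pa
      ON   pa.BusinessEntityID = p.BusinessEntityID;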

The query above returns the same data as the original query. It also generates a runtime error, just as the original query, although the error message is different. And it results in the type of plan I was hoping for:

This plan uses Hash Match to join the two data streams. The top input is PersonPhone, aggregated by BusinessEntityID. The bottom input is Person. Both inputs are scanned only once, and a lot of I/O is saved this way as demonstrated by the output of SET STATISTICS IO:

The estimated cost of the execution plan for this query is 0.959. For the original plan, that was 3.921. This plan actually has a much lower estimated cost, and saves a huge amount of I/O. It does introduce a Hash Match operator, making it blocking and more memory hungry. I am going to guess that the optimizer would have picked this plan if it had found it while exploring alternatives. But that is just a guess.

Which one to use?

We now have two versions of the query. It is tempting to say that we should always use one with the lowest estimated cost, and the least I/O. But that is not always the case. The cheaper query does need more memory (in my tests it requests a memory grant of 8416KB), which can be problematic on servers that are already starved for memory. The Hash Match operator causes the plan to be partially blocking, as opposed to a fully streaming plan for the original query. And when there are persons with more than one phone, the original query stops executing as soon as it encounters one such person; the new version first finishes the entire build phase of the Hash Match and will only stop when it encounters the first problem row in the probe phase. Three reasons that might convince you, based on the requirements and limitations of your data and your server, to choose the original query.

Plus, obviously, the original query is a lot easier to understand and maintain; I would never allow the result of my rewrite attempt to go in production code without a long comment to explain what’s going on.

Conclusion

In this post I looked at a sample execution plan and dove into the details of each of the operators to understand their function relative to each other and to the plan as a whole. This is what every SQL Server professional should do when looking at an execution plan.

In this case, looking at all the details of the execution plan helped us find a way to speed up the query (with some downsides). That will not always be the case. Still, when you look at an element in an execution plan you do not understand, you always have to do the work to build that understanding. The result may not be a way to speed up the query, but you only know that for sure after you put in the work.

This post was very verbose, and very detailed. That’s because I also used this post to illustrate some basics of reading execution plans. That’s not how my mental process normally works when analyzing execution plans. I would look at the plan and tune out the parts I already understand. (I would still return to them if needed to make sense of the rest.) For this plan, that would leave the Assert, the Stream Aggregate, and the choice of physical join type. I would start with those operators. Upon seeing the Predicate of Assert, I would try to find where Expr1004 is computed, which would bring me to the Defined Values property of the Stream Aggregate operator. And at that point, I would understand the whole logic.

For future posts in the Plansplaining series, I plan to assume more basic understanding of execution plans than what I did for this post. They will be somewhere between the very long description in this post, and the very short description in the first paragraph.


The many (too many?) reads of a many to many Merge Join

SQL Server’s optimizer can choose between four different physical operators for joining data. The Merge Join operator is usually the best choice if the input data is already sorted by the join columns. It processes both inputs at the same time, every time advancing the input that has the lowest value of the join key. This algorithm, known as “balance line processing” back in the olden days of magnetic tape storage, always finds all matches with just a single pass over each of the inputs.

One to many

One of the most important properties in a Merge Join operator is Many to Many. Far too few people are aware how important this property is. Ideally, the value of this property is False. In that case, the Merge Join runs in “one to many” mode, which is every bit as efficient as the introduction suggests. To see an example of this ideal case, run the query below in the AdventureWorks sample database:
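
(The query is not reproduced here; a reconstruction of the kind of query meant is a join of SalesOrderHeader to SalesOrderDetail on their shared SalesOrderID column. The column list is my choice.)

SELECT     soh.SalesOrderID, soh.OrderDate, soh.CustomerID,
           sod.SalesOrderDetailID, sod.ProductID, sod.OrderQty
FROM       Sales.SalesOrderHeader AS soh
INNER JOIN Sales.SalesOrderDetail AS sod
      ON   sod.SalesOrderID = soh.SalesOrderID;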

The execution plan for this query looks like this. Both input tables are scanned. That makes sense: there are no filters in the query, so we need all rows. Scanning is the optimal way to read all rows. A Merge Join operator combines the inputs. If you look at the properties of the Merge Join, you’ll see that the Many to Many property is False: this is a one to many join.

You can also see that the optimizer has switched the order of the two inputs. For a one to many Merge Join, the input without duplicates has to be the top input. In this case, the primary key constraint on SalesOrderID in the SalesOrderHeader table guarantees that there are no duplicate values in that input, so it had to be on top.

As stated above, the Merge Join algorithm always advances the input with the lower value in the join columns. If the values are equal, it always advances the bottom input, the one that can have duplicates. If the next row does indeed have the same SalesOrderID as the previous row, it will still match with the current row from the top input. The top input only advances when the bottom input has reached the next higher value. Because the top input has no duplicates, advancing it will also produce the next higher value from that input.

Many to many

The Merge Join operator has complete symmetry in the supported logical join operations. Whether outer, semi, or anti semi – both the left and the right versions work. This means that there is no reason to worry about “many to one” (as opposed to “one to many”): the optimizer would simply reverse the order of the inputs. When both inputs have guaranteed uniqueness (“one to one”), then either input can be at the top side and it will work. But what if we have no uniqueness guarantee in either of the inputs? The query below illustrates this.
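
(Again a reconstruction: a join between SalesOrderDetail and ProductVendor on ProductID, a column that is not unique in either table. The narrow column list is my choice; it allows both inputs to be read from an index that is already sorted on ProductID.)

SELECT     sod.SalesOrderID, sod.SalesOrderDetailID, sod.ProductID,
           pv.BusinessEntityID
FROM       Sales.SalesOrderDetail AS sod
INNER JOIN Purchasing.ProductVendor AS pv
      ON   pv.ProductID = sod.ProductID;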

The execution plan for this query looks quite similar to the one we saw before. However, in this case the Merge Join has a very significant difference, only visible in its properties. The Many to Many property is True in this case, and that makes this execution plan far less efficient than the previous example.

Before going into the reasons why a many to many join is less efficient, let’s first check to see what would happen if we were to use the same algorithm. At some point, it finds a match. Both inputs have the same ProductID value, so the bottom input advances to find the next ProductVendor row. If this is for the same ProductID, it is another match with the current row from SalesOrderDetail and the ProductVendor input will once more be advanced. This continues until we hit a new ProductID in ProductVendor. At that point the next row from SalesOrderDetail is read. But what if that row has the same ProductID as the previous row? The bottom input has just advanced past all the matching rows, which are now needed again. But execution plan operators do not support a GetPrevious() method; only GetNext() exists. How does the Merge Join produce the required join results in this case? The answer is that it stores the data it might need again in tempdb.

You can see this when you rerun the queries with SET STATISTICS IO ON. The query with a one to many Merge Join shows only the logical reads for the two input tables. However, the query with a many to many Merge Join has a third line of output, for a “Worktable”. This represents the scratch area that the Merge Join operator has to use because there is no other way to retrieve “earlier” rows when the same join column value is repeated.

But why so many?

What surprises me in the statistics io output of the last example is the activity on the worktable. Why 18,013 logical reads? Doesn’t that seem a bit excessive in comparison to the amount of data processed? And why 19 scans; where does that number come from?

One way to find out is to look at what happens when we change the data. For instance, how will these numbers change if we add one extra row to the ProductVendor table before running the query? Here is the code I used to try this:
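
(The code is not reproduced here; below is a sketch of the kind of test I mean. The filler values for the vendor columns are arbitrary and only need to satisfy the table’s constraints, the vendor is picked with a subquery so that the foreign key is satisfied, and the ProductID in the variable is what I varied between runs.)

DECLARE @ProductID int = 906;  -- change to 907, 908, ... to repeat the experiment

BEGIN TRANSACTION;

INSERT INTO Purchasing.ProductVendor
        (ProductID, BusinessEntityID, AverageLeadTime, StandardPrice,
         LastReceiptCost, LastReceiptDate, MinOrderQty, MaxOrderQty,
         OnOrderQty, UnitMeasureCode, ModifiedDate)
SELECT  TOP (1)
        @ProductID, v.BusinessEntityID, 17, 20.00, 21.00, GETDATE(),
        1, 100, NULL, N'EA', GETDATE()
FROM    Purchasing.Vendor AS v
WHERE   NOT EXISTS (SELECT *
                    FROM   Purchasing.ProductVendor AS pv
                    WHERE  pv.BusinessEntityID = v.BusinessEntityID
                    AND    pv.ProductID = @ProductID)
ORDER BY v.BusinessEntityID;

SET STATISTICS IO ON;

SELECT     sod.SalesOrderID, sod.SalesOrderDetailID, sod.ProductID,
           pv.BusinessEntityID
FROM       Sales.SalesOrderDetail AS sod
INNER JOIN Purchasing.ProductVendor AS pv
      ON   pv.ProductID = sod.ProductID;

ROLLBACK TRANSACTION;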

I executed this code multiple times with different values for ProductID in the inserted row. The effects on the statistics io output are shown in the table below. (For the record, I actually ran many more tests, with different values, different tables, other queries, etc.; these are just the most interesting results).

ProductID | Scans | Logical reads
906       | 19    | 18,013
907       | 20    | 18,245
908       | 19    | 18,015

So … adding one row to the ProductVendor table can leave the results unchanged, or add just 2 logical reads, or add a scan and 232 logical reads. How does this make sense? It actually does if you also look at the existing data in the database. There currently are no rows for ProductID 906, one row for 907, and two rows for 908. So adding a row that is not a duplicate has no effect on the worktable, adding a row to cause a duplicate increases the scan count and causes a lot of extra logical reads, and adding further duplicates for an already duplicated value only adds two logical reads. This gives me the information I need to understand how this works.

The explanation

Adding a non-duplicate does not affect the worktable. Until today I thought that a many to many Merge Join simply copies all rows from one input to the worktable, in case they are needed. Now I know I was wrong. Only actual duplicates in the bottom input cause worktable activity. Apparently, when a many to many Merge Join finds a new value in the bottom input it peeks ahead at the value after that to see if this new value is a duplicate or not. If it is, it copies rows from the top input to the worktable as it processes them. The data in the worktable can then be read for all other rows in the bottom input with the same value in the join columns. If it is not a duplicate, then the worktable is not touched at all. I was able to confirm this by creating a query that does a many to many Merge Join where the bottom input could have duplicates but in reality didn’t. The statistics io output showed zero scans and zero logical reads.

Adding an extra row with an already duplicated value does impact worktable activity, but only by an additional two logical reads. The value was already a duplicate, so the rows from the top input were already in the worktable. By adding one extra row in the bottom input, I caused one extra pass over the data in the worktable. That takes two logical reads, which makes perfect sense for reading the 153 SalesOrderDetail rows with ProductID 908. They all fit on a single page. The two reads are for the root page and the leaf page of the worktable’s clustered index.

Adding a row to create a duplicate has a much higher impact. When I added ProductID 907, the number of logical reads went up by 232. Two of those are produced when the duplicate row is processed, for reading the data from the worktable. That means that writing to the worktable incurs 230 logical reads. There are 113 SalesOrderDetail rows for ProductID 907, so it appears that the number of logical reads is equal to two reads per row added to the worktable, plus 4. Further experiments with other values confirm that this formula is correct. This leads me to the conclusion that the worktable used for a many to many Merge Join has the same internal structure as the worktable used by a Table Spool (see this blog post): it uses a clustered index that it has to traverse for every added row.

The number of scans for the worktable is always equal to the number of values with duplicates in the bottom input. In the original ProductVendor table, 19 products that are also in SalesOrderHeader have more than one vendor. How often these duplicated values appear does not impact the scan count. This seems counter-intuitive to me. I would expect to see one scan for every time the worktable serves a copy of the data. In reality the scan count goes up when the worktable is cleared out and filled with new data. I have no explanation for this. If anyone does: please use the comment section below to let me know!
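The count of 19 can be approximated with a query along these lines (a sketch; the exact join conditions of the original test query are not shown in this post, so the number you get may differ slightly):

SELECT  COUNT(*) AS ProductsWithMultipleVendors
FROM    (SELECT   pv.ProductID
         FROM     Purchasing.ProductVendor AS pv
         WHERE    EXISTS (SELECT *
                          FROM   Sales.SalesOrderDetail AS sod
                          WHERE  sod.ProductID = pv.ProductID)
         GROUP BY pv.ProductID
         HAVING   COUNT(*) > 1) AS dup;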

Conclusion

Whenever I see something in an execution plan that I do not understand, I dive in and try to figure out what is happening. In this case, it was the high number of logical reads that triggered me to investigate. I now have a better understanding of how Merge Join works, and where these numbers come from.

This knowledge can come in handy. I already knew how important it is to have constraints, so that the optimizer has maximum information. Even when there are no actual duplicates in one of the inputs, if the optimizer thinks there might be duplicates it will use a many to many Merge Join. And though this does not write to tempdb unless there actually are duplicates, it still uses more overhead in the algorithm. Plus, the additional estimated cost of writing to and reading from the worktable can even cause the optimizer to reject a many to many Merge Join and use a Hash Join instead, whereas a one to many Merge Join would have been better.

In cases where you really do need a many to many Merge Join, the optimizer tries to have the input with the least amount of duplicated values as the bottom input, in order to reduce the activity in tempdb as far as possible. But the optimizer doesn’t always have a full understanding of your data. If you see a many to many Merge Join in an execution plan, check the statistics io output to see how much tempdb activity it triggers, and ask yourself whether the optimizer made the right choice about which input to put on top and which at the bottom. If you think it made a suboptimal choice, verify this (using a FORCE ORDER hint in a test environment is the easiest way). If you are right, then try to find ways to get the optimizer to change its mind. Do not use a hint for this unless you really see no other alternative. Options such as creating or updating statistics, rewriting the query, or ensuring that all known data rules are represented as constraints and that these constraints are actually enabled and trusted, should all be tried first.

Willkommen, Bienvenue, Welcome

It may be surprising to see a word of welcome posted in December 2017, on a blog that has posts going back as far as 2006. And yet, this is my very first new blog post on this site.

All the content before this post was originally published at SQLblog.com. SQLblog.com is a fantastic website that hosts many bloggers. It has hosted me for almost twelve years as well. But it was time to move house, to create a new website for the content I have planned for the future.

For now, this website is just my blog. But that will change in the near future. I have exciting plans that I have already been working on for some time. It’s too early to reveal them (but for a small first hint, look at the domain name), but I am excited, and I hope you will be excited too when it all goes live.

Three people were a great help in setting up this site and moving my existing content. Thanks, Adam Machanic, Brent Ozar, and Richie Rump. Without your efforts, this site would not have looked nearly as good as it does, and the move would not have been nearly as smooth.

If you notice anything that does not work as it should, please let me know! I will try to fix it as soon as possible. And obviously: watch this space for announcements of new content!

Three new sessions in three months

Waking up from my blog-hibernation at this time is probably not ideal. I should spend my time on other things. For instance, on creating slide decks and demo code for the three completely new sessions that I will be delivering at various conferences within the next three months.

So why am I still taking the time to write this? Because I am super excited about these new sessions, that’s why!

Hash Match, the Operator

The idea for this session is now almost two years old. I cannot remember why, during break time at another conference, I found myself wondering whether it would be possible to create a full session on just a single showplan operator. I briefly considered and quickly rejected a few unlikely candidates, until I started thinking about Hash Match … and got more enthusiastic by the minute. Suddenly, the question was not how to fill a complete session on a single operator, but how to fit everything in just one single session!

I must admit that I still hesitated a long time before deciding to actually submit the abstract for this session to conferences. But earlier this year I started including it in my submission list for SQL Saturday events, and my friends at SQL Saturday Holland decided to allow me to try it out at their venue. So now I’ll actually have to create the session, and do so in time for a smooth delivery on September 30.

I have plenty of ideas in my head, and if I manage to make only half of them come true then I have really high hopes for this session. It will absolutely be a level 500 session, where I will explore internals (documented and undocumented) of everything that a Hash Match operator can do in your execution plans. I plan to discuss in detail how each of its 12 supported logical operations is executed, how memory grants are computed and shared between multiple copies of the operator, and how spills are handled. I must add, though, that I am still researching a lot of the things I plan to cover, and I still need to find the answers to some of the questions I have been asking myself.

Get Creative with Service Broker

My next new session is for the PASS Summit. And this session is a bit outside of my normal speaking areas, because it is neither on query tuning and related subjects, nor on database design and normalization. It is about Service Broker.

Service Broker was added as a feature to SQL Server back in 2005, and I have always been intrigued by its options. Though designed as a tool to support building “asynchronous, loosely coupled applications”, it is in fact a very good and flexible tool that can be used in other situations as well. However, it never gained a lot of popularity, probably mostly due to its (overly?) complex architecture. And it’s hard to change that in a conference session: almost every session on Service Broker I have ever attended follows the same pattern: start by explaining the architecture and the basics … and then an hour has passed and the session is over.

I hope my session will be different. Instead of explaining all the nitty gritty details, I will only provide a short, very simplified description of only the things you really need to know to start using Service Broker. I will supply scripts for the entire setup, so I don’t have to explain all the steps and all the components. My plan is to cover this entire theoretical part in at most 15 or 20 minutes, leaving the rest of the session for only demos from real-world case studies.

I will present three real-world applications where I implemented Service Broker to achieve something that would have been much harder, if not impossible to achieve otherwise. I will explain the problem the organization had, the way Service Broker fixed this, and show a simplified version of the actual code used to implement those solutions.

Adaptive Query Processing

Finally, just two weeks (and two intercontinental flights) after PASS, I will also speak at SQL Server Live! 360 in Orlando. And I will actually deliver not one but two sessions there. On Thursday afternoon, I will first present “T-SQL User-Defined Functions, or: How to Kill Performance in One Easy Step”, a session that I have presented many times already; it is a really fun session to present, and based on the feedback I receive, the audience also likes both the presentation style and the lessons learned.

However, the new session for this conference follows immediately after, and it’s called “Adaptive Query Processing”. As indicated by the title, this session focuses on the three new SQL Server 2017 features that are collectively called Adaptive Query Processing: Batch Mode Memory Grant Feedback, Batch Mode Adaptive Join, and Interleaved Execution. Now if you are looking for a session that will briefly explain what each of these features does, then show some idealized demo where this all works perfectly, and finish it off by promising that you will never have performance problems again, look elsewhere: Microsoft has plenty of salespeople on payroll who will happily provide you with their sales pitch; I don’t feel the need to do their work for them.

Instead, I will dig a bit deeper than most other presenters do, and explain how these features actually work, how they are implemented, what functionality is exposed and what isn’t. As part of my research I plan to look for demos where these features work as intended, but also for demos where they backfire – and if I find them, I will certainly share them with you!

Can I be there?

If after reading this you are as enthusiastic about these sessions as I am, and feel you want to see them all, then I have some bad news, some good news, and some better news for you.

The bad news is that SQL Saturday Holland is at maximum capacity. You can still register to be put on a waiting list, but if you are not living relatively nearby, that probably does not make a lot of sense. Of course, I will continue to submit the “Hash Match, the Operator” session to future SQL Saturday events, so who knows: you might be able to catch it later.

The good news is that both the PASS Summit and Live! 360 still have tickets for sale, so if you manage to get the budget and approval for a few days away from the office, immersed in deep learning, you can catch me (and lots of other fantastic speakers!) at those events.

And the better news is that, as a speaker, I can offer a discount on the admission price for Live! 360. If you use code LSPK43 when registering (or when you use this link), you get the full 5-day package for just $1,885. This is a saving of $500 on the standard price (or a saving of $300 on the Early Bird price). And unlike the Early Bird price, this offer does not expire; it remains valid until the start of the event.

Back to work!

And with that, I stop writing this blog and return to preparing my sessions. I still need to do tons of research, create several slides, and prepare lots of demos. Oh, and getting in some rehearsal time for all those new sessions might be a good idea as well.

I am already looking forward to going to those conferences, talking about my passions, and meeting old and new friends. Will you be one of them?

Misconceptions on parameter sniffing

In my previous post, I explained the basics of parameter sniffing, and then built on that to explain the lesser known mechanics of variable sniffing and cardinality sniffing. However, I also sneakily inserted a few comments on misconceptions about the performance impact of parameter sniffing, with the promise to explain in more detail later. Well … it is later now, so here is the promised explanation!

Parameter sniffing’s bad rep

Lots of people in the SQL Server field will claim that parameter sniffing is a bad feature. That is not the case. Just think about it for a moment – the feature has been in the product since at least SQL Server 2005, probably even earlier. If it were truly a bad feature, then surely just removing the feature would have been a very simple way to improve the product. But Microsoft has done no such thing, and whatever horror stories on Microsoft you believe, they have not become one of the world’s biggest companies by foolishly refusing to take opportunities to improve their software for almost no investment.

However, there is no denying that parameter sniffing can cause terrible performance issues. I will explain by using a slightly modified version of the example that my good friend Grant Fritchey likes to use. (Note that I used the AdventureWorks2012 sample database, but other versions of AdventureWorks should expose the same behaviour).

CREATE INDEX ix_Address_City
ON Person.[Address] (City);
GO

CREATE PROC dbo.GetAddressByCity
            @City nvarchar(30)
AS
SELECT      a.AddressID,
            a.AddressLine1,
            a.AddressLine2,
            a.City,
            sp.[Name] AS StateProvinceName,
            a.PostalCode
FROM        Person.[Address] AS a
INNER JOIN  Person.StateProvince AS sp
      ON    sp.StateProvinceID = a.StateProvinceID
WHERE       a.City = @City;
GO

EXEC dbo.GetAddressByCity @City = 'Mentor';

SET STATISTICS IO ON;
EXEC dbo.GetAddressByCity @City = 'London';
SET STATISTICS IO OFF;
GO

DROP PROC dbo.GetAddressByCity;
GO

DROP INDEX ix_Address_City ON Person.[Address];
GO

Since the tables used in this example are fairly small, you will not actually notice any slowness. But if you look at the output generated by the SET STATISTICS IO option, you should see that there actually is a serious issue with this query. The second execution of the stored procedure takes 871 logical reads on the Address table, and 868 logical reads on the StateProvince table – insane numbers when you realize that the entire Address table is stored in just 216 pages, and StateProvince is even just 3 pages in size!

If you look at the execution plan, you will start to see why this is the case:

[Image: execution plan of the GetAddressByCity query]

An index seek on the new index is used to find the Address rows matching the supplied parameter, a Key Lookup then retrieves the remaining columns for each of those rows, and a clustered index seek in StateProvince is used to do the join. This type of plan is very good when the filter is expected to be very selective, which is the case for Mentor (with just one matching row), the value that was sniffed when the plan was compiled during the first call. For the second call, the same plan was then reused, but because the filter on London (with 435 matches) is far less selective, the plan is now actually quite bad.

The biggest issue with such cases of bad performance caused by parameter sniffing is that they tend to be hard to troubleshoot. The performance of the GetAddressByCity stored procedure depends on the value that happens to be passed in on the first execution after the previous plan cache entry was invalidated or removed. But most of the common reasons for plan invalidation are beyond our control, so to the user of the system it can appear that the performance of this procedure changes at “random” moments. And obviously, once the DBA is ready to respond to the ticket, a new recompile might already have occurred and the performance might be fine again.

Another very common cause for bad parameter sniffing is the use of a single generic procedure to return rows based on a series of parameters that are all considered optional for the search. A simple example would be a procedure dbo.GetAddressByCityOrPostalCode that uses two parameters @City and @PostalCode, and that has a query with a WHERE clause such as “WHERE (City = @City OR @City IS NULL) AND (PostalCode = @PostalCode OR @PostalCode IS NULL)”. When this stored procedure is first called with @PostalCode = ‘W10 6BL’ and @City = NULL, a plan will be compiled and stored in the plan cache that is optimal for this combination of parameters. If a later execution uses @PostalCode = NULL and @City = ‘Mentor’, then the results returned will still be correct but the performance will likely be bad, because the query is still executed using a plan that is optimized for a search by PostalCode only. For instance, it could use a plan that uses an Index Seek in an index on PostalCode to find rows matching the second part of the WHERE clause (which would find all rows on the second execution), then use a Key Lookup to find the rest of the data, and then filter down on City = ‘Mentor’. For this type of stored procedure you really want multiple plans stored, with the one that is best for the current parameters used each time, but that’s just not how SQL Server works.
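A minimal sketch of such a procedure (the body is an illustration; only the procedure name and the WHERE pattern come from the description above):

CREATE PROC dbo.GetAddressByCityOrPostalCode
            @City       nvarchar(30) = NULL,
            @PostalCode nvarchar(15) = NULL
AS
-- "Optional parameter" pattern: each predicate is switched off when its parameter is NULL.
SELECT  a.AddressID, a.AddressLine1, a.City, a.PostalCode
FROM    Person.[Address] AS a
WHERE   (a.City = @City OR @City IS NULL)
AND     (a.PostalCode = @PostalCode OR @PostalCode IS NULL);
GO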

Bad parameter sniffing in a production system can be hard to troubleshoot. The problem only materializes when a procedure that is susceptible to bad parameter sniffing is recompiled, and the parameter values sniffed during that recompile happen to cause a plan that is bad for many other values. This is especially rare when the issue is caused by skewed data distribution, because the really bad performance tends to be caused by a recompile for a rare outlier value, and these are not often used anyway. However, when it does hit you, it often hits you hard. When I see a database application that usually performs well but that at unexpected and unreproducible moments suddenly starts to perform badly (and then sometimes reverts to good performance later, again for no apparent reason), I always have bad parameter sniffing high on my list of likely root causes. The seemingly random switch from good to bad performance can be due to a recompile with a value that causes a bad plan. These recompiles occur at semi-random moments because there are lots of reasons that can trigger them (metadata changes, (auto) updates of statistics, memory pressure), and some of them are beyond our control.

The fanboys (and girls) are wrong

Many of the experts in the SQL Server community try to combat the misconception that parameter sniffing is bad. But they unfortunately use the wrong arguments. They say that yes, there are cases of “bad” parameter sniffing that you will have to work around (which is indeed correct), but they then also claim that apart from these exceptions parameter sniffing is “usually” or even “always” good. I have heard and read such claims from some very smart and experienced people that I personally hold in high esteem, such as Brent Ozar, Grant Fritchey, Gail Shaw, and many others. And I myself have also held and spread this belief for a long time. Until recently when I was trying to find a few examples to actually illustrate the benefit of parameter sniffing – and failed to find any!

I then realized that from a parameter sniffing point of view, there are just two types of queries: those that are affected by parameter sniffing and those that are not. A lot of queries fall in the “not affected” category. I already gave some examples above:

1. If a query filters on the primary key, the estimated rowcount when based on a sniffed value will be 1, but the estimated rowcount for the same query without sniffing (e.g. when using it as an ad hoc query with a variable) will also be 1 – so we’d get the same plan without sniffing as we do with sniffing.

2. If a query filters on a key with a fairly even data distribution, for instance with every value occurring somewhere between 90 and 110 times, then the estimate for a sniffed value can be any value between 90 and 110, and the estimate that we would get without sniffing, based on the average number of rows per value, would probably be around 100. It is extremely unlikely that such small changes in the estimate would cause a big difference in the execution plan. And in the rare cases where they do have an effect, see the below discussion on queries that are affected.

The only way for a query to fall in the “affected” category is when the column has a skewed data distribution. (Or, more rarely, when values occur at a frequency right around the threshold between two alternative plans.) In that case, there will be at least two values (A and B) for which a different plan is optimal. And here’s the rub. When the value A is sniffed, the plan is cached, and the procedure is then executed with B, the cached plan will be sub-optimal for that value. And vice versa. So the parameter sniffing in this case will always be bad. (You might argue that it is possible that the plan for A and the plan for B might both be better than the plan for the generic average rows per value; I’d have to agree that in theory this might be possible, but I have never run into, nor been able to construct, such a query).

Good parameter sniffing

Now that I have shown how parameter sniffing is usually irrelevant and sometimes bad, you might wonder why the feature exists at all. The answer is that there are some query patterns where parameter sniffing does actually help – and one of them is common enough that I would never recommend removing this feature from the product.

Lots of tables in lots of databases collect data over time. Orders keep being added in sales systems, banks keep processing transactions, and sensor data continues to pour into a table of readings. Those tables almost always have a column that timestamps the row (e.g. OrderDate, TransactionDate, MeasurementDate). And you will often find stored procedures that allow the user to select a recent subset of data by passing in a @ThresholdDate parameter – so that the results can be limited to e.g. the last hour, the last week, or the last 15 days. So the query in the stored procedure will filter on, for instance, WHERE OrderDate >= @ThresholdDate.
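A minimal sketch of that pattern, using AdventureWorks’ Sales.SalesOrderHeader as a stand-in for such an Orders table (the procedure name is illustrative):

CREATE PROC dbo.GetRecentOrders
            @ThresholdDate datetime
AS
SELECT  SalesOrderID, OrderDate, CustomerID, TotalDue
FROM    Sales.SalesOrderHeader
WHERE   OrderDate >= @ThresholdDate;
GO

-- e.g. everything from the last 15 days:
DECLARE @Since datetime = DATEADD(day, -15, GETDATE());
EXEC dbo.GetRecentOrders @ThresholdDate = @Since;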

This is the type of stored procedure that will usually benefit hugely from parameter sniffing. The predicate uses inequality. Without sniffing the value, that type of filter is one of the hardest to get right for the cardinality estimator – the value can be November 3rd, 1848, or the Year 2525, or anything in between – and so the number of matching rows can be anything from the entire table to nothing. The cardinality estimator simply uses an estimate of 30% matching rows for this case. If the stored procedure is typically called to fetch just the last few days of data out of a table containing eight years of history, then that estimate is going to be horribly wrong.

This is one case where parameter sniffing really shines. It does not matter whether the sniffed value will be for a one-hour period, a one-day period, or a two-week period; in all these cases the plan based on that value will probably be the same, and definitely better than the plan you’d get for a 30% estimate. And if the cached plan is for two weeks and you next run the query for one hour, you’d still get the better performance. This single case is so common, and can have such a huge impact on performance, that this use case on its own is already sufficient for me to defend the process of parameter sniffing – in spite of the occasions where it hurts, and the many cases where it’s irrelevant.

Another case of good parameter sniffing, far more esoteric, is if you have just the right combination of both a skewed data distribution and a skewed query pattern. Let’s return to Grant Fritchey’s example based on cities. This data distribution is skewed because there simply are far more people living in a city such as New York or Tokyo than there are in small villages such as Hum or Adamstown, so a stored procedure that filters on City tends to be subject to bad parameter sniffing. But what if you are working for a university and the researchers are working on a project involving only the inhabitants of very small villages? In that case, when the optimizer sniffs a value, it will always be a small village and only a few rows will be returned, and again the plan based on any possible sniffed value is always going to be better for every value that is typically passed in than a non-sniffed plan would be.

When good turns bad

You have to be aware, though, that good parameter sniffing, bad parameter sniffing, and irrelevant parameter sniffing, are all the same mechanism. The parameter sniffing remains the same, it’s the circumstances that make it go good or bad. And that’s something to keep in mind especially with code that relies on good parameter sniffing.

Let’s look once more at the Orders table with eight years of history, that is only ever queried for the most recent day, week, or two-week period. What happens if, one day, someone does actually submit a report to do some trend analysis on the last 5 years? The best and most likely result, in this case, would be that the cached plan, based on a previously sniffed parameter for a short period, is used. In that case, the report would be unbearably slow – the plan for the short period probably involves Nested Loop joins and Index Seeks, and running a few billion rows through that can take hours, or even days. The submitter of that query would suffer from bad parameter sniffing. But a worse scenario is also possible. If the previous plan has just been invalidated, a new plan will be created based on a sniffed value of five years ago, and now the plan would probably change to use scans. All other calls to the stored procedure will now start to use this plan as well, so that all those requests for short-term periods now run longer. The increased number of table scans will also start to affect overall system performance: the buffer cache fills up whenever the table is scanned, so all other activity on the system has to reread its data from disk instead of finding it in cache. In short, the overall performance of the system as a whole would slow down, and it might take a long time to find the actual cause – and all that time, you would be losing sales because the web shop is now too slow for your impatient customers.

Even in a “good parameter sniffing” scenario, you should still be aware of the potential of bad parameter sniffing occurring, and take appropriate action to try to prevent it.

Dealing with bad parameter sniffing

If parameter sniffing is sometimes good, sometimes bad, and most often irrelevant, then the obvious question is: “how can I get the good without having to accept the bad as well”? And the good news is that this is possible. There are a lot of ways that allow you to deal with parameter sniffing, and I will briefly describe the options. (Note that most of them are described in much more detail elsewhere on the internet).

The most brutal way of preventing bad parameter sniffing is to disable it completely. I do not like this method. To me, it sounds like burning down your house to prevent it from being burgled. But if you insist, you can: activating trace flag 4136 will disable all parameter sniffing (good, bad, and irrelevant) on the entire instance. And yes, it is documented and supported.
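For completeness, enabling it instance-wide looks like this:

-- Disables parameter sniffing (good, bad, and irrelevant) for the whole instance.
-- Documented and supported, but rarely the right answer.
DBCC TRACEON (4136, -1);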

A popular technique that is mostly found in old articles is to use local variables – this used to be one of the best options before SQL Server 2008. Declare a variable in the stored procedure for each parameter, assign it the value of the parameter, and then use that variable instead of the parameter in the rest of the code. The result is that SQL Server has to optimize the queries based on the variable, which it cannot sniff, so you always get a plan based on the generic data distribution rather than any sniffed value. I recently saw a recommendation somewhere to always do this for every parameter in every stored procedure – which has the same effect as using trace flag 4136, but with a lot more work. Seeing that recommendation was one of the things that triggered me to write this post. If you still need to support SQL Server 2005 or older, and if you only use it for specific parameters in specific stored procedures that are known or expected to get bad parameter sniffing, then I guess that using the local variable method can be okay. Using it in newer versions, or using it as a generic guideline for all parameters and stored procedures, is not correct.
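Applied to the earlier GetAddressByCity example (assuming the procedure still exists; the demo script above drops it at the end), the local variable pattern looks roughly like this. It is a sketch of the technique, not a recommendation:

ALTER PROC dbo.GetAddressByCity
            @City nvarchar(30)
AS
DECLARE @LocalCity nvarchar(30);
SET @LocalCity = @City;            -- copy the parameter into a local variable

SELECT  a.AddressID, a.City, a.PostalCode
FROM    Person.[Address] AS a
WHERE   a.City = @LocalCity;
-- @LocalCity cannot be sniffed, so the plan is based on the generic density of City.
GO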

Since SQL Server 2008, we can use OPTION (OPTIMIZE FOR (@parameter UNKNOWN)) to force SQL Server to disregard any sniffed value and optimize the query as if the value for @parameter is not known. This has the same effect as using a local variable, except it does away with the extra overhead and you can limit it to just a single query in a multi-statement stored procedure. For that reason, I prefer this method over local variables. This method works very well in situations where the plan based on generic statistics is good enough for all possible cases. These tend to be cases where irrelevant and bad parameter sniffing can occur. In situations with good parameter sniffing, the generic plan you get with this option tends to be not good enough.
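Applied to the same example (again as a sketch), the hint version does away with the local variable:

ALTER PROC dbo.GetAddressByCity
            @City nvarchar(30)
AS
SELECT  a.AddressID, a.City, a.PostalCode
FROM    Person.[Address] AS a
WHERE   a.City = @City
OPTION  (OPTIMIZE FOR (@City UNKNOWN));  -- disregard the sniffed value for @City only
GO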

Already since SQL Server 2005, you could use OPTION (OPTIMIZE FOR (@parameter_name = ‘value’)) to force SQL Server to optimize as if the value provided was sniffed. The risk of this method is that people tend to forget to maintain the value in the hint. For example in the case of retrieving the latest rows from the Orders table, this hint with a value of October 20, 2016 might work very well at this time – but if I forget to regularly update the code with a newer value, then that same hint will actually start to hurt performance in a terrible way after a few years, a year, or perhaps already after a few months. (And unfortunately, I have seen code in similar situations where a hint like this, for a datetime column, had not been changed for over three years!)
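A sketch of this hint, using Sales.SalesOrderHeader to stand in for the Orders example, with the October 20, 2016 value from the text:

DECLARE @ThresholdDate datetime = '20161020';

SELECT  SalesOrderID, OrderDate, CustomerID, TotalDue
FROM    Sales.SalesOrderHeader
WHERE   OrderDate >= @ThresholdDate
OPTION  (OPTIMIZE FOR (@ThresholdDate = '20161020'));
-- Optimized as if '20161020' were the sniffed value. Fine today; a time bomb
-- if nobody remembers to keep the hard-coded date current.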

Perhaps the best solution, depending on scenario, is to use multiple stored procedures, with one master to determine which one to execute. This can work for scenarios such as the Orders table (with one procedure to search in only recent data and another to search for longer time periods, and a master calling the right one), or for generic search procedures with a limited number of possibilities (with two optional search arguments you have four possible combinations and you can create four procedures; with five parameters you have 32 possible combinations and you do not want to create that many almost similar procedures – but you could check how many combinations are actually really used). It will not work in situations with skewed data, unless you happen to know all the outlier values.
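A sketch of the master procedure idea for the Orders scenario; dbo.GetOrders_Recent and dbo.GetOrders_History are hypothetical procedures that would each contain the same query:

CREATE PROC dbo.GetOrders_Master
            @ThresholdDate datetime
AS
-- Route short and long periods to separately compiled procedures, so each
-- gets (and keeps) a plan compiled for its own typical range.
IF @ThresholdDate >= DATEADD(day, -31, GETDATE())
    EXEC dbo.GetOrders_Recent  @ThresholdDate = @ThresholdDate;
ELSE
    EXEC dbo.GetOrders_History @ThresholdDate = @ThresholdDate;
GO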

Another possible solution that works especially well for the optional search case is to use OPTION (RECOMPILE). As shown in my previous post, this enables SQL Server to sniff the variable. And since this is a different process (because the optimizer does not have to cater for later reuse of the same plan with different values), this can give you tremendous benefits, even up to the optimizer removing entire joins from the plan if they are not needed for the current set of values. But there is a price to be paid: every time the stored procedure executes the query has to be compiled, which is a relatively expensive process. Depending on how often the procedure executes and how complex the query is, just the CPU and memory cost of the recompilations could cause serious issues to the overall performance of your server. My recommendation for this hint is to definitely use it where it makes sense, and equally definitely not use it anywhere else.
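For the optional search procedure sketched earlier, that simply means adding the hint to its query (sketch):

ALTER PROC dbo.GetAddressByCityOrPostalCode
            @City       nvarchar(30) = NULL,
            @PostalCode nvarchar(15) = NULL
AS
SELECT  a.AddressID, a.AddressLine1, a.City, a.PostalCode
FROM    Person.[Address] AS a
WHERE   (a.City = @City OR @City IS NULL)
AND     (a.PostalCode = @PostalCode OR @PostalCode IS NULL)
OPTION  (RECOMPILE);   -- recompiled on every execution, with the current values sniffed
GO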

And finally, specifically for the procedure with optional search arguments, you could change the stored procedure to use dynamic SQL. This can be very dangerous when not done appropriately, but if you really know what you are doing it can be a viable alternative. You would have to construct a query string that contains only the predicates (and preferably also only the joins) that are needed based on the parameters passed in. You should not hardcode the parameter values themselves in the query string, nor allow any other form of user input to find its way into the query string. So the conditions in the dynamically generated query string would still be of the form WHERE Column1 = @Param1 AND Column3 = @Param3 (if only @Param1 and @Param3 have a value), and you should then use sp_executesql to execute the SQL and pass in the correct parameter values. For the parameter sniffing process, using sp_executesql is exactly the same as calling a stored procedure, but now with the query string instead of the procedure name as the “owner” of the plan. The values will be sniffed, a plan will be created and stored in cache, and the plan will be reused if the same query string is executed again – which in this case is okay, because that means the same combination of set and not set search parameters was used. For a stored procedure with five optional parameters, this might eventually result in all 32 possible permutations of the query being in the plan cache, each with a plan that is optimal for that permutation – but without you having to write all 32 possible versions of the query. However, be aware of the danger of SQL injection, and be aware that there may be issues with permission settings if you use dynamic SQL. I recommend Erland Sommarskog’s excellent articles on dynamic search conditions and on using dynamic SQL if you consider using this method in your production code.
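A condensed sketch of that approach for the same two optional parameters (the procedure name is made up, the pattern is simplified; read Erland’s articles before using this in production):

CREATE PROC dbo.GetAddressByCityOrPostalCode_Dynamic
            @City       nvarchar(30) = NULL,
            @PostalCode nvarchar(15) = NULL
AS
DECLARE @SQL    nvarchar(max),
        @Params nvarchar(500) = N'@City nvarchar(30), @PostalCode nvarchar(15)';

SET @SQL = N'SELECT a.AddressID, a.AddressLine1, a.City, a.PostalCode
FROM Person.[Address] AS a
WHERE 1 = 1';

-- Append only the predicates that are needed; never concatenate the values themselves.
IF @City IS NOT NULL
    SET @SQL += N' AND a.City = @City';
IF @PostalCode IS NOT NULL
    SET @SQL += N' AND a.PostalCode = @PostalCode';

-- Parameters that the generated query does not reference are simply ignored.
EXEC sp_executesql @SQL, @Params,
     @City = @City, @PostalCode = @PostalCode;
GO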

How about the other sniffings?

In my previous post, I described three forms of sniffing. In this post on good, bad, or irrelevant, I have so far only focused on parameter sniffing (which includes, as briefly mentioned above, the use of sp_executesql). So what about variable sniffing and cardinality sniffing?

To start with the latter: since cardinality sniffing is really just a special case of parameter sniffing, all I wrote above applies equally to cardinality sniffing. It can be good (in some cases), it can be bad (in other cases), and it will in most cases probably have no impact. When it is bad, then OPTION (RECOMPILE) is probably going to be your only useful instrument, and if that causes too much compilation overhead you might have to rearchitect your solution to avoid this problem. If you are in charge of a database that uses table-valued parameters, and after reading this post you realize that you are benefiting or suffering from cardinality sniffing, then I’d love to hear from you in the comments section!

For variable sniffing, none of the above applies. Since variable sniffing only occurs when you use OPTION (RECOMPILE), the plan will always be optimized for the current values, and it will never be reused. There is no possibility at all of bad variable sniffing happening, since the whole concept of bad sniffing is that a plan created for one value is reused for another value, which simply never happens with variable sniffing. In other words, variable sniffing will only ever be irrelevant or good, never bad – but it will also always be associated with the cost of OPTION (RECOMPILE), since that is the only condition that triggers it.

Conclusion

In the previous post, I described three forms of sniffing: the relatively well-known parameter sniffing, the lesser known variable sniffing, and the hitherto undocumented cardinality sniffing. In this post I tried to prove that both the doomsayers who claim that parameter sniffing is the Absolute Evil (TM), and the fanboys who appear to be in total love with parameter sniffing, are incorrect.

Parameter sniffing is actually usually totally irrelevant. But it still has a reason to exist, because in at least one of the cases where it does make a difference, it can provide a huge performance benefit. That does mean, though, that we will also have to accept a few cases where it can hurt.

Because the pain caused by bad parameter sniffing can be immense, I also walked through some of the more common fixes for these problems, describing their benefit and cost and explaining when you would or would not use them. I then also explained how bad, good, and irrelevant sniffing applies to variable and cardinality sniffing.

I hope that these posts help you understand what sniffing is, how it can help or hurt, and how you can fix the latter without losing the former.

The sniffing database

Your SQL Server instance, like a person with hay fever who forgets to take their antihistamines during summer, is sniffing all the time. Sniffing is a trick employed by the optimizer in an attempt to give you better execution plans.

The most common form of sniffing is parameter sniffing. Many people know about parameter sniffing, but there are a lot of misconceptions about this subject. I have heard people describe parameter sniffing as a bad thing, and I know people who claim that parameter sniffing is mostly good with some exceptions that they then call “bad parameter sniffing”. Neither of these statements is true; in reality parameter sniffing (and the other forms of sniffing) are sometimes good, sometimes bad, and very often irrelevant. I will explain this in more detail in a later post – this post focuses on explaining what parameter sniffing actually is, and on what other forms of sniffing exist.

Yes, other forms of sniffing. Under the right conditions, SQL Server will also sniff variables and cardinalities. Most SQL Server developers and DBAs appear to be blissfully unaware of variable sniffing, and the few speakers and authors that do mention it tend to get it wrong. And cardinality sniffing is, as far as I know, completely undocumented. I have mentioned it a few times in some of my presentations, but never written about it – and I have never seen or heard anyone else describe this unique type of sniffing.

Parameter sniffing explained

To understand parameter sniffing, you have to know a bit about how SQL Server compiles queries into execution plans. When a query batch is submitted (either through an ad-hoc query tool such as Management Studio or the sqlcmd utility, or submitted from a client application through e.g. the ADO.Net library or JDBC), SQL Server will first try to avoid the (expensive) compilation process: it checks the plan cache to see if the same batch has been executed before and its plans are still available. If that is not the case, then SQL Server will parse the entire batch, compile execution plans for each of the queries in the batch, and store them in the plan cache. After that, all of the plans for the batch (either taken from the plan cache, or compiled, stored in the plan cache, and then taken from it) are executed, in sequence or out of sequence as dictated by control-of-flow logic in the batch.

While compiling the query, the optimizer uses statistics about the data in the tables to estimate how many rows will satisfy any given condition. A condition such as “WHERE Age = 42” on its own is pretty broad. When you understand the data it operates on, your perception of the condition will change: in a database on men in a mid-life crisis, odds are that a rather high percentage of the rows will match; the same condition in the student database of a community college should generate at most a handful of hits. This is reflected in the statistics that the optimizer uses, so the same query condition can result in different plans depending on the data distribution.

When the condition uses a variable (e.g. “WHERE Age = @Age”), then SQL Server cannot use the statistics in the same way. When the query is optimized, the optimizer knows the data type of the variable (because the parser has processed the DECLARE statement), but not the value, because it has not been assigned yet; the assignment occurs when the batch executes, after the optimization process. The optimizer will still use some statistics, but not for a specific age; instead it looks at the number of distinct values used and assumes that the data is evenly distributed. So for a kindergarten database, the number of distinct values for Age would probably be three (5, 6, and 7), and the optimizer would assume 33% matching rows for any value of @Age passed in; for the US census database that same Age column would have over 100 distinct values, and the optimizer would estimate that less than 1% of the rows will match the condition for any value of @Age.

A parameter looks for most purposes exactly like a regular variable. The difference is that a parameter is declared in the header of a separately executable code unit: a stored procedure, scalar user-defined function, or multi-statement user-defined function. (And since you should as a rule not use the latter two, I’ll use a stored procedure for my example). The optimizer treats the body of a stored procedure like an ad-hoc batch: when the procedure is invoked the plan cache is first checked, and when no cached plan for the procedure is found it is generated and then stored for future reuse. The key difference is how parameters are treated. To see this in action, run the below script in the AdventureWorks2012 sample database (though it probably also works in other versions of AdventureWorks), with the option to show the actual execution plan enabled. It contains four batches, to create a sample stored procedure, invoke it twice, and then drop the procedure again. The discussion below will focus on the second and third batches, with the two EXEC statements.

CREATE PROC dbo.ParameterSniffingDemo
            @ProductID int
AS
SELECT  SUM(OrderQty)
FROM    Sales.SalesOrderDetail
WHERE   ProductID = @ProductID;
GO

-- Run the procedure, then check the execution plan.
EXEC dbo.ParameterSniffingDemo @ProductID = 898;
GO

-- Run the procedure again, for a different product.
EXEC dbo.ParameterSniffingDemo @ProductID = 897;
GO

-- Clean up
DROP PROC dbo.ParameterSniffingDemo;
GO

The second batch in the query above, like any other batch, is first parsed and compiled. It is important to be aware that only the batch itself is compiled at this time. Once the compilation is done, SQL Server executes the EXEC statement: it sets the parameter value to 898 and then passes control to the stored procedure. Since the procedure was just created, there is no plan in cache yet, so at this time the compiler is once more invoked to generate an execution plan for the procedure. If you read this sequence of events carefully, you will realize that, unlike “normal” variables, the value of parameter @ProductID has been set before the compiler is invoked. The optimizer can read this value to use the specific statistics for ProductID 898 instead of the generic statistics it would use for a normal variable, to create an execution plan that is optimal for this “sniffed” value.

The third batch invokes the stored procedure again, with a different value passed into it. The batch itself is different from the second batch, so this batch, too, will be first compiled and then executed. And again control will pass to the stored procedure when execution starts, after setting the parameter to its value. But now the compiler will see that there already is an execution plan available for the stored procedure in the plan cache, so instead of invoking the (relatively expensive) optimization process again, it will reuse the existing plan.

[Image: execution plan of the third batch]

The picture above shows the execution plan of the third batch (the second execution plan in the output, as the CREATE PROC and DROP PROC statements do not generate any actual execution plans). Evidence of the parameter sniffing process is present in the plan, but not directly visible in the graphical representation. To see the evidence, you will have to right-click the top left SELECT operator, and then click “Properties” from the context menu. This brings up the full list of properties, as shown below:

[Image: properties of the SELECT operator, with the Parameter List section expanded]

The “Parameter List” section is collapsed by default; I have already expanded it in the screenshot above. This is where you can see exactly what happened: there is one parameter in this query; it had a value of 898 (“Parameter Compiled Value”) when the plan was compiled and stored in cache, and during this specific execution the value of the parameter was 897 (“Parameter Runtime Value”).

Variable sniffing

Above, I explained the general process of SQL Server compiling batches and entire stored procedures fully before execution starts; (non-parameter) variables do not have a value at compile time, so the optimizer can only use the more generic statistics. I left out a more specific part: statement-level recompiles.

SQL Server may discard old execution plans and restart the compilation process for various reasons. These reasons fall generally in one of three categories: correctness (any change that might cause the existing plan to fail or to return incorrect data, like dropping an index that might be used in the plan, modifying a table, changing a constraint that might have enabled a plan simplification, etc); performance (any change that might enable SQL Server to find a much faster plan, like adding an index, rebuilding statistics, adding constraints, etc); or because you tell it to (by explicitly clearing the plan cache, or by using hints).

Some of the reasons for recompilation can occur while a multi-statement batch or stored procedure executes. SQL Server detects that before the next statement starts, and then that statement only will be recompiled. This leads to a rather unique situation: execution has started, so now the variables do have a value, which enables the optimizer to sniff that value just as it sniffs parameters. Except … well, it doesn’t. At least not always. In fact, there is only one very specific case where SQL Server will sniff the variables: when a statement-level recompile occurs due to an explicit hint.

Here is a very simple example, again using AdventureWorks2012 (and again, probably working in other versions of AdventureWorks as well):

DECLARE @Variable varchar(20);
SET @Variable = 'B';

SELECT  FirstName, LastName, Title
FROM    Person.Person
WHERE   LastName < @Variable
OPTION (RECOMPILE);

SELECT  FirstName, LastName, Title
FROM    Person.Person
WHERE   LastName < @Variable;

If you copy this query and (without first running it!) request an estimated execution plan, you will see the plan the optimizer creates when first compiling the batch:

[Image: estimated execution plan, showing the same plan for both queries]

While the batch is being compiled, the optimizer does not know the value of @Variable so it has to use a broad assumption – in the case of this type of inequality filter that assumption is that 30% of the rows will match, so the plan for the queries is based on an estimate of almost 6000 rows. The RECOMPILE hint on the second query does not affect the initial compilation of the batch; both queries get the exact same plan.

If you then actually execute the batch, with the option to include the actual execution plan enabled, you will see this:

[Image: actual execution plans; the plan for the first query has changed]

The first plan has changed. The RECOMPILE hint forces SQL Server to recompile this query, and because this is a statement-level recompile, the SET statement has already executed. SQL Server can sniff the value ‘B’ in @Variable, use that to estimate that only about 900 rows will match, and then generate a plan that is optimized for that lower number of rows. You can play around with the value in @Variable to see that the plan used for the first query can actually change depending on this value.

The plan for the second query will never change. There is no RECOMPILE hint, and there were no other reasons for SQL Server to recompile this statement, so this query will always use the plan that was generated when the batch was compiled.

Fun fact: if you request an estimated execution plan after running the query, you will get the last plan that was actually used. That is because the statement-level recompile will update the plan cache entry for this batch with the new execution plan. The requested estimated execution plan will be presented to you from this plan cache entry. When you actually run the batch again, the initial compilation of the batch will also use the cached entry without redoing the full compilation process, but at run-time the RECOMPILE hint will still trigger a recompilation of the first query only – which will now result in the same plan again, unless you modified the data in the Person.Person table. This is a very important consideration. A lot of people think that adding OPTION (RECOMPILE) to a query results in the execution plan not being cached; this is not true. Those queries still result in the same memory footprint on the plan cache, with the additional cost of not only compiling the query whenever it runs, but also updating the plan in the cache on every execution.
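You can check this yourself with the standard plan cache DMVs after running the batch above; for example:

-- The batch containing the OPTION (RECOMPILE) hint is still present in the plan cache.
SELECT  cp.usecounts, cp.cacheobjtype, cp.objtype, st.[text]
FROM    sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE   st.[text] LIKE N'%LastName < @Variable%'
AND     st.[text] NOT LIKE N'%dm_exec_cached_plans%';   -- exclude this query itself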

Another interesting observation can be made when you right-click the SELECT operators on the two plans above and look at the properties. The second plan, which was not recompiled and sniffed, does have a Parameter List section in the plan, but in this case there is only a Parameter Runtime Value, no Parameter Compiled Value:

[Image: Parameter List of the second plan, showing only a Parameter Runtime Value]

(The term “parameter” is confusing; within the context of execution plan properties SQL Server uses this term for every variable that is used in the query, regardless of whether they are actual parameters, or “normal”, non-parameter variables).

For the first plan, the properties of the SELECT operator do not include a Parameter List section at all, which is even more confusing – this was the section where we saw evidence of parameter sniffing, and now that a variable is sniffed there is no evidence at all! And how can SQL Server even produce the correct results without embedding the variable and its runtime value in the plan? To get an answer to that question, we’ll have to look at the properties of another operator, the Index Seek (NonClustered) at the far right. And in this case we do not even have to open the full property list, we can just hover our mouse over the operator and wait for the tooltip window to pop up:

[Image: tooltip of the Index Seek (NonClustered) operator]

This shows that the plan that is used for the statement with the OPTION (RECOMPILE) hint does not reference @Variable at all. What actually happened is that, during the statement level recompile, the parser inspected the current value of @Variable and then replaced all references to this variable in the query (or rather, in the internal representation of the query) by the value. The input that was finally received by the optimizer was exactly the same as if we had submitted the query below:

SELECT  FirstName, LastName, Title
FROM    Person.Person
WHERE   LastName < N'B'
OPTION (RECOMPILE);

For “normal” parameter sniffing, this would not be a safe implementation. The next call to the stored procedure could pass a different parameter value, and if the sniffed value were hardcoded in the plan, then the results would obviously be incorrect. That’s why a plan with parameter sniffing must keep the parameter; it optimizes for the sniffed value, but will also produce correct results for other values. In the case of variable sniffing, there is no need to keep the variable as a variable in the execution plan. A cached plan will only be reused if there is a full textual match on the full query text, which includes the OPTION (RECOMPILE) hint – and that option guarantees that the cached plan will be overwritten with a new plan rather than be reused. So the choice to implement variable sniffing differently, by directly injecting the values into the plan, is safe.

But that same safe choice also results in a loss of variable sniffing in other scenarios. When a statement-level recompile occurs for a different reason, for instance because enough rows were added to a table to trigger an automatic update of the statistics on that table, then no sniffing is done. This makes sense when you consider what would otherwise happen. Suppose I have a batch and the third statement uses a variable; just before it executes the threshold for automatic statistics update is hit so new statistics pop up and the optimizer recompiles the query. If it would sniff the variable, it would hard-code the sniffed value in the plan, then replace the old cached plan with the new plan. Next time I execute the same batch, the variable can have a new value – but the text of the batch has not changed, and it is unlikely that there will be a statement-level recompile for another reason. So the updated plan from the cache, that now includes a hard-coded value, would be used – but that value is no longer correct and wrong results would be returned. That is of course not acceptable, and for that reason SQL Server will only sniff variables when a statement-level recompile is forced by the OPTION (RECOMPILE) query hint.

In theory, it would have been possible for Microsoft to implement a second form of variable sniffing, using the same method as for parameter sniffing. In reality, that is not the choice Microsoft made. In order to ensure that a statement-level recompile that is not caused by OPTION (RECOMPILE) produces a “safe” plan, the variables are simply not sniffed. The new plan will be based on the new statistics, but it will still use the generic estimates.

Cardinality sniffing

Table variables behave in many ways exactly the same as normal variables. This includes the optimization process. So when a batch or stored procedure is submitted, the parser interprets the declaration of the table variable and then compilation starts without any data being in the table variable. Based on actual reality at the time of compilation, one might expect an estimate of zero rows. However, the optimizer always assumes at least one row (unless the query includes a filter that cannot possibly be true). If you run the script below, you will see that the first two SELECT queries both have an estimated rowcount of 1, even though the actual rowcount is 0 for the first query and 19,972 for the second.

DECLARE @TabVar table
       (PersonID int NOT NULL PRIMARY KEY);

SELECT  PersonID
FROM    @TabVar;

INSERT  @TabVar
SELECT  BusinessEntityID
FROM    Person.Person;

SELECT  PersonID
FROM    @TabVar;

SELECT  PersonID
FROM    @TabVar
OPTION (RECOMPILE);

In the case of the simple queries above, this huge mistake in cardinality estimation does not affect the execution plan. But in more complex queries this can often result in very slow running queries. This is one of many reasons why most people tend to prefer temporary tables over table variables unless they know for sure that there will never be more than a handful of rows involved. This problem can be avoided by adding OPTION (RECOMPILE), as shown in the last SELECT above. This forces a statement-level recompile and now the actual number of rows in @TabVar can be used by the optimizer. This can often help prevent execution plans with terrible performance, but at the price of a full compilation for the statement every time it executes.

All of the above is pretty well known and described at many places; I only include this basic information because it is important to understand what “cardinality sniffing” is. Cardinality sniffing is related to parameter sniffing, because it only relates to parameters passed into stored procedures (and other executable code modules). It is also related to table variables. In fact, cardinality sniffing is specifically related to table-valued parameters – table variables passed as a parameter into a stored procedure. Here is an example:

-- Define a type for the table variables
CREATE TYPE TabType AS TABLE
       (PersonID int NOT NULL PRIMARY KEY);
GO

-- Create a stored procedure that uses the table variable
CREATE PROC dbo.SniffCardinality
       @InputTable TabType READONLY
AS
SELECT  PersonID
FROM    @InputTable;
GO

-- Create and populate a table variable
DECLARE @TabVar AS TabType;

INSERT  @TabVar
SELECT  BusinessEntityID
FROM    Person.Person;

-- Invoke the stored procedure the first time
EXEC dbo.SniffCardinality @TabVar;

-- Now remove some of the rows from the table variable
DELETE  @TabVar
WHERE   PersonID < 20000;

-- Invoke the stored procedure again
EXEC dbo.SniffCardinality @TabVar;
GO

-- Clean up
DROP PROC dbo.SniffCardinality;
DROP TYPE TabType;
GO

If you run the batch above with the option to include the actual execution plan enabled, you will see a bunch of plans for all the queries in the last batch. The second and fourth are the executions of the stored procedure. Previously when we ran the same SELECT statement in a batch, the estimated number of rows for the table variable was 1. Now the estimates are different, as shown in the picture below (note that I edited the picture to remove irrelevant execution plans and include two tooltip windows):

[Image: execution plans of the two procedure calls, with tooltips showing the estimated number of rows]

As you see, the first estimate is exactly correct. That is because the table-valued parameter is used by the optimizer just as it uses other parameters. So when the stored procedure is first executed and no plan for it is in cache yet, it will sniff the table-valued parameter just as it sniffs other parameters. Now it is important to realize that this sniffing is limited. The optimizer will not read data from the table variable to get insight on the data in it. It only sniffs the metadata that is available at the time of compiling the plan. And because table variables do not have statistics, that metadata is limited to only the number of rows in the table variable that is passed into the stored procedure for the first execution.

The second estimate is wrong. After the first call to the stored procedure we deleted most of the rows from the table variable, but the execution plans used for the stored procedure are all still optimized for the original row count. That is because the execution simply reused the previously compiled plan from the plan cache. A change in the cardinality of the table-valued parameter, like a change to the value of a regular parameter, is no reason to recompile the execution plan for the stored procedure.
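If this stale estimate does cause problems, the usual recompile workarounds apply to table-valued parameters as well. A minimal sketch (my addition, not part of the demo above):

-- Recompile the entire procedure for this one execution, so the current
-- row count of @TabVar is sniffed again:
EXEC dbo.SniffCardinality @TabVar WITH RECOMPILE;

-- Or, inside the procedure body, recompile just the affected statement:
-- SELECT  PersonID
-- FROM    @InputTable
-- OPTION  (RECOMPILE);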

I already mentioned that cardinality sniffing has, to the best of my knowledge, never been documented before. That is not really surprising. I have yet to see my first table-valued parameter in “real” client code, and I hear almost nobody ever talk about them, so the feature appears to be pretty niche. However, when you happen to work on a system that does use table-valued parameters, then you might be faced with cardinality sniffing and its possible performance consequences. (And if not, then you can at least add “Hey, would you like to sniff my cardinality?” to your collection of useless pickup lines).

Conclusion

In this post, I first elaborated a bit on the relatively well-known concept of parameter sniffing, giving some background information on the internals at play in this process. I then went on to explain the much lesser known concept of variable sniffing, and proved that (contrary to the popular opinion among those who do talk or write about this concept) it actually uses quite different internal mechanisms from parameter sniffing, and that for this reason variable sniffing only happens when a recompile is explicitly forced.

In the last section I then introduced a previously unknown form of sniffing: cardinality sniffing. In a process similar to parameter sniffing, the number of rows in a table variable can be sniffed when it is passed as a table-valued parameter into a stored procedure.

In the opening paragraph I already hinted that parameter sniffing, though well-known, is also very misunderstood. It is neither “bad” (as suggested by many experts), nor “mostly good with an occasional exception” (as suggested by most others), but actually “sometimes good, sometimes bad, and mostly irrelevant”. I will explain this in my next blog post on this subject.

Upcoming presentations. And an offer.

It’s not my habit to announce my future presentations on this blog. But I sometimes make exceptions to this rule, and today is a good day for such an exception. Especially since I have been allowed to offer some money off for one of my upcoming presentations.

I’ll fly to Copenhagen tomorrow, for SQL Saturday Denmark, where I will present two sessions: a 60-minute talk (“T-SQL User-Defined Functions, or: Bad Performance Made Easy”) about how T-SQL user-defined functions can wreck your performance, and a 10-minute lightning talk (“Managing Execution Plans”) about the similarities between execution plans and my coworkers.

Three weeks later, on October 8, I will be in Germany for SQL Saturday Munich, where I will once more present the “T-SQL User-Defined Functions, or: Bad Performance Made Easy” session – so if you cannot make it to Denmark in time, feel free to sign up for Munich!

Late October, you can find me in Seattle at the 2016 PASS Summit. If you’ve never heard of this conference, then you are really missing out on something. It’s three days (or five if you include pre-cons) packed full of literally hundreds of sessions delivered by high-end speakers and Microsoft staff, access to Microsoft support staff, and a chance to network with thousands of your peers. Plus, it’s also a lot of fun. And I am very happy to say that I have been selected to present two sessions this year. On October 27, I will deliver the lightning talk “Managing Execution Plans”, in a lightning talk session that also features Rob Volk, Russ Thomas, and Wayne Sheffield. And on October 28, I will present a 75-minute deep-dive session, “Now Where Did THAT Estimate Come From?”, on how SQL Server uses statistics and assumptions to estimate how many rows will be returned by a query, or by intermediate steps in the execution plan.

And your last opportunity to see me this year will be in Orlando, December 5-9. I will be speaking at Live! 360, a single event that encapsulates six co-located conferences: AppDev Trends, Modern Apps Live!, Office & Sharepoint Live!, SQL Server Live!, TechMentor, and Visual Studio Live!. A single ticket gives access to all these conferences. I will of course speak in the SQL Server track. On the afternoon of December 8, I have been scheduled to present two back-to-back sessions (perhaps to ensure that you’ll really want to leave at the end of the day??). In “Powerful T-SQL Improvements that Reduce Query Complexity”, I will explain how the introduction and later improvements of the OVER clause have simplified code that used to be horrendous, and in “T-SQL User-Defined Functions, or: Bad Performance Made Easy”, I will once more explore the do’s and don’ts of T-SQL user-defined functions, this time with a bit of extra depth because I have 75 minutes available.

Save money on Live! 360

I of course hope to see all of you at all those conferences. But I understand that not everybody has the budget to afford them all. The two SQL Saturdays are easy, since they are free, but both the PASS Summit and Live! 360 are paid conferences. I absolutely believe that it’s money well spent, because you get an absurd amount of training for your money at those conferences – but it still has to come out of your pocket, or out of a training budget, and those tend to be finite.

In the case of Live! 360, I might be able to help out. A bit. The full five-day conference currently sells at $1,995 (Super Early Bird), will go up to $2,095 (Early Bird) after October 5, and then to $2,395 (Standard Pricing) after November 2. However, I have been allowed to offer readers of my blog a discounted price of $1,895, and that offer remains valid until just before the conference (or until the conference sells out, whichever comes first). The only thing you need to do if you want to make use of this offer is to enter the Priority Code LSPK43 when registering. Or you can simply use this link, which pre-populates the code for you so you cannot even forget it!

(Unfortunately, the discount cannot be applied to existing registrations or combined with other discounts. It is valid for the 5-Day Package only)

Shake hands?

With four conferences coming up in the next few months, I hope that a lot of my readers will find an opportunity to attend at least one of those. If you do, then please feel free to find me in between sessions, introduce yourself, and have a friendly chat.

The DIY guide for local-global aggregation

If you have read the title, you will not be surprised when I tell you that this blog is about a thing called “local-global aggregation”, and on how to do that yourself. So let’s start with the two obvious questions: what the heck is local-global aggregation anyway, and why the heck would you ever want to do it yourself?

What is local-global aggregation

The term local-global aggregation may sound scary, but it’s not that hard to understand. You might even already know the concept if you have ever heard the arguably most used explanation of parallelism. You have a jar of beans and need to count them. Instead of doing it all by yourself, you trick a few of your gullible friends into helping you, with the promise of a nice beer after an “easy task”. Each of them gets a few scoops of beans, counts them, and writes down the number they have. You collect the papers, sum up the numbers, and the result is the number of beans that were in the jar before you started scooping. Congratulations! You have just done a local-global aggregation.

A more technical explanation that wraps this example back to SQL Server (or, in fact, any data processing application) is that a set of data is divided into subgroups (your friends each getting their share of beans). Each of these subgroups is then aggregated on its own (your friends counting their beans); this is the “local” part of the aggregation, also called “partial” aggregation (a term that in my opinion actually better describes what is going on, but unfortunately “local” is the more commonly used term). The last step is to combine the separate local aggregation results into a single final result (you tallying up the numbers counted by your friends); the “global” aggregation step. Note that the overall amount of work is not less when using this pattern. However, because the work can now be spread over multiple workers, the task still finishes faster. On a single-threaded platform, it would not make sense to work like this.

The basic concept sounds simple, but it can get more complex depending on what type of aggregation you need done. What if you need to know the minimum and maximum weight of the beans in the jar? You could buy a few precision scales, gather your friends again, and have each of them start weighing beans to find the minimum and maximum in their share; you would then just select the lowest and highest weight from all their results to get your final result. But how about the average weight? You could once more get out the scales and gather your friends (assuming you still have any after making them weigh thousands of beans for a single beer in the previous exercise) and have each of them compute the average weight for their share of beans … but then what? How do you get the global average from the local averages? The problem is: you can’t. To understand why, here is an extremely simple example, using just two friends and three beans. Friend 1 gets two beans weighing 1.1 and 1.3 grams; friend 2 gets the third bean, which weighs 1.5 grams. You get two results from them: an average weight of 1.2 and an average weight of 1.5. Now how do you combine those two numbers to get the correct average of all beans, which is 1.3? Again, you can’t.

This does not mean that you can’t use local-global aggregation to get an average, it just means that you have to be smarter. You should have asked your friends to tell you the number of beans and their total weight. From those local aggregates, you could then compute the global aggregates, total weight and total number of beans (3.9 grams and 3 beans), and then compute the average by dividing those numbers (3.9 / 3 = 1.3 grams). So to do a correct local-global aggregation for an average aggregate, you have to get different aggregates at the local level, which can then be combined in the correct formula to get the aggregate you need. For an average, this is fairly simple. But once we start looking at statistical aggregate functions such as standard deviation and variance, the trick becomes more complicated. The basic trick remains the same (compute some other aggregates at the local level, then combine with the appropriate formula), but the formula itself becomes more complex. More on that later.
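Translated into T-SQL, the average trick looks like this. This is just a toy sketch with the three beans from the example hard-coded as local aggregates; the numbers are mine, not from any real table:

DECLARE @LocalAggs table
       (FriendID  int          NOT NULL,
        SumWeight decimal(9,3) NOT NULL,
        NumBeans  int          NOT NULL);

INSERT  @LocalAggs
VALUES  (1, 2.4, 2),    -- friend 1: beans of 1.1 and 1.3 grams
        (2, 1.5, 1);    -- friend 2: one bean of 1.5 grams

SELECT  SUM(SumWeight) / SUM(NumBeans) AS AvgWeight   -- (2.4 + 1.5) / (2 + 1) = 1.3
FROM    @LocalAggs;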

Local-global aggregation done by the optimizer

You usually don’t need to worry about local-global aggregation. Minds far greater than ours have done all the hard work and implemented this logic in the query optimizer, so everything works smoothly without us having to do anything special for it.

The most obvious place where you expect this to happen is in parallel plans. Here is a very simple example, based on the Microsoft Contoso BI Demo database which can be found here. (I normally prefer to base my examples on AdventureWorks, but the tables in that database are too small to get parallelism on simple example queries).

SELECT AVG(TotalCost)
FROM   dbo.FactOnlineSales;

The execution plan for this extremely simple query looks like this:

[Image: parallel execution plan with two Stream Aggregate operators and a Compute Scalar]

The query has just a single aggregation, but the plan has two Stream Aggregate operators. The rightmost Stream Aggregate, running in the parallel section, does the local aggregation. If you open the properties and look at the Defined Values property, you will see that it computes two aggregates: “partialagg1004=Count(*)” and “partialagg1006=SUM(TotalCost)” (I have simplified these slightly from the actual expressions in the plan). To the left, in the serial section of the plan, a second Stream Aggregate operator computes the global aggregates: “globalagg1005=SUM(partialagg1004)” and “globalagg1007=SUM(partialagg1006)”. So globalagg1005 is the total number of rows, and globalagg1007 is the grand total of all TotalCost values. The Compute Scalar operator then uses those two global aggregates to compute the average that the query actually requested: “Expr1003=CASE WHEN globalagg1005=0 THEN NULL ELSE globalagg1007 / globalagg1005 END” – in other words, total cost divided by number of rows, or NULL if the number of rows was zero.
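Purely as an illustration (this is my own rewrite of the plan’s logic, not something SQL Server needs you to write), the same structure spelled out in T-SQL would look roughly like this:

SELECT  CASE WHEN globalagg1005 = 0 THEN NULL
             ELSE globalagg1007 / globalagg1005
        END AS AvgTotalCost
FROM   (SELECT SUM(partialagg1004) AS globalagg1005,      -- global aggregation
               SUM(partialagg1006) AS globalagg1007
        FROM  (SELECT COUNT_BIG(*)   AS partialagg1004,   -- local ("partial") aggregation;
                      SUM(TotalCost) AS partialagg1006    -- in the real plan this runs once per thread
               FROM   dbo.FactOnlineSales) AS local_agg
       ) AS global_agg;

Of course, written like this there is only a single local group, so it does not buy us anything; in the parallel plan, each thread produces its own partial aggregates.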

Parallel plans are not the only place where the optimizer will use local-global aggregation. In some other queries it can use this as a trick to save work. See the query below for an example (this time I did base the query on AdventureWorks, because I want to have a serial plan).

SELECT     p.Color, SUM(sod.OrderQty)
FROM       Sales.SalesOrderDetail AS sod
INNER JOIN Production.Product AS p
      ON   p.ProductID = sod.ProductID
GROUP BY   p.Color;

The query requests an aggregation by colour, but because colour isn’t stored in the SalesOrderDetail table we need to join to the Product table. The most straightforward execution plan for this would look something like this (which I made up by using copy and paste from various other plans – we can’t all know all undocumented trace flags by heart):

[Image: made-up execution plan that joins all rows first and then aggregates by colour]

This made-up plan is a very straightforward implementation of the query: join every row from SalesOrderDetail to the matching row from Production.Product, then aggregate by colour and present the results. In this plan, the join has to operate on all 121,317 rows from SalesOrderDetail, which is estimated to contribute over a quarter of the total cost of the plan.

The execution plan that you actually get for the query above (when running on SQL Server 2012) looks like this:

[Image: actual execution plan with a Hash Match (Aggregate) before the join and a Stream Aggregate after it]

You will see two key differences: far less input to the join operator (just 266 rows), and two aggregation operators instead of just one, forming a local-global aggregation pattern. The Hash Match (Aggregate) operator to the right, in the lower branch, does the local aggregation, and the Stream Aggregate operator to the left does the global aggregation. I will not dive into the exact Defined Values properties for each of them, as they are pretty basic (since the query merely requests a SUM). In this case, it is far more interesting to look at the grouping level.

The global aggregate obviously has to group by colour, as requested in the query, and this is confirmed if you look at the Group By property of the Stream Aggregate. But what happens in the local aggregation in the Hash Match (Aggregate) operator? The Product table is not joined yet so we do not have the colour available. To see how the groups are defined here, we’ll have to look at the Hash Keys Build property of this operator, which tells us that the local aggregation groups by ProductID.

What happens here is that the global aggregation, which has a group for each colour, is subdivided into a group for each product. Some colours might be used for just a single product, but most are used in multiple products. You can see this by looking at the Number of Rows property of the various operators – after grouping by ProductID, the 121,317 rows from SalesOrderDetail are reduced to just 266 rows for 266 distinct products, and after the final aggregation we are left with just 9 rows, for nine colours.

As before, the local-global aggregation pattern does not save work by itself; it actually introduces a bit of extra work. And in this case the plan is serial, so we never have more than one worker. And yet this pattern makes sense here – because the extra work introduced by the local-global aggregation is far less than the work saved by joining only 266 rows instead of 121,317.
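If you wanted to hand-write the same optimization yourself (just to make the pattern explicit; for this query the optimizer already does it for you), the rewrite could look like this:

SELECT     p.Color,
           SUM(sod_local.SumQty) AS TotalQty
FROM      (SELECT   ProductID,
                    SUM(OrderQty) AS SumQty        -- local aggregation: one row per product
           FROM     Sales.SalesOrderDetail
           GROUP BY ProductID) AS sod_local
INNER JOIN Production.Product AS p
      ON   p.ProductID = sod_local.ProductID
GROUP BY   p.Color;                                -- global aggregation: one row per colour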

Local-global aggregation done by you

So far I have shown you what local-global aggregation is, and how the optimizer will use this when using this pattern can improve the overall query performance. It is now time to address the “DIY” part of the title.

There may be situations where a query might benefit from local-global aggregation, but the optimizer can’t or doesn’t use it by itself. This happens very often when you rewrite queries on a columnstore table to work around the limitations of batch mode execution; I cover this extensively in levels 10 and 11 of the Stairway to Columnstore Indexes (not yet published at the time of writing this blog post). It can also, less frequently, happen in other situations.

To see an example, let’s return to the AdventureWorks sample database. Two of its tables are Production.TransactionHistory and Production.TransactionHistoryArchive. The second one is used for older data that will not change anymore and is less frequently queried. There are lots of ways to implement such a business need; this is the one that was chosen for AdventureWorks (and one that I have also seen at a lot of real companies).

The management dashboard includes a view that shows aggregated data from these tables, as follows:

CREATE VIEW DashBoard_ProductsSold
WITH SCHEMABINDING
AS
SELECT   ProductID, SUM(Quantity) AS TotalSold
FROM    (SELECT ProductID, Quantity
         FROM   Production.TransactionHistory
         UNION ALL
         SELECT ProductID, Quantity
         FROM   Production.TransactionHistoryArchive) AS c
GROUP BY ProductID;

Lately, the DBA team has noticed that more and more managers get access to this dashboard, and some tend to refresh quite often. And every time the dashboard is refreshed, all 200K rows in the two tables have to be read. This is starting to affect overall performance of the system, so we want to index this view. The benefit of an indexed view is that the results are stored, so now refreshing the dashboard is just a simple index scan. The downside is that the stored results need to be updated whenever data changes so there will be some overhead on modifications. In this case, we expect this overhead to be outweighed by the saved IO and processing whenever the dashboard is loaded or refreshed.

However, when we try to index the view we get an error:

CREATE UNIQUE CLUSTERED INDEX cix_DashBoard_ProductsSold
ON dbo.DashBoard_ProductsSold (ProductID);

Msg 10109, Level 16, State 1, Line 1
Cannot create index on view "AdventureWorks2012.dbo.DashBoard_ProductsSold" because it references derived table "c" (defined by SELECT statement in FROM clause). Consider removing the reference to the derived table or not indexing the view.

There are lots of limitations for indexed views, and this is one of them. Because the data is in two different tables, we have to combine it and then aggregate the combined result. That isn’t possible without using a derived table or another similar (and also forbidden) construction, so we will have to find another way to optimize the dashboard. And that’s where DIY local-global aggregation comes in. In this case the data is already divided into two smaller groups (the two tables); instead of combining the data and then aggregating, we can aggregate each table individually and then combine the results. Let’s show this in two steps.

The first step is for the local aggregation. For this, we need to create two new views:

CREATE VIEW DashBoard_ProductsSold_History
WITH SCHEMABINDING
AS
SELECT   ProductID, SUM(Quantity) AS TotalSold
       , COUNT_BIG(*) AS NumRows            -- Added to satisfy indexed view req'ment
FROM     Production.TransactionHistory
GROUP BY ProductID;
GO

CREATE VIEW DashBoard_ProductsSold_History_Archive
WITH SCHEMABINDING
AS
SELECT   ProductID, SUM(Quantity) AS TotalSold
       , COUNT_BIG(*) AS NumRows            -- Added to satisfy indexed view req'ment
FROM     Production.TransactionHistoryArchive
GROUP BY ProductID;

These two views can each be indexed without problems:

CREATE UNIQUE CLUSTERED INDEX cix_DashBoard_ProductsSold_History
ON dbo.DashBoard_ProductsSold_History (ProductID);

CREATE UNIQUE CLUSTERED INDEX cix_DashBoard_ProductsSold_History_Archive
ON dbo.DashBoard_ProductsSold_History_Archive (ProductID);

And then, as the final step, we change the original view to return the same results by performing global aggregation on the locally aggregated data from the new views:

ALTER VIEW DashBoard_ProductsSold
WITH SCHEMABINDING
AS
SELECT   ProductID, SUM(TotalSold) AS TotalSold
FROM    (SELECT ProductID, TotalSold
         FROM   dbo.DashBoard_ProductsSold_History WITH (NOEXPAND)
         UNION ALL
         SELECT ProductID, TotalSold
         FROM   dbo.DashBoard_ProductsSold_History_Archive WITH (NOEXPAND)) AS c
GROUP BY ProductID;

(The NOEXPAND hints are added to ensure that this works on any edition of SQL Server; on the Enterprise and Developer editions the hints are not required but they do not hurt either).

The last view cannot be indexed because of the derived table. But this view does not need to be indexed, because the local aggregation views already are. If you query this view and check the execution plan, you’ll see fewer than 1,000 rows read from the two views, as opposed to the 200K rows we were reading before.
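If you want to verify that for yourself, a quick check (my own snippet, assuming the views above are in place) is to compare the logical reads reported by STATISTICS IO:

SET STATISTICS IO ON;

SELECT  ProductID, TotalSold
FROM    dbo.DashBoard_ProductsSold;

SET STATISTICS IO OFF;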

With these same local-aggregated indexed views as input, I can easily extend the dashboard to also show an average, or a count of rows:

ALTER VIEW DashBoard_ProductsSold
WITH SCHEMABINDING
AS
SELECT   ProductID,
         SUM(TotalSold) AS TotalSold,
         SUM(NumRows) AS NumRows,
         SUM(TotalSold) * 1.0 / SUM(NumRows) AS AvgSold
FROM    (SELECT ProductID, TotalSold, NumRows
         FROM   dbo.DashBoard_ProductsSold_History WITH (NOEXPAND)
         UNION ALL
         SELECT ProductID, TotalSold, NumRows
         FROM   dbo.DashBoard_ProductsSold_History_Archive WITH (NOEXPAND)) AS c
GROUP BY ProductID;

Adding a minimum or a maximum is slightly more work, because I would need to add the MIN and MAX functions to the local-aggregated views, but even this is hardly rocket science (a quick sketch follows below). However, things get complicated when management asks to include more advanced statistical information.
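To see why MIN and MAX are the easy ones: the global minimum and maximum are simply the minimum and maximum of the local results, so no extra ingredients are needed. A tiny sketch with made-up local aggregates (my own numbers, not AdventureWorks data):

DECLARE @LocalAggs table
       (GroupID int NOT NULL,
        MinSold int NOT NULL,
        MaxSold int NOT NULL);

INSERT  @LocalAggs
VALUES  (1, 3, 8),      -- pretend: MIN/MAX from the History view
        (2, 5, 12);     -- pretend: MIN/MAX from the Archive view

SELECT  MIN(MinSold) AS GlobalMin,   -- 3
        MAX(MaxSold) AS GlobalMax    -- 12
FROM    @LocalAggs;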

Those pesky statistical functions

SQL Server offers four out-of-the-box statistical functions that are more advanced than AVG: VAR, VARP, STDEV, and STDEVP. These, too, can be used in a local-global aggregation pattern. But that is more complex than for the aggregate functions we have seen so far.

It is possible that you, like me, are not a daily user of these functions. And it is even possible that your high school days are so far back that you don’t really remember what they are about. But if you work with SQL Server, then I believe that you should at least know that they exist and (roughly) know what they are and what they are used for. A very short explanation is that both variance and standard deviation are a measure of how spread out the values are. The set of numbers {0, 50, 100} has the same average as {49, 50, 51}. But the distribution is very different, and the standard deviation and variance indicate that difference.

The variance is defined in terms of the difference between each value and the average. The typical algorithm is to first calculate the average, then return to the data set and find the difference between each value and that average. The standard deviation is simply the square root of the variance. A small added complexity is that there is a subtle change in the calculation depending on whether you compute it based on all data or on a sample – that is why SQL Server has VAR and STDEV for the variance and standard deviation of a sample and VARP and STDEVP for the variance and standard deviation of the full population (the extra P stands for population). See here for a slightly deeper (but still fairly accessible) explanation.

The problem with the standard algorithm for variance and standard deviation is that it requires two passes over the data: first the average must be found, and then we need a second pass to subtract each value from that average. For computers, where I/O typically costs more than computation, we don’t want to use two passes. Luckily, smart mathematicians have found many other algorithms to compute the same result, and some of them can be implemented with a single pass over the data. The formula that SQL Server uses internally when executing a query that uses a statistical aggregate function is the so-called “naïve algorithm”, and these are also the formulas that we will use for our DIY local-global aggregation.
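If you want to convince yourself that this single-pass formula produces the same numbers as the built-in functions, a quick check on a few hard-coded values will do. This snippet is mine (it reuses the {49, 50, 51} set from earlier); it compares VARP and STDEVP with the naïve formula, which needs only the count, the sum, and the sum of squares:

DECLARE @t table (Val float NOT NULL);
INSERT  @t VALUES (49), (50), (51);

SELECT  VARP(Val)                                                          AS BuiltInVarP,
        (SUM(SQUARE(Val)) - SQUARE(SUM(Val)) / COUNT(*)) / COUNT(*)        AS NaiveVarP,
        STDEVP(Val)                                                        AS BuiltInStDevP,
        SQRT((SUM(SQUARE(Val)) - SQUARE(SUM(Val)) / COUNT(*)) / COUNT(*))  AS NaiveStDevP
FROM    @t;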

Without going too deep into the formulas, I will now show the code that you can use for the local-global aggregation pattern on statistical functions. Note that in this case I have to drop and recreate all views, because indexed views cannot simply be altered and the schema-bound outer view blocks changes to the views it references. Also note that the views need to use an ISNULL function, because SQL Server misinterprets the SQUARE function as being nullable even when the input is not nullable; this nullability would trigger yet another limitation on indexed views.

DROP VIEW DashBoard_ProductsSold;
DROP VIEW DashBoard_ProductsSold_History;
DROP VIEW dbo.DashBoard_ProductsSold_History_Archive;
GO

CREATE VIEW DashBoard_ProductsSold_History
WITH SCHEMABINDING
AS
SELECT   ProductID,
         SUM(Quantity) AS TotalSold,
         SUM(ISNULL(SQUARE(Quantity),0)) AS TotalSquared,
         COUNT_BIG(*) AS NumRows
FROM     Production.TransactionHistory
GROUP BY ProductID;
GO

CREATE VIEW DashBoard_ProductsSold_History_Archive
WITH SCHEMABINDING
AS
SELECT   ProductID,
         SUM(Quantity) AS TotalSold,
         SUM(ISNULL(SQUARE(Quantity),0)) AS TotalSquared,
         COUNT_BIG(*) AS NumRows
FROM     Production.TransactionHistoryArchive
GROUP BY ProductID;
GO

CREATE UNIQUE CLUSTERED INDEX cix_DashBoard_ProductsSold_History
ON dbo.DashBoard_ProductsSold_History (ProductID);

CREATE UNIQUE CLUSTERED INDEX cix_DashBoard_ProductsSold_History_Archive
ON dbo.DashBoard_ProductsSold_History_Archive (ProductID);
GO

CREATE VIEW DashBoard_ProductsSold
WITH SCHEMABINDING
AS
SELECT   ProductID,
         SUM(TotalSold) AS TotalSold,
         SUM(NumRows) AS NumRows,
         SUM(TotalSold) * 1.0 / SUM(NumRows) AS AvgSold,
         CASE WHEN SUM(NumRows) > 0
              THEN (SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / SUM(NumRows)
         END                            AS [VarP],
         CASE WHEN SUM(NumRows) > 0
              THEN SQRT((SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / SUM(NumRows))
         END                            AS [StDevP],
         CASE WHEN SUM(NumRows) > 1
              THEN (SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / (SUM(NumRows) - 1)
         END                            AS [Var],
         CASE WHEN SUM(NumRows) > 1
              THEN SQRT((SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / (SUM(NumRows) - 1))
         END                            AS [StDev]
FROM    (SELECT ProductID, TotalSold, TotalSquared, NumRows
         FROM   dbo.DashBoard_ProductsSold_History WITH (NOEXPAND)
         UNION ALL
         SELECT ProductID, TotalSold, TotalSquared, NumRows
         FROM   dbo.DashBoard_ProductsSold_History_Archive WITH (NOEXPAND)) AS c
GROUP BY ProductID;
GO

A small footnote – the expression under the square root in the standard deviation calculation can, mathematically, never be negative. But rounding errors, especially with floating point data, introduce a theoretical possibility that the computer runs into a small negative number, which would cause a run-time error. The chance of this happening is incredibly small, but if you want to make sure that it never happens, use this longer but more defensive version instead:

ALTER VIEW DashBoard_ProductsSold
WITH SCHEMABINDING
AS
SELECT   ProductID,
         SUM(TotalSold) AS TotalSold,
         SUM(NumRows) AS NumRows,
         SUM(TotalSold) * 1.0 / SUM(NumRows) AS AvgSold,
         CASE WHEN SUM(NumRows) = 0
              THEN NULL
              WHEN (SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / SUM(NumRows) < 0
              THEN 0
              ELSE (SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / SUM(NumRows)
         END                            AS [VarP],
         CASE WHEN SUM(NumRows) = 0
              THEN NULL
              WHEN (SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / SUM(NumRows) < 0
              THEN 0
              ELSE SQRT((SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / SUM(NumRows))
         END                            AS [StDevP],
         CASE WHEN SUM(NumRows) <= 1
              THEN NULL
              WHEN (SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / (SUM(NumRows) - 1) < 0
              THEN 0
              ELSE (SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / (SUM(NumRows) - 1)
         END                            AS [Var],
         CASE WHEN SUM(NumRows) <= 1
              THEN NULL
              WHEN (SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / (SUM(NumRows) - 1) < 0
              THEN 0
              ELSE SQRT((SUM(TotalSquared) - (SQUARE(SUM(TotalSold)) / SUM(NumRows))) / (SUM(NumRows) - 1))
         END                            AS [StDev]
FROM    (SELECT ProductID, TotalSold, TotalSquared, NumRows
         FROM   dbo.DashBoard_ProductsSold_History WITH (NOEXPAND)
         UNION ALL
         SELECT ProductID, TotalSold, TotalSquared, NumRows
         FROM   dbo.DashBoard_ProductsSold_History_Archive WITH (NOEXPAND)) AS c
GROUP BY ProductID;

Note that in real cases, the source of the local aggregation can be diverse. It can be two or more views as in the example above, but it can also be a subquery with a different grouping level than the final query, or some other subquery, view or CTE. In all cases, the steps to build a DIY aggregate for variance and standard deviation are the same: ensure that the local aggregation computes the three ingredients (count of rows, sum of values, and sum of squares), then use the appropriate formula as shown above for the final result.

Conclusion

Instead of computing an aggregate function over a full data set, it is sometimes a better idea to divide the data into smaller portions, aggregate each portion individually, and then compute the final aggregates by combining the aggregations of the smaller sets (the partial aggregates). This is called local-global aggregation.

There are cases where the optimizer will do this automatically for you. Parallelism is a prime example of this, but the optimizer can also use local-global aggregation for other optimizations.

In some cases you need to explicitly force local-global aggregation in your queries. Batch execution mode, one of the reasons for the huge performance gain that columnstore indexes can achieve, has lots of limitations that require query rewrites for optimal performance; these rewrites often necessitate local-global aggregation patterns in the query. And though many of the batch mode limitations that existed in SQL Server 2012 were fixed in later versions, even in SQL Server 2016 there are still a few queries that need help to get batch mode execution.

Even when not using columnstore indexes, there are still situations that require us to write a manual local-global aggregation pattern. In this blog post I used an example based on a current table and an archive table to demonstrate how to do this.

For the aggregate functions that are most commonly known, the local-global aggregation pattern is fairly easy to implement and understand. For more advanced statistical aggregate functions, such as standard deviation and variance, the formulas to use can get quite complex. In this blog post I showed a query that includes all these formulas.

I hope this blog post can help you when you ever need to write a query with a local-global aggregation pattern – regardless of whether you know that term or not!
