In his invitation, Steve specifically asks for examples where Window functions provided a neat solution to a real world problem. Well, sorry Steve, but I am not going to do that. But your invitation did inspire to me write about the execution plans for these window functions. And there is, in fact, so much to write about it, that this is just the first part.
So that makes this post not only a T-SQL Tuesday contribution, but also part 23 in my ongoing plansplaining series: blog posts where I take an in depth look at execution plans to explain how exactly they work, and point out often overlooked details. In this post, I will look at the basics of window functions, as they have existed for over 10 years now. I will point out a few interesting performance gotchas. And there are some links to feedback items that you can upvote (if you are so inclined) to pressure Microsoft to make some changes to the optimizer and the execution engine, to get some small but still welcome performance gains.
Windows without a frame
Window functions were actually not introduced in SQL Server 2012 (sorry, Steve!), but already in SQL Server 2005. However, they only had a limited syntax back then. We could define windows, but only without frames. So when looking through the window, we would always see an entire partition. Here is an example of a query that works on SQL Server 2005 and later:
SELECT p.ProductID, p.Name, p.ListPrice, p.ProductLine, AVG (p.ListPrice) OVER (PARTITION BY p.ProductLine) AS ProdLineAvgPrice, ROW_NUMBER () OVER (PARTITION BY p.ProductLine ORDER BY p.ListPrice) FROM Production.Product AS p ORDER BY p.ProductLine, p.ListPrice;
In SQL Server 2005, 2008, and 2008R2, this was basically the only permitted syntax. The PARTITION BY clause was optional (omitting it would define the entire result set as one single partition). The rest was fixed.
So it was back then not allowed to specify an ORDER BY clause in an OVER clause that was applied to a regular aggregate function. You could only use the PARTITION BY clause, and the effect was to return the specified aggregate for a window that was defined as the entire partition. So the query above will, in each row, include the average ListPrice of all products with the same ProductLine.
However, an ORDER BY specification was allowed (and in fact even mandatory) in an OVER clause for one of the ranking functions: ROW_NUMBER, RANK, DENSE_RANK, and NTILE. This clause is needed to impose a logical ordering (‘logical’, meaning that the final result set need not be returned in that order, which is why the query above needs an ORDER BY to guarantee that the results are ordered as I want them). That logical ordering is used to determine the row number (or rank, etc) of each row, within the partition (window). So in the query above, the final column is an increasing number that resets for each new ProductLine.
Frameless window aggregation
Here is the execution plan for the query above.
(You can click on the picture to enlarge it).
(As always in these posts, I have added numbers to the operators, equal to their NodeID property, for easy reference in the text below).
As you can see, using the OVER clause on a regular aggregate turns out to be some very nice syntactic sugar for a rather complex set of operators. All of the operators with NodeID 2, 3, 4, 7, 9, 10, and 11 are needed to compute this value. I will not explain the exact way this part of the plan works, because I have already done so in part 6 of the plansplaining series. And if you experiment with the query, you will see that you get the same plan shape regardless of what aggregate function you specify.
Now I am not saying that this is bad. The alternative syntax that we had to use on SQL Server 2000 and older was not only very cumbersome to write and maintain, but also even more costly in performance. When the syntax for frame-less windows was introduced in 2005, we not only got to simplify our queries by a huge amount, we also got a “free” performance gain to boot.
But do not let the simplicity of the syntax lure you into thinking that frame-less windows are almost free. They absolutely have a (relatively high) cost, that you should be aware of.
The ROW_NUMBER() function is much cheaper. Only operators 0 and 1 are needed for this, and they are both very cheap operators. The Segment operator (#1) simply sets a Boolean to False or True depending on whether the value of its Group By column (ProductLine in this case) changes. This does require the input to be sorted, but that has already been done by the Sort operator (#5), which is needed for the final ORDER BY anyway. And the Sequence Project operator (#0) merely increments an internal counter for each row and adds it to the output, or resets that counter back to zero if the segment column that Segment #1 creates is set to True.
Note that the execution plan still looks very similar for the RANK and DENSE_RANK functions (the only difference is the addition of a second Segment operator, to determine when they need to increase their counter). An execution plan with NTILE is much more complex, because the ntile function requires that the value of COUNT(*) OVER (PARTITION BY Partition_columns) is included in the input stream, which necessitates a similar structure as what we just saw above for the windowed AVG computation.
Windows with a frame
In SQL Server 2012, the syntax of the OVER clause was extended to allow an ORDER BY as well as a ROWS or RANGE specification on any OVER that applies to a regular aggregate function. This can be used to add a frame to the window, to specify that an aggregate function should not be applied to the entire window, but only to the rows within the frame.
With this, we can write a query such as this one, to return a moving average column, that shows the average list price of the current product, the one before it, and the two following:
SELECT p.ProductID, p.Name, p.ListPrice, p.ProductLine, AVG (p.ListPrice) OVER (PARTITION BY p.ProductLine ORDER BY p.ListPrice ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS MovingAverage, ROW_NUMBER () OVER (PARTITION BY p.ProductLine ORDER BY p.ListPrice) FROM Production.Product AS p ORDER BY p.ProductLine, p.ListPrice;
To make sure that we’re all on the same page when it comes to terminology, the PARTITION BY clause still defines the window of rows that applies to each row (so, all rows with the same Productline), whereas the ROWS BETWEEN specification defines a frame that limits the observed rows to only those within the frame.
The execution plan is both more and less complex. It is one straight line of operators calling other operators.
(You can click on the picture to enlarge it).
Many people will conclude from this picture that this is a cheap execution plan. After all, it just reads the table once, right? And then does only some processing on the rows? No extra I/O, no joins. It has to be fast! Or is it?
Unfortunately, the only part of this way too common assumption is that we only read the table once. But there is in fact a lot more I/O. And there is, in fact, even some hidden functionality that could be described as, technically, emulating a join.
So let’s go over this execution plan and see how it works.
Compute row numbers (twice??)
Following the data, the start of this execution plan is quite simple. A Clustered Index Scan (#10) to fetch all Product data, that is then sorted (Sort #9) by ProductLine and ListPrice, the window and frame columns. SQL Server has to use this Sort operator because there is no index that it can use to get the data in this correct order “for free”.
Segment #8 and Sequence Project #7 are unrelated to the framed window for the average ListPrice. They compute the ROW_NUMBER column that we also saw in the previous example, in the exact same way, just at a different spot in the execution plan. Then, to the left of that, is another combination of a Segment and a Sequence Project (#6 and #5) that does … exactly the same! In the Output List of Sequence Project #5, we see the four columns from the Product table that are needed for the SELECT list, plus two computed columns, Expr1002 and RowNumber1005, that both are exactly the same. One is computed because it is included in the SELECT list, the other is part of the standard pattern that the optimizer always generates for framed windows. Apparently the optimizer does not recognize this duplication. A missed opportunity to save some performance. Not a lot, these operators are extremely efficient and very cheap, so the duplication does not cost much. But still, I cannot help looking at this and mentally groaning. Would it really be that hard to make the optimizer recognize cases where the exact same expression is needed multiple times, and assign some internal aliases so that it has to be computed only once? I decided to at least make a feedback item for this potential performance improvement – please consider upvoting it if you also would like to see this optimization in a future version!
Compute the frame boundaries
The processing for framed windows continues with Compute Scalar #4. The Defined Values property shows that two extra columns are computed here: TopRowNumber1006 (as RowNumber1005 – 1) and BottomRowNumber1007 (as RowNumber1005 + 2). The names used for these internal columns are already a hint at their function, and the similarity of the formulas with the frame specification in the query (ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) confirm this. TopRowNumber1006 is the row number of the first row in the current frame, and BottomRowNumber1007 specifies the last row. So for, for example, the 4th row in a partition, RowNumber1005 would be 4, TopRowNumber1006 is 3, and BottomRowNumber1007 is 6, which specifies that the average ListPrice to be shown on this row has to be computed for rows 3 through 6 from this window.
Operator Segment #3 once more adds a segment column, Segment1010, that is true whenever a new value in the ProductLine column starts. This was already done twice, for Segment1008 and Segment1009, each time with the exact same specification, and on the exact same input (except a few more columns are in the rows now). So this once more appears to be redundant. And technically that is correct, but less so than in the previous case. The reason is rather technical. The parent operator of Segment #3, Window Spool #2, requires that its direct child has to be a Segment operator. I don’t know whether this is because it would otherwise not recognize which column is the segment column, or whether it is able to recognize segment columns (by name, I guess) but not able to work out which one to use in case there are multiple segment columns in the input. (This is not the case in the example I used here, but I have seen it happen in other execution plans). Either way, it would be possible to fix this. Add a Segment Column property to Window Spool (and various other operators) to indicate which input column specifies where segments start, and it is no longer needed to have the Segment operator as the direct child, which also means a single segment column can be used multiple times if multiple operators need it. But I do recognize that this is a larger engineering effort than “just” recognizing when the same computation is done twice. Even so, I still decided to make a feedback item for this idea as well – so once more, if you agree with me that this would be a nice optimization, click the link and add your vote!
Add the frame
Now that all the preparations are done, it is time to start the actual processing for the framed window aggregation. The key operator here is Window Spool #2. Every operator that has “spool” in its name saves a copy of rows it receives, in order to then return them multiple times. For a Window Spool specifically, the incoming rows are stored in a temporary table with a clustered index on the row number column from the input (RowNumber1005). And it then returns, for each row in the input, a group of rows that starts with that row itself, followed by all rows with row numbers between the top row (TopRowNumber1006) and the bottom row (BottomRowNumber1007), from the same window. So for the first row of any window (as marked by the Segment operator), the row itself is returned, followed by rows 1 through 3. (Note that the top row would be 0, but there is of course no row zero). For the fifth row, Window Spool would return a group that starts with that fifth row, then rows 4 through 7 – unless the window has only five or six rows of course.
The output of the Window Spool is mostly the same columns as its input. The only differences are that the top and bottom row number are not included in the output, and one new columns is added: WindowCount1011. This is simply an integer that is increased when the operator starts to return the next group of rows. (Which, as a reminder, consists of a current row followed by all rows in the frame for that row).
If you look at the Actual Number of Rows for All Executions property of both Segment #3 and Window Spool #2, you see that, indeed, the number of rows goes up. The 504 input rows result in a grand total of 2,500 rows returned by Window Spool. Since Window Spool transforms each input row in a group consisting of the input row itself and then the four rows from its frame, minus one or two rows for the first and last rows of each window. This adds up exactly: 5 * 504 = 2,520, but there are five values of ProductLine, so five first rows that each have one row less, five last rows that have one row less, and five second to last rows that have two rows less each.
Compute the average
The final part of the puzzle is in Stream Aggregate #1. You can already see from the Actual Number of Rows for All Executions property that this operator removes all those extra rows again, to condense the output back to the same 504 rows that we had in most of the execution plan. The Group By property shows that it does this by aggregating on the WindowCount1011 column, that was just added by Window Spool. Remember, Window Spool increases this number when it starts to work on a new input row, and then returns a group of rows, all with the same value in WindowCount1011, that consists of that input row, followed by all of the rows it can see through its window frame.
The Defined Values property of Stream Aggregate #1 is very long, and rather hard to read. However, what it effectively boils down to is that most of the output columns are computed by applying the (internal) aggregate function “ANY” on the corresponding input column. The name of this function suggests that any of the input values would be a valid result, but the implementation of Stream Aggregate ensures that it actually always is the value from the first row of each group. Which, as we know, is the “current” row for the frame of rows that follows. So the end effect is to simply get the current row back from each group of rows.
The only two columns with a different specification are Expr1003 and Expr1004, which are defined as Count(*) and SUM(ListPrice) respectively. As always, when a query asks for an average, the execution plan uses an aggregation operator to compute the number of rows and the sum of the values, to then later divide these (in Compute Scalar #0 in this case) to find the actually requested average.
An extra row?
The critical reader might have noticed something strange, though. For each row, Window Spool returns a group of the current row, followed by its frame. That makes a group of five rows total (okay, four or three for the rows near the window boundaries). So then the value of Expr1003, Count(*), would be 5, even though the defined frame is just four rows: the one preceding row, the current row, and the two following rows. And Expr1004, with the sum of the ListPrice values, would include the ListPrice of the current row twice. This would return incorrect results! So how is that avoided?
The answer is that there is a kind of invisible link between Stream Aggregate #1 and Window Spool #2. Effectively, the Stream Aggregate operator “knows” that its child is a Window Spool, and that results in a slightly modified behavior. The modification, in this case, is to basically ignore the first row when computing aggregations. The ANY aggregation is a bit different, because it is computed when the internal counters are initialized, not when they are updated. So those ANY operators are still taken from this first row. But for all “real” aggregations, such as Expr1003 and Expr1004 in this case, that first row basically is ignored, which avoids counting it twice.
But why even include this row twice in the first place, and then have to add this extra logic to Stream Aggregate to then avoid counting it twice? Well, the answer to that becomes obvious when you consider that it is perfectly legal to for instance specify an OVER clause with ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING. Or ROWS BETWEEN 4 FOLLOWING AND 6 FOLLOWING. (In fact, constructions similar to this are actually used internally in some cases in execution plans for queries that use the LAG or LEAD functions!). In these cases, the current row itself is not even include in the window frame of rows that the aggregate function should operate on. But we still need the values of the current row to be returned in all non-aggregated columns. This is cleanly achieved by always using the first row in each group as the current row only, to be used for ANY but to otherwise be fully ignored by the aggregation functions, and then including an extra copy in the rest of the group if the current row should be included.
The hidden join
The Window Spool, that increases the number of rows from 504 to 2,500, is effectively joining its input data to a copy of that data itself. You could do the same thing in a query, by doing a JOIN (or APPLY) of a source to itself, with a join condition of WHERE a.RowNumber BETWEEN b.RowNumber – 1 and b.RowNumber + 2. Well, except that you’ll need to add some additional SQL gymnastics to then also add the extra copy of the current row, that Window Spool adds at the start of each group. So even though the execution plan is just a single straight line, and you do not see any join operators, there still effectively is a join going on.
In this case, the number of rows goes up by “only” a factor five: the number of rows in each frame, plus one for the current row. But what if a query uses larger frames? Well, in that case, the factor goes up accordingly. Define a frame as ROWS BETWEEN 1000 PRECEDING AND 2000 FOLLOWING, and for sure, the output of the Window Spool will be up to 3,000 rows for each input row. (Less for rows near the window boundaries, and of course also less if your input size is not that large).
Depending on the theoretic maximum number of rows that the spool has to store, this hidden join might actually also cause a lot of extra I/O. The cut-off point here is 10,000 rows. If a window frame is declared with ROWS BETWEEN, and the number of rows between the first and last row of the frame cannot exceed 10,000, then Window Spool will store its worktable in memory only. But for a ROWS BETWEEN specification where the boundary points are more than 10,000 rows apart, or for any RANGE BETWEEN specification, the actual number of rows to store might exceed 10,000. In that case, the worktable will be stored on disk, on tempdb. And repeatedly reading the rows from this table to return them for each next group can incur a high amount of logical reads.
Let’s look at a new query to see the effect of a disk-based worktable.
SELECT p.ProductID, p.Name, p.ListPrice, p.ProductLine, MAX (p.ListPrice) OVER (PARTITION BY p.ProductLine ORDER BY p.ListPrice ROWS BETWEEN 10000 PRECEDING AND CURRENT ROW) AS MovingMax FROM Production.Product AS p;
The execution plan for this query looks very similar to the previous execution plan:
The main difference is that Compute Scalar #0 is not needed (because MAX can be directly computed by Stream Aggregate), and the extra operators to compute the ROW_NUMBER column for the SELECT list are gone, because I don’t request that column anymore. Other differences are in the properties of some operators: Stream Aggregate now only computes a MAX(ListPrice) rather than Count(*) and SUM(ListPrice), and the Compute Scalar now computes the top row number as row number – 10000. The bottom row number is not computed at all: Window Spool only requires a bottom row number in its input if the frame does not end at the current row.
But the biggest difference is not visible in the execution plan at all. If you run this query after running SET STATISTICS IO ON, you will see in the Messages tab that a lot of extra I/O had to be done!
Table 'Worktable'. Scan count 509, logical reads 3198, [...] Table 'Product'. Scan count 1, logical reads 15, [...] Completion time: 2023-11-09T15:41:38.5072710+01:00
One would expect that, in this case, the Window Spool operator would also report these logical reads, in its own Actual I/O Statistics property. However, as you can see in the screenshot on the right, this property is not exposed in this case. I assume that this is due to an oversight on Microsoft’s end, and I have made a feedback suggestion to ask them to add this property to Window Spool in all cases where the worktable is stored on disk. Once more, your upvotes are very welcome!
Now, you may think that just under 3,200 logical reads is not that much. But let’s not forget that this is based on a result set of just 504 rows! Imagine what would happen if you try this on a table that has any serious amount of data!
Or wait. Why imagine, when we can test? The largest table in the AdventureWorks sample database is SalesOrderDetail, so let’s just try what happens if we run a query that uses a framed window aggregate on this table, with a frame that is larger than the 10,000 cut-off point, so that the worktable will be disk-based.
Here is the sample query.
SELECT sod.SalesOrderID, sod.SalesOrderDetailID, sod.ProductID, MAX (sod.ModifiedDate) OVER (ORDER BY sod.SalesOrderID, sod.SalesOrderDetailID ROWS BETWEEN 9999 PRECEDING AND CURRENT ROW) AS MovingMax FROM Sales.SalesOrderDetail AS sod;
Let’s first look at the execution plan only, without executing the query:
Again, just this single long line of operators. Almost identical to the previous plan, except that there is now no Sort at the right-hand side – the optimizer is able to get the data in the required order “for free”, by doing an ordered scan on the clustered index. However, the width of the arrow from Window Spool to Stream Aggregate is slightly concerning. And it becomes much more than just “slightly” concerning when we hover our mouse over that arrow and look at the tooltip:
Yes, you read that right. The optimizer estimates that Window Spool will return over 1.2 billion rows to Stream Aggregate. And no, that is not an estimation error. The input to Window Spool is 121,317 rows, there is no PARTITION BY in the OVER clause so it’s all just a single window, and apart from the first 9,999 rows, each row will be returned along with a frame of 10,000 rows. So that is 10,001 * 121,317 rows, minus a bit for the first 1,000 rows that have a smaller frame. That is, indeed, over 1.2 billion.
Running this query takes a lot of time, simply due to the sheer amount of rows that is produced, passed around, and processed in the aggregation. On my laptop, it took 3 minutes and 41 seconds for the query to complete. But at least there was no I/O for the Window Spool. The maximum size of the worktable is exactly at the cut-off point of 10,000 rows: the 9,999 preceding rows, plus the current row. So all data is stored in memory.
But that changes when I modify the query to read BETWEEN 10000 PRECEDING AND CURRENT ROW. Now, the maximum frame size is 10,001 rows. Too much for the in-memory worktable, so now Window Spool will store its worktable in tempdb. And that increases the execution time of this already slow query even more: to a dramatic 6 minutes and 20 seconds, an over 75% increase! And the SET STATISTICS IO output show us why: almost 7.5 million logical reads on the worktable!
Table 'Worktable'. Scan count 121318, logical reads 7487227, [...] Table 'SalesOrderDetail'. Scan count 1, logical reads 1238, [...] Completion time: 2023-11-09T16:38:59.8254460+01:00
Now, it is not all bad news. The chance that you would ever need a window frame of over 10,000 rows is not really that big. It is far more common to use the WITH UNBOUNDED PRECEDING specification. And while that, in fact, could also theoretically result in frames of more than 10,000 rows, there is one extra special trick, called “fast-track optimization”, that the optimizer can use specifically for the WITH UNBOUNDED PRECEDING case, to make those queries go really fast again. We will look at that in the next part of this series.
We already looked at the execution plan for basic window aggregation in the past. We now also briefly looked at how ranking functions are computed. But most of this post was about how aggregates are computed for window aggregation when a frame is added to limit the scope of the window.
There is far more to be told about the execution plans for window functions and related functions. But that will have to wait until a future post.