Window Spool

Introduction

The Window Spool operator is one of the four spool operators that SQL Server supports. Like other spool operators, it retains a copy of data it receives and can then return those rows as often as needed. The specific functionality of the Window Spool operator allows it to replay rows within a window, as defined in a ROWS or RANGE specification of an OVER clause.

The other spool operators supported by SQL Server are:

Table Spool, the basic spool operator. It simply stores data and returns it multiple times, with no special logic. It does have the ability to return the same data in multiple branches of an execution plan, something other spool operators cannot do.
Index Spool, which builds a temporary index on the saved data. This enables it to return specific subsets of the stored data efficiently. A special case scenario in an Index Spool supports stack spools, for recursive queries.
Row Count Spool is optimized for specific cases where the rows to be returned are empty. It doesn’t store the rows it receives, but merely stores a counter.

The Window Spool operator is typically found in execution plans for queries that use the windowing extensions (ROWS or RANGE) of the OVER clause.

Visual appearance in execution plans

Depending on the tool being used, a Window Spool operator is displayed in a graphical execution plan as shown below:

SSMS and ADS (current versions)	Legacy SSMS	Plan Explorer	Paste The Plan

Algorithm

The basic algorithm for the Window Spool operator is as shown below:

Note that some of the boxes here represent more than a single action. Also note that, depending on the arguments used in the OVER clause, some of these actions can change significantly. All of this is detailed in the sections below.

Algorithm variations

As indicated above, the exact specification of the window of rows that should be visible in an OVER specification influences how the algorithm behaves. The following variations exist:

Algorithm for ROWS specification without UNBOUNDED

The basic algorithm, as outlined in the flowchart above, is for a window specification that uses the ROWS keywords, and that does not use UNBOUNDED. The most typical for is BETWEEN ROWS n PRECEDING AND m FOLLOWING, where often CURRENT ROW is used as a synonym for 0 PRECEDING (or 0 FOLLOWING). The values n and m cannot be negative, but the equivalent can be achieved by using either PRECEDING or FOLLOWING to specify both start and end of each window.

For this basic version, the Window Spool will for each row output that row, followed by all rows in the window it sees. If the current row is included in the window, it will be returned once: as the (first) current row, and then again as one of the rows in the window. But note that it is possible to have window specifications where the current row itself is not included in the window.

The parent operator (which appears to be always a Stream Aggregate) is aware that the first row of each window has to be treated slightly differently, to prevent counting its values if the row is not included in the window, or double counting it if it is.

The optimizer always ensures that these columns are present in the input:

A Segment column, created by a Segment This column is used to detect when the columns of the PARTITION BY clause for the OVER expression change. (If there is no PARTITION BY clause in the OVER expression, there still needs to be a Segment column. In this case that column will mark the entire input as a single segment).
A RowNumber column, created by a Sequence Project operator. This is simply a row_number within the current partition.
A TopRowNumber column, computed by a Compute Scalar operator as the RowNumber column minus the number of rows PRECEDING (or plus the number of rows FOLLOWING) for the start of the window (note that CURRENT_ROW is equivalent to 0 PRECEDING).
A BottomRowNumber column, computed by a Compute Scalar operator as the RowNumber column plus the number of rows FOLLOWING (or minus the number of rows PRECEDING) for the end of the window (note that CURRENT_ROW is equivalent to 0 FOLLOWING).

Algorithm for ROWS specification with UNBOUNDED PRECEDING (“fast-track optimization”)

For the rather common case of an OVER clause that specifies ROWS WITH UNBOUNDED PRECEDING AND something, an optimized version of the algorithm, called “fast-track optimization” is used. This optimization is basically a cooperation of a behavior change of Window Spool and a behavior change of its parent Stream Aggregate operator. The Window Spool does not replay the entire window for each row, and the Stream Aggregate does not reset the aggregates it has, but keeps them as a running value and only applies the new rows for each window.

Note that, in some cases, a window specification that does not use UNBOUNDED PRECEDING can be transformed internally into an alternative but equivalent form that does have UNBOUNDED PRECEDING and that hence can uses this form of window specification can be transformed into an alternative form that can benefit from fast-track optimization. This transformation is described in more detail below.

For this case, the optimizer ensures that the input rows include these columns:

A Segment column, created by a Segment This column is used to detect when the columns of the PARTITION BY clause for the OVER expression change. (If there is no PARTITION BY clause in the OVER expression, there still needs to be a Segment column. In this case that column will mark the entire input as a single segment).
A RowNumber column, created by a Sequence Project operator. This is simply a row_number within the current partition.
Optionally, a BottomRowNumber column, computed by a Compute Scalar operator as the RowNumber column plus the number of rows FOLLOWING (or minus the number of rows PRECEDING) for the end of the window. If the end of the window is defined as CURRENT_ROW (or as any of its equivalents 0 FOLLOWING or 0 PRECEDING), then there is no BottomRowNumber column in the input; the operator processes this as if BottomRowNumber is equal to RowNumber for all rows.

Algorithm for RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

When a window is specified using RANGE instead of ROWS, only UNBOUNDED PRECEDING, CURRENT ROW, and UNBOUNDED FOLLOWING are allowed to define the window boundaries. All variations that use UNBOUNDED FOLLOWING are internally converted to something else as detailed in the sections below. So the only “real” range-based window specification is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

All rows that have the same value in the ORDER BY columns use the same window. To save performance, the same optimization between Window Spool and Stream Aggregate that is used for fast-track optimization is also used, in a slightly different way, for RANGE specifications. After a change in the ORDER BY columns, Window Spool returns that row, followed by all (read ahead) rows with the same ORDER BY columns, so Stream Aggregate adds those rows to the running totals is has. For the rest of those rows, Window Spool then outputs only the row itself, so Stream Aggregate resets the individual column values but leaves all aggregations unchanged.

This version of the algorithm expects only two specific columns in the input rows:

A Segment column, created by a Segment This column is used to detect when the columns of the PARTITION BY clause for the OVER expression change. (If there is no PARTITION BY clause in the OVER expression, there still needs to be a Segment column. In this case that column will mark the entire input as a single segment).
A second Segment column, used to detect when the any of the columns in the combined PARTITION BY and ORDER BY clause for the OVER expression change.

Algorithm for specifications with UNBOUNDED FOLLOWING

If a query uses a ROWS or RANGE specification with UNBOUNDED FOLLOWING, the optimizer reverses the specified sort order and swaps and reverses the start and end specifications. The end result is exactly equivalent, but uses UNBOUNDED PRECEDING. This allows the Window Spool operator to then use any of the three algorithms described above.

Algorithm for specifications with UNBOUNDED FOLLOWING and UNBOUNDED PRECEDING

A query that specifies ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING or RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is logically equivalent to a query that uses only PARTITION BY, without ORDER BY and ROWS or RANGE specification. The optimizer makes this transformation and then produces a query plan that uses no Window Spool operator. So there is no algorithm in Window Spool for this situation.

Worktable

The Window Spool operator always stores data in a worktable. This worktable may be either stored in memory, or on disk (in tempdb). When stored on disk, the worktable is stored as a clustered index on the RowNumber column in the input. The structure of the memory-based worktable is, at this time, unknown. The disk-based version of the worktable may not always actually hit the disk (it uses the buffer pool, like any other disk-based table in SQL Server), but even if it doesn’t, it’s still heaps slower than the memory-based worktable.

The memory-based worktable is limited in size to at most 10,000 rows. The choice which version is used is based on the window definition only, not on the actual amount of data processed by the execution plan. So even when you work with small tables, if the window specification could theoretically result in more than 10,000 rows in a window, the operator will use the disk-based worktable.

Concretely, this means that any window based on a RANGE specification will always use the disk-based worktable, because there is no theoretic maximum to the number of rows that can have the same values in the ORDER BY columns. For a ROWS specification, a disk-based worktable is used if the window starts at 10000 or more PRECEDING, ends at 10000 or more FOLLOWING, or is BETWEEN n PRECEDING AND m FOLLOWING, with n + m + 1 is 10001 or more.

Add read-ahead rows to worktable

When the Window Spool operator reads the first row of a partition (indicated by the Segment column in the input), it immediately reads all rows that are included in the window for that first row. For a ROWS based window specification, these are all rows with their RowNumber column less than or equal to the BottomRowNumber of the current row. For a RANGE based window specification, the read-ahead reads and stores all rows until the second Segment column indicates a change in the ORDER BY keys.

For each next row in the same partition, the same logic is repeated. In the case of a ROWS based window specification, this typically involves reading just a single row (unless the read-ahead process is already past the end of the window). For a RANGE based window specification, no extra rows are read until a row with second Segment column set is the current row; at that point the entire next set of rows for that segment is read,

Return rows in window

For every row in the input, Window Spool normally returns first the row itself, then all the rows that are included in the defined window; for some window specification there can be rows that have no rows at all in their window.

For the base scenario of a ROWS specification without UNBOUNDED, the window includes all rows with a RowNumber column between BottomRowNumber and TopRowNumber of the current row. For a disk-based worktable, the clustered index allows the operator to quickly find these rows. For the memory-based worktable, I assume the internal structure allows effective access to these rows as well.

For the scenario of a ROWS specification that starts at UNBOUNDED PRECEDING (the fast-track optimization scenario), the operator does not repeat the entire window; it only repeats the row to be added to the window. This is the row with RowNumber equal to the BottomRowNumber of the current row. When the input has no BottomRowNumber column, then the row repeated is exactly the current row.

For a RANGE specification, the window is always equal to all rows in the worktable, so this step simply reads and returns all of them.

Remove rows no longer needed

Once the starting point of the window has moved past a row and that row has been processed as current row, the row will not be returned anymore and hence is no longer needed.

When using a disk-based worktable, the Window Spool operator does not remove these rows. The clustered index on the worktable allows the operator to access rows needed directly. Removing this data would reduce the amount of data used by the worktable in tempdb, but at the price of more processing. Apparently Microsoft decided against this.

For memory-based worktables, storage space is limited. Cleaning up no longer needed rows and reusing the storage space they took is critical in this case. After processing a row, any rows that have a RowNumber that is less than or equal to the lowest of the RowNumber and TopRowNumber of the current row can be removed from the worktable. For the fast-track optimization scenario, this changes to the lowest of the RowNumber and the BottomRowNumber (if included).

Whether the data in the worktable is actually removed is not known at this time. One possible alternative is to store each row in a location that is computed as the remainder of dividing its RowNumber by 10,000, which creates an in-memory table of 10,000 rows that logically wraps around to the start after reaching the 10,000^th row, overwriting whatever was there (which is by definition not needed at this time). Another alternative is to use this same method but precompute the maximum number of needed rows and use the remainder after dividing by that number (which is the same logical concept, but with a smaller memory footprint).

Fast-track optimization transformation

The normal algorithm of a Window Spool returns, for each row, the current row and then the entire window of the rows it sees. With fast-track optimization, this is true only for the first row of each partition; for all others only the current row and then the single row (if any) added to the window are returned. This is a huge performance gain. So huge, in fact, that the query optimizer will actually try to rewrite queries that do not immediately qualify for fast-track optimization into a form that does.

This rewrite is not possible if one or more of the aggregate functions used on the windowed data is a MAX or MIN function. It is also not possible if the window is specified with RANGE. And it doesn’t provide enough benefit if the maximum number of rows in the window is 4 or less. In all those cases, no rewrite is done and the standard functionality is used.

But when ROWS is used to specify a window of 5 or more rows, and the aggregate functions do not include MIN or MAX, then the optimizer will rewrite an aggregate with a window that does not allow for fast-track optimization to an equivalent form that uses two aggregates with two different window definitions that both DO qualify for fast-track optimization. The rewrite is based on the fact that all aggregates except MIN and MAX (which are already excluded) are cumulative. So this means that for each average, the result of applying an aggregate over the window defined by ROWS BETWEEN startpoint AND endpoint is always equal to the result of applying that aggregate over the window defined by ROW BETWEEN UNBOUNDED PRECEDING AND endpoint and then subtracting that aggregate over the window defined by ROW BETWEEN UNBOUNDED PRECEDING AND (one row before startpoint), with appropriate handling for the NULL values returned from aggregating an empty window.

The result will be an execution plan that at first sight appears to use many more Segment, Sequence Project, and Window Spool operators than the query seems to justify, but because these are all able to use fast-track optimization, it’s still the faster option.

Operator properties

The properties below are specific to the Window Spool operator, or have a specific meaning when appearing on it. For all other properties, see Common properties. Properties that are included on the Common properties page but are also included below for their specific meaning for the Window Spool operator are marked with a *.

Property name	Description
Defined Values *	Even though the Window Spool operator computes a new column and adds it to its output, the operator does not show a Defined Values property for this column.
Output List *	The Output List property of a Window Spool typically includes all its input columns except the RowNumber, TopRowNumber, and BottomRowNumber that are used for its processing. It also adds a new column, called WindowCountnnnn (where nnnn is a 4-digit number that is unique within the execution plan). The main text describes how this column is computed.

Implicit properties

This table below lists the behavior of the implicit properties for the Window Spool operator.

Property name	Description
Batch Mode enabled	The Window Spool operator supports row mode execution only.
Blocking	The Window Spool operator is non-blocking. Note that when a window includes “FOLLOWING” rows, then it obviously does have to read all following rows in the window before it can return a row, which (in the case of a very large amount of following rows) can be perceived as limited blocking.
Memory requirement	The Window Spool operator can store its worktable in memory or in tempdb. The in memory version is only used when the window specification in the query guarantees a maximum of 10,000 rows in any window. The amount of memory needed for this is low enough that it does not increase the query’s Memory Grant.
Order-preserving	The Window Spool operator is fully order-preserving.
Parallelism aware	The Window Spool operator is not parallelism aware.
Segment aware	The Window Spool operator requires segmented input, as described in the main text.