The Segment operator, an error in Books Online

If you are as obsessed with execution plans as I am, you must be familiar with this page in Books Online: Showplan Logical and Physical Operators Reference. I must admit that I have mixed feelings about it. My problem with this page is the confusing way in which it mixes physical operators, logical operations (called logical operators on that page to add to the confusion), and operators that only exist in intermediate phases of query optimization and that we will never see in actual execution plans. But I do like it for presenting all operators that exist, with a short description. And though the descriptions are very short and incomplete, they are at least correct. Well … usually.

The Segment operator

Let’s first take a look at a simple query on the AdventureWorks sample database:

SELECT  c.CustomerID,
        c.TerritoryID,
        ROW_NUMBER() OVER (PARTITION BY c.TerritoryID ORDER BY c.CustomerID)
FROM    Sales.Customer AS c;

The execution plan for this query looks simple enough:

The Index Scan operator, which has its Ordered property set to True, is guaranteed to return rows in index key order. That means that the rows come from this operator ordered by TerritoryID and CustomerID, which is the ideal ordering for this query. The rows flow through a Segment operator that we will look at shortly, and then to a Sequence Project operator that computes the ROW_NUMBER() expression (this can be seen in its Defined Values property). The final results are then returned to the client.

The reason that a Segment operator is needed is that Sequence Project is very good at counting rows to find the ROW_NUMBER, but that it has no idea how to apply the PARTITION BY specification that we requested. That’s what the Segment operator is for.

Books Online

The Segment operator, like all operators, is described at the Books Online page mentioned above. Here is the description, quoted verbatim:

Segment is a physical and a logical operator. It divides the input set into segments based on the value of one or more columns. These columns are shown as arguments in the Segment operator. The operator then outputs one segment at a time.

Looking at the properties of the Segment operator, we do indeed see the argument mentioned in this description, in the Group By property (highlighted in the screenshot). So this operator reads the data returned by the Index Scan (sorted by TerritoryID, which is required for it to work; this is why the Index Scan operator is instructed to perform an ordered scan), and divides it into segments based on this column. In other words, this operator is a direct implementation of the PARTITION BY specification. Every segment returned by the operator is what we would call a partition for the ROW_NUMBER() function in the T-SQL query. And this enables the Sequence Project operator to reset its counters and start at 1 for every new segment / partition.

The problem

Now that we have seen what the operator does, let’s discuss why I consider the description in Books Online to be inaccurate and misleading. The problem is in the last sentence: “The operator then outputs one segment at a time”. It may be a language thing (English is not my native language), in which case this entire post becomes a non-issue. But when I read that sentence, I can’t interpret it in any other way than to mean that the Segment operator processes input data until it has a complete segment, and then returns all rows belonging to that segment before returning to the input. So this would mean that it would first read all 3,520 rows that have TerritoryID = 1 from Index Scan before returning the first, then return the remaining 3,519 rows from that segment without calling Index Scan, and then repeat that pattern for the next territory. In other words, Books Online describes Segment as a partially blocking operator.

By default, every operator in an execution plan has some memory available for storing the current state, the current row, and some other housekeeping data. That memory is not enough to store 3,520 rows. So if the Books Online description is accurate, the Segment operator would need to store these rows somewhere else.

… but where?

Execution plan operators have only two options for storing data beyond their standard small working area. These options are: in memory, or in tempdb.

If an operator has to store extra data in memory, then the optimizer will estimate the amount of memory needed and record that in the execution plan as the memory grant. In an actual execution plan, this memory grant is visible on the top left node. The property is always shown when at least one operator in the execution plan needs additional memory beyond the normal working area.

When an operator uses tempdb to store data, it allocates either a worktable or a workfile in tempdb. All I/O to this structure is then visible if you run the query with SET STATISTICS IO enabled. This is not conditional: as soon as an operator might need to use tempdb, the worktable or workfile is always listed in the SET STATISTICS IO output, even when the work area turned out not to be needed at run time. That’s why any execution plan with a Sort operator always shows a worktable in SET STATISTICS IO, with all numbers equal to zero unless the Sort had to spill to tempdb.

The execution plan for the query used in this post does not show a Memory Grant property on the SELECT node. And there is no worktable or workfile in the SET STATISTICS IO results. So the only logical conclusion is that the operator does NOT store the 3,520 rows of territory 1. Which in turn means that it has to return them as it reads them: Segment is not in any way blocking; it is definitely a fully streaming operator.
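If you want to repeat this check yourself, the script below is a minimal sketch. It assumes a standard AdventureWorks installation and that the “Include Actual Execution Plan” option is switched on in Management Studio; the RowNo alias is only added for readability and is not part of the original query.

```sql
-- Report the I/O per table on the Messages tab. Any worktable or workfile
-- that an operator in the plan could use is listed here as well.
SET STATISTICS IO ON;

SELECT  c.CustomerID,
        c.TerritoryID,
        ROW_NUMBER() OVER (PARTITION BY c.TerritoryID ORDER BY c.CustomerID) AS RowNo
FROM    Sales.Customer AS c;

-- Switch the option off again so later queries in this session are not affected.
SET STATISTICS IO OFF;
```

On the Messages tab, look for any line that starts with Table 'Worktable' or Table 'Workfile'. In the execution plan, click the SELECT node and look for a Memory Grant property. If neither is present, no operator in the plan had anywhere to buffer a full segment of rows.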

In case the above does not convince you, here is another way to explain this. The video above shows the same query executing with the Live Query Statistics feature, slowed down extremely so it is easier to follow. You can clearly see that the Segment operator does not do any buffering at all. All rows it receives are immediately passed to the Sequence Project operator; the entire execution plan is flowing and there are no blocking or semi-blocking operators.

What it actually does

Now that we have established that the text in Books Online does not accurately describe how the Segment operator works, let’s try to find out how it does work. A firm clue is available in the property list of this operator, as shown in the screenshot.

The property “Segment Column” reports a column name. And this column, Segment1003 in this case, is included in the Output List property of the operator; this means that this column is included in the rows that are returned to the Sequence Project operator.

This column is the secret key. When the Segment operator reads a row, it compares the values in the Group By columns to those of the previous row. If they have changed since the previous row (or if there is no previous row because this is the first row processed), the Segment Column is set to a specific value (probably 1). For all other rows, the Segment Column has a different value (e.g. 0). The result is that the Segment Column can be used to very easily identify the row that marks the start of a new segment.
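The comparison with the previous row can be mimicked in plain T-SQL. This is only a sketch of the logical result, not of the operator’s internals. The names PrevTerritoryID, InputRowNo and SegmentFlag are made up for this illustration, the values 1 and 0 follow the guess above, and the LAG function requires SQL Server 2012 or later.

```sql
WITH Ordered AS
(
    SELECT  c.CustomerID,
            c.TerritoryID,
            -- Group By value of the previous row in index key order.
            LAG(c.TerritoryID) OVER (ORDER BY c.TerritoryID, c.CustomerID) AS PrevTerritoryID,
            -- Position in the input stream, used to recognize the very first row.
            ROW_NUMBER()       OVER (ORDER BY c.TerritoryID, c.CustomerID) AS InputRowNo
    FROM    Sales.Customer AS c
)
SELECT  o.CustomerID,
        o.TerritoryID,
        CASE
            -- No previous row: the first row always starts a segment.
            WHEN o.InputRowNo = 1 THEN 1
            -- NULL-safe "is different from" test on the Group By column.
            WHEN EXISTS (SELECT o.TerritoryID EXCEPT SELECT o.PrevTerritoryID) THEN 1
            ELSE 0
        END AS SegmentFlag
FROM    Ordered AS o
ORDER BY o.TerritoryID, o.CustomerID;
```

The EXISTS … EXCEPT construction is used instead of a plain <> comparison because TerritoryID allows NULLs, and a segment boundary between NULL and a real value would otherwise be missed.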

The Sequence Project is one of a number of operators that are “segment aware”. Being segment aware means that the operator looks at the segment column in its input stream and changes its behavior when the value in that column marks the start of a new segment. In the case of Sequence Project, the response to the start of a segment is that it resets its internal counters and restarts the row counting at 1, which results in the ROW_NUMBER() column resetting in the result of the query.
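To illustrate how a start-of-segment marker is enough to restart the count, here is a sketch that derives the row number from nothing but such a marker and the position in the input stream. It is an emulation of the logical result under my own assumptions, not a description of how Sequence Project is coded. All column names other than CustomerID and TerritoryID are hypothetical, and the window frame syntax requires SQL Server 2012 or later.

```sql
WITH Ordered AS
(
    SELECT  c.CustomerID,
            c.TerritoryID,
            LAG(c.TerritoryID) OVER (ORDER BY c.TerritoryID, c.CustomerID) AS PrevTerritoryID,
            ROW_NUMBER()       OVER (ORDER BY c.TerritoryID, c.CustomerID) AS InputRowNo
    FROM    Sales.Customer AS c
),
Flagged AS
(
    SELECT  o.CustomerID,
            o.TerritoryID,
            o.InputRowNo,
            -- 1 on the first row of each segment, 0 on all other rows.
            CASE
                WHEN o.InputRowNo = 1 THEN 1
                WHEN EXISTS (SELECT o.TerritoryID EXCEPT SELECT o.PrevTerritoryID) THEN 1
                ELSE 0
            END AS SegmentFlag
    FROM    Ordered AS o
)
SELECT  f.CustomerID,
        f.TerritoryID,
        -- Position of the current row minus the position of the most recent
        -- segment start, plus 1: a counter that restarts at every marker.
        f.InputRowNo
          - MAX(CASE WHEN f.SegmentFlag = 1 THEN f.InputRowNo END)
                OVER (ORDER BY f.InputRowNo ROWS UNBOUNDED PRECEDING)
          + 1 AS EmulatedRowNumber,
        -- The real thing, for comparison.
        ROW_NUMBER() OVER (PARTITION BY f.TerritoryID ORDER BY f.CustomerID) AS RealRowNumber
FROM    Flagged AS f
ORDER BY f.InputRowNo;
```

The two rightmost columns should match on every row. Note that the emulated version never looks at TerritoryID directly; it only reacts to the marker, which is the essence of being segment aware.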

Improved text

Based on this, the text in Books Online should be changed to better reflect how this operator actually works and to avoid confusion over this operator being blocking or not. Here is my proposed replacement text:

Segment is a physical and a logical operator. It divides the input set into segments based on the value of one or more columns. These columns are shown as arguments in the Segment operator. The operator then marks the start of each new segment in a specific Segment Column that is added in the output stream.

Conclusion

Books Online is a fantastic resource. But, like any man-made product, it may contain errors. I recommend using Books Online whenever appropriate. But as with anything you read anywhere: no matter how authoritative the source is, you should never take it at face value. Always use your common sense and never forget that to err is human. Trust, but verify!
