How TOP wrecks performance (part 2)

In my previous post, I showed how the vanilla version of TOP can really help performance by nudging the optimizer into picking an execution plan that is optimized for returning just a few rows instead of the entire result set. But I also showed that adding WITH TIES causes a cardinality misestimate, which may lead to suboptimal or downright terrible plan choices. Luckily, this is fixed in SQL Server 2014 and later, provided you use the new cardinality estimator.

In this post I will show how the PERCENT option causes even worse performance, and how you can avoid that performance hit.

PERCENT

Adding PERCENT to a TOP clause does exactly what it says on the label. Here is an example – very similar to the one used in the previous part, but I changed the ORDER BY for reasons I will explain later in the post.

SELECT   TOP(2) PERCENT CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY SalesOrderID;

This query returns 630 rows. Why? Because without the TOP clause, the query would return 31,465 rows (the number of rows in Sales.SalesOrderHeader), and two percent of that is 630 (rounded up – TOP PERCENT always rounds up).  Unlike TOP without PERCENT, the value for TOP with PERCENT can be a fractional number, e.g. TOP (2.5) PERCENT in the previous query would return 787 rows.
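
For those who want to check the arithmetic, the rounding that TOP PERCENT applies is simply a CEILING over the exact fraction. Here is a minimal sketch; the 31,465 row count is specific to the copy of AdventureWorks used in this post, and the column aliases are just for illustration.

-- Sketch of the rounding applied by TOP ... PERCENT
-- (31,465 is the row count of Sales.SalesOrderHeader in this database).
SELECT CEILING(31465 * 2.0 / 100) AS RowsAt2Percent,      -- 629.3   rounds up to 630
       CEILING(31465 * 2.5 / 100) AS RowsAt2Point5Percent; -- 786.625 rounds up to 787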

Because SQL Server does not know how many rows the query without TOP would return, it can only satisfy the TOP PERCENT requirement by first executing the query as if there were no TOP to get that row count (which, for this simple query, means reading the entire table), and then returning the required number of rows. Here is the execution plan for the query above:

(Screenshot: execution plan showing a single Clustered Index Scan feeding a Table Spool (Eager Spool) and a Top operator.)

Even though I said that SQL Server needs to first read the entire table to get the number of rows and then go back to the table to get the 630 expected rows, the plan shows just a single Clustered Index Scan operator accessing the table. So how does this work? The answer is the Table Spool operator.

Let’s step through the plan, starting (as SQL Server does) at the top left operator. The SELECT operator will call the Top operator to get a row, and Top will immediately relay that GetNext row request to the Table Spool operator. Because this spool operator is running in eager spool mode, it will call the Clustered Index Scan operator not once, but over and over again until all rows are read. All rows returned are then “spooled” by the Table Spool operator – i.e., they are written to a worktable, a temporary data structure located in tempdb, from where they can later be retrieved and served (while preserving the order). The Clustered Index Scan operator itself is running as an ordered scan (as shown in the operator properties), which is the reason why this query does not need a Sort operator to satisfy the ORDER BY clause.

Once all rows have been read from the clustered index and stored in the worktable, the Table Spool will return the first row to the Top operator. And, even though this is not visible in the plan, the Top operator also gets access to the total number of rows now stored in the worktable. This allows it to compute the number of rows it should return, and then call the Table Spool operator that many times. For each of those calls, the Table Spool will not call out to the Clustered Index Scan operator again, but simply read the next row from the worktable and return it.
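
If you want to inspect these operators and their properties yourself, one way is to capture the actual execution plan; the sketch below is just one option – the "Include Actual Execution Plan" button in Management Studio works equally well.

SET STATISTICS XML ON;
SELECT   TOP(2) PERCENT CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY SalesOrderID;
SET STATISTICS XML OFF;
-- In the returned plan, look at the Ordered property of the Clustered Index Scan
-- and at the actual row counts on the Table Spool.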

At first sight this table spool may appear to be superfluous. Why not, after getting a count of rows from the table, get back directly to the base table to get the required results? In the case of this simple query that would indeed make sense. But most queries are more complex: they join a bunch of tables together, filter out part of the results based on a complex WHERE clause, perhaps also do grouping and aggregating to get the final result – those queries can be very expensive and long-running, and in that case you really do not want SQL Server to execute the query first in its entirety to just get a row count, and then execute it again to return the requested rows. The Table Spool pattern is used to prevent that double cost, and the overhead of creating the worktable is considered small enough to not bother using a different pattern for simpler queries.

But there is a catch, and it is a big one. The optimizer seems to think that storing rows in a temporary table is “relatively” cheap, but in practice that turns out not to be the case: the mere process of storing those rows is very expensive. You can see that by rerunning the query with the STATISTICS IO set option enabled:

SET STATISTICS IO ON;

SELECT   TOP(2) PERCENT CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY SalesOrderID;

Switch to the Messages tab after executing this, and you will see that the SalesOrderHeader table has 689 logical reads – a full scan of the entire table, as expected. But you will also see 63,730 logical reads for the worktable – that seems a bit excessive for returning just 630 rows!

Spool, the incredible performance sink

It took me some time before I realised that these logical reads are not produced when reading from the worktable, but when writing to it. I was able to verify this theory by changing the TOP(…) PERCENT parameter to either a very high value (e.g. 99.99999), so that all rows from the table are returned, or to a very low value (e.g. 0.00001) to return just a single row. The logical reads change, but only a bit: the number of logical reads on the worktable varies from 63,727 to return a single row to 63,888 to return them all. This indicates that over 63,720 reads are generated while building the spool, and somewhere between one or two and 170 or so to read anywhere from a single row up to all 31,465 rows from it.
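
These are the two boundary variations described above; only the percentage differs from the original query (the exact read counts will of course vary per system).

-- Returns all 31,465 rows; the worktable reads barely change.
SELECT   TOP(99.99999) PERCENT CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY SalesOrderID;

-- Returns a single row; the worktable is still fully populated.
SELECT   TOP(0.00001) PERCENT CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY SalesOrderID;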

In order to understand why so many logical reads are generated when populating the spool, I tried variations on the same query. When I added a WHERE clause to limit the number of rows going into the spool,  I found that the number of logical reads decreases by two for every row filtered out. When I added more columns to the SELECT list, or used larger tables as the source (e.g. SalesOrderDetail, or a cross join between two tables), that number at one point changes to a difference of three logical reads per row removed. Upon seeing these ratios, I finally figured out what is causing these excessive reads – apparently, the worktable is created with an internal clustered index on top of it (even if the results stored in it can contain duplicates), and every single insert is performed by doing a seek in that index to find the location to add the row. For tiny worktables that seek starts at just a single read, but as the worktable grows, more levels have to be added to the clustered index, causing the number of reads per row to go up to 2, 3, or even more.
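
As an illustration of the kind of variation I mean, this is roughly what the WHERE clause experiment looks like; the cutoff value below is an arbitrary, hypothetical choice, there only to keep some rows out of the spool.

-- Hypothetical cutoff: every row filtered out here saves roughly two
-- logical reads on the worktable.
SELECT   TOP(2) PERCENT CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
WHERE    SalesOrderID <= 70000
ORDER BY SalesOrderID;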

The first rewrite

If the cost of writing to a spool is one, two, or even three logical reads per row inserted, then the overhead it introduces in the plan for the simple query above is way more than the saving. A much better and faster solution is to use a “Do it yourself” TOP PERCENT workaround, as follows:

DECLARE @cnt bigint;
SET @cnt =
 (SELECT COUNT_BIG(*)
  FROM   Sales.SalesOrderHeader);

SELECT TOP(CAST(CEILING(2.0 * @cnt / 100) AS int))
         CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY SalesOrderID;

This method avoids the overhead of the worktable. An additional benefit is that the number of matching rows is now found using COUNT_BIG(*), giving the optimizer more freedom to choose a different plan. In this case, it does indeed use a smaller index for the first query, reducing the number of logical reads. The total amount of work for this batch is now down to 57 logical reads for the first query and 18 for the second – a huge reduction from the 689 logical reads on the base table plus 63,730 logical reads on the worktable we started with.

But the proof of the pudding is in the eating, and the proof of the query is (for me) in the actual time used. Switching to SET STATISTICS TIME ON and running both queries several times shows that I did indeed achieve a nice performance improvement. The original query consistently takes approximately 300 ms elapsed time and burns 180 ms CPU time; the combined total of the two queries in the rewrite is just 150 ms elapsed time, and uses a mere 15 ms CPU time.

There are drawbacks too, though. One is that the query logic now has to be written twice. The SELECT clause in the first query is simplified to just a COUNT_BIG(*), but all the remaining logic has to be copied exactly – which makes this a time bomb for future maintenance. This can be slightly remedied by using a CTE, though that makes the logic a lot harder to understand:

WITH     TheResults AS
 (SELECT SalesOrderID, CustomerID, OrderDate, SubTotal
  FROM   Sales.SalesOrderHeader)
SELECT TOP(CAST(CEILING(2.0 * (SELECT COUNT_BIG(*) FROM TheResults) / 100) AS int))
         CustomerID, OrderDate, SubTotal
FROM     TheResults
ORDER BY SalesOrderID;

This version uses exactly the same number of logical reads as the two queries of the other version combined, and almost the same amount of time – but it will probably be a challenge for anyone who has never seen this query pattern before.

Another drawback of this method is that under high concurrency, the population of the table may change between the two queries. The harder-to-maintain single-query version makes the window of risk smaller, but it is not immune either – if consistency is an important requirement, you need to elevate the transaction isolation level to serializable, which will have a negative impact on concurrency.
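
Here is a minimal sketch of what that would look like for the two-query rewrite. Be aware that serializable range locks are held until the commit, so concurrent modifications can be blocked for the duration of both queries.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

DECLARE @cnt bigint;
SET @cnt =
 (SELECT COUNT_BIG(*)
  FROM   Sales.SalesOrderHeader);

SELECT TOP(CAST(CEILING(2.0 * @cnt / 100) AS int))
         CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY SalesOrderID;

COMMIT TRANSACTION;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;  -- back to the default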

The second rewrite

The method above is not always a good idea. We avoid using the spool by forcing SQL Server to execute the query twice (the CTE variation removes the duplication from the query itself, but not from the execution plan!). That is an improvement over the relatively high cost of the spool, but only if some conditions are met.

If the optimizer can simplify the first execution of the query (to get the rowcount) to a very cheap plan, and if the percentage returned (and hence the number of rows for which the full rowset has to be produced) is small, then the saving can be huge. But what if we try this with a relatively expensive query that does not allow simplification for the rowcount, and we request a high percentage?

The query below demonstrates just that. In order to get a sufficiently expensive query on a small database such as AdventureWorks I had to get a bit fancy, so this query is not very realistic – but on much bigger databases, even more realistic queries can quickly become just as expensive, or more.

SELECT TOP(10) PERCENT
               soh.CustomerID, soh.OrderDate, soh.SubTotal, Next1KSales.Total
FROM           Sales.SalesOrderHeader AS soh
CROSS APPLY
   (SELECT     SUM(sod2.LineTotal) AS Total
    FROM       Sales.SalesOrderHeader AS soh2
    INNER JOIN Sales.SalesOrderDetail AS sod2
          ON   sod2.SalesOrderID = soh2.SalesOrderID
    WHERE      soh2.SalesOrderID BETWEEN soh.SalesOrderID
                                 AND soh.SalesOrderID + 1000) AS Next1KSales
WHERE          Next1KSales.Total > 500000
ORDER BY       soh.SalesOrderID;

This query takes roughly 118 seconds to run (both elapsed time and CPU time). In the STATISTICS IO results, I see 1,347,920 logical reads on SalesOrderDetail, 801,607 on SalesOrderHeader, and 61,042 on a worktable – the latter for storing the nearly 30,000 rows that satisfy the WHERE clause, of which 10 percent (nearly 3,000 rows) are then returned. Just as in the previous example, the number of logical reads for filling the spool is excessive, so let's try to avoid it. In this case I choose not to duplicate the query in my code, which means I have to resort to the more esoteric version with a CTE:

WITH TheResults AS
 (SELECT         soh.SalesOrderID, soh.CustomerID, soh.OrderDate,
                 soh.SubTotal, Next1KSales.Total
  FROM           Sales.SalesOrderHeader AS soh
  CROSS APPLY
     (SELECT     SUM(sod2.LineTotal) AS Total
      FROM       Sales.SalesOrderHeader AS soh2
      INNER JOIN Sales.SalesOrderDetail AS sod2
            ON   sod2.SalesOrderID = soh2.SalesOrderID
      WHERE      soh2.SalesOrderID BETWEEN soh.SalesOrderID
                                   AND soh.SalesOrderID + 1000) AS Next1KSales
  WHERE          Next1KSales.Total > 500000)
SELECT TOP(CAST(CEILING(10.0 * (SELECT COUNT_BIG(*) FROM TheResults) / 100) AS int))
         CustomerID, OrderDate, SubTotal, Total
FROM     TheResults
ORDER BY SalesOrderID;

Looking at the STATISTICS IO results, I see that I did indeed avoid the worktable and its 61,042 logical reads. But I also see that I have to pay a severe penalty: the number of logical reads for SalesOrderDetail has gone up by 162,901 to 1,510,821, and on SalesOrderHeader the reads have increased by 77,795 to 879,402. So in order to avoid just over 61 thousand logical reads, we had to take a hit of over 240 thousand logical reads on other tables. Does not sound like a winning proposal to me.

The run times of the second query are a bit strange. For reasons beyond the scope of this post the first query does not use parallelism; the second does go parallel, which improves the elapsed time quite a bit, at the cost of much higher CPU usage. In order to get a better comparison I added a MAXDOP hint to force both plans to be serial. This does not change the logical reads, but it does change the execution times: the second plan now uses 125 seconds (both elapsed and CPU time), approximately a 6% increase over the query that simply uses TOP.
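
For reference, such a hint is simply appended at the end of the statement. The sketch below shows the placement on the short query from the start of this post, purely to illustrate where it goes; the same OPTION clause is added at the end of both longer queries above.

SELECT   TOP(2) PERCENT CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY SalesOrderID
OPTION (MAXDOP 1);   -- force a serial plan for this statement only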

Luckily, there is yet another way to rewrite the query – a method that is actually almost the same as the execution plan chosen for the original query, except that we force SQL Server to replace the implicit temporary work area of the spool with an explicit temporary table. So I rewrite the original single query as two queries. I first execute the original query without any TOP clause and store the results in a temporary table; I then return the TOP(..) PERCENT rows from that temporary table using the original rewrite – which is now efficient because it is no longer expensive to fetch the same rows multiple times.

SELECT         soh.SalesOrderID, soh.CustomerID, soh.OrderDate,
               soh.SubTotal, Next1KSales.Total
INTO           #TheResults
FROM           Sales.SalesOrderHeader AS soh
CROSS APPLY
   (SELECT     SUM(sod2.LineTotal) AS Total
    FROM       Sales.SalesOrderHeader AS soh2
    INNER JOIN Sales.SalesOrderDetail AS sod2
          ON   sod2.SalesOrderID = soh2.SalesOrderID
    WHERE      soh2.SalesOrderID BETWEEN soh.SalesOrderID
                                 AND soh.SalesOrderID + 1000) AS Next1KSales
WHERE          Next1KSales.Total > 500000
OPTION (MAXDOP 1);

SELECT TOP(CAST(CEILING(10.0 * (SELECT COUNT_BIG(*) FROM #TheResults) / 100) AS int))
         CustomerID, OrderDate, SubTotal, Total
FROM     #TheResults
ORDER BY SalesOrderID;

DROP TABLE #TheResults;

The query above produces the same results as all previous queries, but now in the fastest possible way. Of course the first query still has to do the work of evaluating the entire query, which as before requires 1,347,920 reads on SalesOrderDetail and 801,607 reads on SalesOrderHeader; these cannot be avoided. (Unless we put in the work to completely rewrite the query, which is probably quite possible in this case but would defeat the point of this blog post.) There are, however, no logical reads for inserting into the temporary table, because in this case the temporary table is a heap, so SQL Server does not have to search for the right location when inserting a row.

The second query uses just 374 logical reads to read the data from this temporary table twice. There is a hidden cost, though, that you can see in the execution plan: because the data in a heap is not sorted, the data has to be read completely twice – the first time to count the rows, and the second time to feed them all into a Sort operator. Sort operators typically do not cause I/O activity because the sort is done in memory; however, they do cost a lot of CPU, and a lot of memory.

The query above forces serial execution for the first query, for a better comparison to the other versions. It uses approximately 118 seconds – similar to the query with TOP, which is not very surprising because the overhead of the spool operator, though large in relation to the number of rows it has to store, is insignificant in relation to the total amount of work done for the rest of the query. When running both queries multiple times and averaging the elapsed and CPU times, I expect the query with TOP to be a few tenths of a second slower than the version that inserts into the temporary table. When I allow this query to go parallel, it runs much faster on my system (only 48 seconds elapsed), at the cost of burning way more CPU (over 350 seconds CPU time on my 8-core test system). I would allow the parallelism if this query is critical or if my server has sufficient unused CPU cycles; on a heavily used system that is constantly under CPU pressure I would force the serial plan.

The second query, despite the sort, takes almost no time: just a quarter of a second elapsed time, and less than a tenth of a second CPU time.

It always depends

At the start of this post I promised that I would explain why I changed the ORDER BY from the one I used in the previous post. The reason becomes obvious as soon as you run the query below:

SELECT   TOP(2) PERCENT CustomerID, OrderDate, SubTotal
FROM     Sales.SalesOrderHeader
ORDER BY CustomerID;

The SET STATISTICS IO results do still include a worktable, but in this case the number of logical reads in that table is zero. Is this some hidden magical Table Spool optimization at work, or is there a different explanation? To answer that question we will have to look at the execution plan for this query:

(Screenshot: execution plan in which the Table Spool is replaced by a Sort operator.)

As you can see, there is no Table Spool at all. It is replaced by a Sort operator. There are some surprising similarities between the Table Spool (Eager Spool) in the first plan and the Sort operator in this plan. Both consume their entire input before producing the first row of output. As a result, they both know how many rows there are, and they can (and will) both pass this information to the Top operator so that it can compute the number of rows for the specified percentage. And this is why no additional Table Spool is needed in the plan above: all rows are effectively already “spooled” in the Sort.

There are also two important differences between these operators; the first being that the Table Spool preserves the order of the rows and the Sort operator (obviously) doesn’t. But in the context of this blog I am mostly interested in the second difference: the Table Spool stores the data it reads in a worktable in tempdb, but the Sort stores it in memory (using tempdb only when the allocated memory is insufficient). And that is why we don’t see any I/O on the worktable in the last example, and why this query, despite the much higher estimated cost, in reality runs a lot faster than the first example.

Conclusion

For a query that uses TOP with the PERCENT option, the optimizer needs a way to know the total number of rows before it can return the correct subset. This requires evaluating the entire query as if there were no TOP, storing the results, and then returning rows from that stored result set. If a Sort operator is needed to satisfy the ORDER BY of the query, then this Sort operator gives us the facility to store and count the rows “for free”; in all other cases the optimizer will inject a Table Spool operator into the plan for this purpose.

However, the Table Spool is in practice far more expensive than what the optimizer thinks it costs. In many cases, it would actually be more efficient to first count the number of rows in the result without saving them, and then run the query again with a Top operator to limit the number of rows returned. Unfortunately, because the optimizer does not fully grasp the actual cost of a Table Spool, this optimization can only be achieved by rewriting the query. In this post I have shown two patterns for this: one that is relatively easy to understand but requires duplicating the code, and a second pattern that avoids the duplication by using a much harder to understand construction.

Sometimes, especially when returning a large percentage of the results from an expensive query that cannot be simplified when only the count of rows is requested, the rewrite pattern I provided backfires. In such cases an alternative form of rewrite is possible where the results of the entire query are first stored in a temporary table, and then the first form of the TOP (…) PERCENT rewrite is used with that temporary table as the source. Though this form is still somewhat faster than just using TOP (…) PERCENT, it does not save quite as much as the other rewrites do in other conditions, so this form should only be used when you have proven that you need it.

And obviously, performance is not the only relevant metric for your code, and often far from the most important metric. This is clearly a case where there is a trade-off between performance and maintainability. If you need to use TOP (…) PERCENT on a resultset that will never measure more than just a few thousand rows, then the performance difference is probably not enough to warrant the rewrite. But if you find that one of your key queries is using TOP (…) PERCENT on a huge resultset and the query does not perform as well as you would like, then hopefully the content of this blog post has given you the tools to tackle that problem.

(And please, for the sake of the mental sanity of whoever gets to inherit your code – if you do decide to use any of the rewrite patterns I provided, then please add lots of comments to your code, to explain exactly what you do and why you are doing it.)
