After making my post on cursor optimization, I received some comments that prompted me to do some further investigation. Adam Machanic wrote in my blog’s comments that using SQLCLR to loop over a SqlDataReader would be much faster than any T-SQL based cursor. And Erland Sommarskog wrote in a Usenet thread that he has some colleagues who think that a “poor man’s cursor” is always better than a real cursor. So I decided to give these options a try and see what comes out in a real test. I simply reused the test cases I had already used for testing the various cursor options, but with the code adapted to use SQLCLR or a cursor-less iteration.
The poor man’s cursor
I don’t think that “poor man’s cursor” is an official phrase, but what the hay – if we all start using it, we can make it official :-). In case you want to know what a term means before using it: “poor man’s cursor” refers to any method of iterating over the rows in the result set of a query and processing them one by one without using the DECLARE CURSOR, OPEN, FETCH, CLOSE, and DEALLOCATE statements that were added to T-SQL for exactly this purpose.
Why would you want to do that, you may ask? Well, I think that the most common reason is that programmers have heard that cursors are generally bad for performance, but fail to understand that the performance impact is not caused by the cursor itself, but by the fact that iterating over a result set reduces the options available to the query optimizer and negates all the work the development team in Redmond has done to optimize SQL Server for set-based operations. So they think that the cursor itself is to blame, and try to code around it without moving from their algorithmic, iteration-based approach to a declarative, set-based approach.
Usenet newsgroups and web forums full of simple one-liners such as “cursors are evil”, the many people claiming that cursors incur a heavy overhead, and even some otherwise respectable websites listing WHILE loops first in their lists of cursor alternatives have all done their fair share to popularize the idea that you can improve cursor performance by simply replacing the cursor with a different iteration mechanism. So, let’s find out if there is any truth to this claim.
Reading data
I started with the fastest of all cursor options, the one using a local, forward-only, static, read-only cursor with an ORDER BY matching the clustered index. I ripped out all cursor-related commands and replaced them with the appropriate SELECT TOP(1) commands to read and process one row at a time, and ended up with this code:
-- Keep track of execution time
DECLARE @start datetime;
SET @start = CURRENT_TIMESTAMP;

-- Declare and initialize variables for loop
DECLARE @SalesOrderID int,
        @SalesOrderDetailID int,
        @OrderQty smallint,
        @ProductID int,
        @LineTotal numeric(38,6),
        @SubTotal numeric(38,6);
SET @SubTotal = 0;

-- Read first row to start loop
SELECT TOP (1) @SalesOrderID = SalesOrderID,
               @SalesOrderDetailID = SalesOrderDetailID,
               @OrderQty = OrderQty,
               @ProductID = ProductID,
               @LineTotal = LineTotal
FROM   Sales.SalesOrderDetail
ORDER BY SalesOrderID, SalesOrderDetailID;

-- Process all rows
WHILE @@ROWCOUNT > 0
BEGIN;
  -- Accumulate total
  SET @SubTotal = @SubTotal + @LineTotal;

  -- Read next row
  SELECT TOP (1) @SalesOrderID = SalesOrderID,
                 @SalesOrderDetailID = SalesOrderDetailID,
                 @OrderQty = OrderQty,
                 @ProductID = ProductID,
                 @LineTotal = LineTotal
  FROM   Sales.SalesOrderDetail
  WHERE  SalesOrderID > @SalesOrderID
     OR (    SalesOrderID = @SalesOrderID
         AND SalesOrderDetailID > @SalesOrderDetailID)
  ORDER BY SalesOrderID, SalesOrderDetailID;
END;

-- Display result and duration
SELECT @SubTotal;
SELECT DATEDIFF(ms, @start, CURRENT_TIMESTAMP);
go
I ran this code five times in a row and calculated the average execution time: 3166 milliseconds. I then re-ran the cursor code five times (I didn’t want to use the old measurements, as I was unsure if I had the same applications active – and having a different load on my machine would surely influence results); this code took 3265 milliseconds. So the first round goes to the poor man’s cursor, which beat the “real” cursor by three percent. I must add that I later ran another test, as part of research for a future blog post, where the results were reversed and the real cursor beat the poor man’s cursor by a small margin.
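For readers who do not have the earlier post at hand: the cursor version I re-ran follows the general pattern sketched below. This sketch is reconstructed from the cursor options described above rather than copied from the earlier post, and the cursor name is mine, so treat it as an illustration rather than the exact test code.
-- Sketch of the "real" cursor version (reconstructed; see the earlier post for the exact code)
DECLARE @SalesOrderID int,
        @SalesOrderDetailID int,
        @OrderQty smallint,
        @ProductID int,
        @LineTotal numeric(38,6),
        @SubTotal numeric(38,6);
SET @SubTotal = 0;
DECLARE SalesOrderDetailCursor CURSOR LOCAL FORWARD_ONLY STATIC READ_ONLY
FOR SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, LineTotal
    FROM   Sales.SalesOrderDetail
    ORDER BY SalesOrderID, SalesOrderDetailID;
OPEN SalesOrderDetailCursor;
-- Read first row to start loop
FETCH NEXT FROM SalesOrderDetailCursor
INTO @SalesOrderID, @SalesOrderDetailID, @OrderQty, @ProductID, @LineTotal;
WHILE @@FETCH_STATUS = 0
BEGIN;
  -- Accumulate total
  SET @SubTotal = @SubTotal + @LineTotal;
  -- Read next row
  FETCH NEXT FROM SalesOrderDetailCursor
  INTO @SalesOrderID, @SalesOrderDetailID, @OrderQty, @ProductID, @LineTotal;
END;
CLOSE SalesOrderDetailCursor;
DEALLOCATE SalesOrderDetailCursor;
SELECT @SubTotal;
Because a STATIC cursor materializes its result set (in tempdb) when it is opened, the sort is performed only once; the poor man’s cursor, by contrast, runs a separate TOP (1) query for every single row, which is why its performance depends so heavily on an index that supports the required processing order.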
Of course, real life is not always so nice as to throw us only problems that require ordering the data by the clustered index key. So my next step was to investigate what happens to the comparison if the problem requires the data to be read in an order that cannot be served by the clustered index, but only by a nonclustered index. Remember that I had a similar test case in the cursor option comparison, so I was again able to reuse the existing cursor code for ordering by ProductID. For the poor man’s cursor version, this involved changing the ORDER BY of both queries, but I also had to change the WHERE clause in the second one: to make sure that it filters out all rows already processed, I have to include rows with a higher ProductID as well as rows with an equal ProductID and a higher primary key value, and for this to work I also have to add the primary key columns as tie-breakers to the ORDER BY clause. I won’t post the full code, as most of it remains the same, but the “Read next row” query in the loop now reads like this:
-- Read next row
SELECT TOP (1) @SalesOrderID = SalesOrderID,
               @SalesOrderDetailID = SalesOrderDetailID,
               @OrderQty = OrderQty,
               @ProductID = ProductID,
               @LineTotal = LineTotal
FROM   Sales.SalesOrderDetail
WHERE  ProductID > @ProductID
   OR (    ProductID = @ProductID
       AND SalesOrderID > @SalesOrderID)
   OR (    ProductID = @ProductID
       AND SalesOrderID = @SalesOrderID
       AND SalesOrderDetailID > @SalesOrderDetailID)
ORDER BY ProductID, SalesOrderID, SalesOrderDetailID;
The impact on performance is dramatic, to say the least. With this slight modification in the order in which rows have to be processed, the average execution time for five consecutive test runs rises to 5822 ms. The cursor version gets slower as well as a result of the new sort order, but by far less – it still takes only 3377 ms, so the poor man’s cursor is now worse by over seventy percent!
For the final test, I checked the effects of ordering by a column that’s not indexed at all. I did this in the original cursor test by ordering on LineTotal, so I’ll do the same here. Since LineTotal is, like ProductID in the previous test case, not constrained to be unique, the same considerations apply. That means that I can reuse the code of the version that ordered by ProductID, except of course that I have to change each occurrence of ProductID to LineTotal.
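After that substitution, the “Read next row” query ends up looking like this (a direct substitution in the query shown above; the rest of the code is unchanged):
-- Read next row (ProductID replaced by LineTotal throughout)
SELECT TOP (1) @SalesOrderID = SalesOrderID,
               @SalesOrderDetailID = SalesOrderDetailID,
               @OrderQty = OrderQty,
               @ProductID = ProductID,
               @LineTotal = LineTotal
FROM   Sales.SalesOrderDetail
WHERE  LineTotal > @LineTotal
   OR (    LineTotal = @LineTotal
       AND SalesOrderID > @SalesOrderID)
   OR (    LineTotal = @LineTotal
       AND SalesOrderID = @SalesOrderID
       AND SalesOrderDetailID > @SalesOrderDetailID)
ORDER BY LineTotal, SalesOrderID, SalesOrderDetailID;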
This change really wrecked performance for the poor man’s cursor. I wanted to sit it out, but decided to kill the test after one and a half hours. I then realized that the LineTotal column I was using is a non-persisted computed column, which adds an enormous amount of overhead: for each of the 121,317 iterations, SQL Server has to recalculate the LineTotal of each of the 121,317 rows – a total of almost 15 billion calculations! So I changed this test case to sort on OrderQty instead, then left the computer to execute the test run overnight. The next day, the duration was listed as a whopping 27,859,593 ms (seven and three quarter hours!) – just a tad slower than the real cursor, which clocked in at an average execution time of 3430 ms when sorting on the computed LineTotal column and 3352 ms when sorting on OrderQty.
Modifying data
Naturally, I wanted to test the performance of a poor man’s cursor in a data modification scenario as well. I didn’t really expect any surprises; after all, I already knew that the fastest cursor solution uses the exact same cursor options as when reading data. I’ll spare you the poor man’s cursor code this time, since it’s based on the cursor code published in my previous blog posts, with the same modifications as above. Since this update scenario happens to be based on ordering by the clustered index key, I expected the poor man’s cursor to be just a tad faster, just as in the reading scenario.
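To give an impression without repeating the whole listing: inside the loop, the accumulation of @SubTotal is replaced by an update of the row that was just read, roughly like this (a sketch; I am assuming the same OrderQty increment that the CLR version further down applies):
-- Update the row that was just read (sketch)
UPDATE Sales.SalesOrderDetail
SET    OrderQty = @OrderQty + 1
WHERE  SalesOrderID = @SalesOrderID
AND    SalesOrderDetailID = @SalesOrderDetailID;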
After running the tests, I was surprised. The real cursor version took 5009 ms on average; the poor man’s cursor achieved the same task in just 4722 ms – a speed gain of over five percent. The gain was so much more than I expected that I actually repeated the tests, but with the same results. I must admit that I have no idea why the exact same cursor, transformed into the exact same poor man’s cursor, results in a bigger speed gain when the rows are then updated than when they are merely used in a computation.
I did not test performance of the poor man’s cursor in a scenario where the rows have to be processed in a different order than the clustered index key. Based on the results of the tests for reading data, I expect performance to go down the drain in a very similar way.
Conclusion
People claiming that a poor man’s cursor performs better than a real cursor are mostly wrong. When the order in which rows have to be processed does not match the clustered index key, a properly optimized cursor will run rings around the poor man’s cursor.
The only exception is if the required ordering happens to coincide with the clustered index key. In those cases, a poor man’s cursor may sometimes beat a real cursor by a few percent, although there are other cases where the cursor still wins (also by a few percent). Even in the cases where the poor man’s cursor does win, the margin is so small that I’d recommend just using real cursors, with the appropriate options for maximum performance (that is, LOCAL, FORWARD_ONLY, STATIC, and READ_ONLY) in all cases.
Except of course in the 99.9% of all cases where a set-based solution beats the cursor-based one :-).
Using the CLR
When you use CLR code to process data provided by SQL Server, iterating over rows to process them one at a time is the standard approach – after all, there is no support for set-based operations in C#, VB.Net, or any other .Net-enabled programming language. As such, the claim made by Adam Machanic has valid grounds: a language that has no option but to iterate over the rows and process them one at a time had better be optimized for this kind of processing!
Reading data
The CLR version of the code to calculate the sum of all LineTotal values is about as simple as it gets:
[Microsoft.SqlServer.Server.SqlProcedure]
public static SqlInt32 ReadData([SqlFacet(Precision = 38, Scale = 6)] out SqlDecimal Total)
{
    // Initialize subtotal
    decimal SubTotal = 0;

    // Set up connection and query
    SqlConnection conn = new SqlConnection("context connection=true");
    conn.Open();
    SqlCommand cmd = conn.CreateCommand();
    cmd.CommandText = "SELECT SalesOrderID, SalesOrderDetailID, " +
                      "       OrderQty, ProductID, LineTotal " +
                      "FROM Sales.SalesOrderDetail " +
                      "ORDER BY SalesOrderID, SalesOrderDetailID;";
                    //"ORDER BY ProductID;";
                    //"ORDER BY LineTotal;";
                    //"ORDER BY OrderQty;";
    cmd.CommandType = CommandType.Text;

    // Process rows from reader; accumulate total
    SqlDataReader rdr = cmd.ExecuteReader();
    while (rdr.Read())
    {
        SubTotal += (decimal)rdr[4];
    }

    // Clean up and return result
    conn.Dispose();
    Total = new SqlDecimal(SubTotal);
    return 0;
}
Note that I did not use dynamic SQL or other fancy tricks to make the processing order variable; I just commented out one line, uncommented another, and recompiled. This avoids dangerous dynamic SQL and complex CASE expressions in the ORDER BY, plus it better mimics how a real application would work – cursors and other iterative solutions are often used when developers (think they) need to process rows in a certain order, so the ORDER BY would usually be fixed.
The results of running the tests proved that Adam got it completely right – this CLR implementation does indeed run rings around even the fastest of the cursor solutions, taking on average only 1060 milliseconds when ordering by the clustered index, 1072 milliseconds when ordering by the nonclustered index, 1132 milliseconds when ordering by the computed non-indexed column, and 1050 milliseconds when ordering by the non-computed non-indexed column. Three of these are so close together that the differences are probably within the statistical margin of error, and they should be considered the same. The 70 extra milliseconds for ordering by a computed column are obviously the time taken to compute the value for each row in order to do the sorting.
I don’t understand why ordering by the clustered index key doesn’t result in some additional performance gain, as I expected this one to differ from the others by one sort step in the execution plan. This was another test case I repeated a few more times to make sure that I didn’t accidentally mess things up. If I execute the cmd.CommandText as a separate query in SQL Server Management Studio, I do get a significantly cheaper execution plan when ordering by the clustered index key, so I guess that this will just have to be filed as one of the many things about SQL Server that I don’t understand.
Modifying data
The CLR starts showing a completely different face when you have to modify the data you read from a cursor. The main problem is that you can’t use a SqlDataReader anymore, since it blocks the context connection from being used for any other queries. You could choose to open a separate connection, but that has the disadvantage that the updates are performed in a separate transaction context, so you run into a huge blocking risk, and a rollback of the main transaction would not roll back the changes made from this procedure.
So that leaves me with only one other option: use the SqlDataAdapter.Fill method to copy the entire result set of the query into a DataSet, then loop over it and process its rows one by one. This results in the CLR version doing a lot more work and using a significant amount of memory. Because we no longer update each row right after reading it, but instead read all rows first and only then update them, there is also an increased chance that a row is no longer in the data cache and hence has to be read from disk a second time for the update, effectively doubling the amount of physical I/O (though this did not happen in my case).
[Microsoft.SqlServer.Server.SqlProcedure]
public static SqlInt32 ModifyData()
{
    // Open connection (context connection since we're called in-process)
    SqlConnection conn = new SqlConnection("context connection=true");
    conn.Open();

    // Prepare commands to fetch rows and to update
    String SelCmd = "SELECT SalesOrderID, SalesOrderDetailID, " +
                    "       OrderQty, ProductID, LineTotal " +
                    "FROM Sales.SalesOrderDetail " +
                    "ORDER BY SalesOrderID, SalesOrderDetailID;";
    SqlDataAdapter da = new SqlDataAdapter(SelCmd, conn);

    String UpdCmd = "UPDATE Sales.SalesOrderDetail " +
                    "SET OrderQty = @OrderQty " +
                    "WHERE SalesOrderID = @SalesOrderID " +
                    "AND SalesOrderDetailID = @SalesOrderDetailID;";
    SqlCommand upd = new SqlCommand(UpdCmd, conn);
    upd.Parameters.Add("@SalesOrderID", SqlDbType.Int);
    upd.Parameters.Add("@SalesOrderDetailID", SqlDbType.Int);
    upd.Parameters.Add("@OrderQty", SqlDbType.SmallInt);

    // Read rows to process; copy them into a DataSet
    DataSet ds = new DataSet();
    da.Fill(ds);

    // Process rows
    foreach (DataRow dr in ds.Tables[0].Rows)
    {
        Int32 SalesOrderID = (Int32)dr[0];
        Int32 SalesOrderDetailID = (Int32)dr[1];
        Int16 OrderQty = (Int16)dr[2];

        // Set parameters; perform update
        upd.Parameters[0].Value = SalesOrderID;
        upd.Parameters[1].Value = SalesOrderDetailID;
        upd.Parameters[2].Value = OrderQty + 1;
        upd.ExecuteNonQuery();
    }

    // Cleanup and return
    conn.Dispose();
    return 0;
}
After compiling and deploying the code above, I once more ran five tests. The average execution time for this version was 12,215 milliseconds, almost 150% more than the cursor version. My guess is that this huge increase is not caused by the update command itself, but by the requirement to pre-load the data into a DataSet and then iterate over that. I did not test it, but I expect a similar problem if a cursor requires reading some additional data, based on the data read in the cursor – this, too, would require the CLR version to employ a DataSet instead of simply looping over a SqlDataReader.
Conclusion
Adam’s suggestion to use the CLR makes sense, but only for cases where no additional data access, either reading or modifying, is required while processing the rows in the cursor. As soon as such additional data access becomes a requirement, the CLR version has to switch from a SqlDataReader to SqlDataAdapter.Fill, and performance suffers horribly.
5 Comments
Yes, DataSets are dogs :(… I actually wrote a lightweight replacement for a client a while back that used a SqlDataReader and loaded the data into arrays internally and exposed a small subset of the properties and methods that a real DataSet / DataTable / DataRow does (i.e., the only ones that most people ever use <g>). Extensive perf testing showed that the custom lightweight version was about 2x as fast to fill, and no slower to retrieve data from, than the Framework’s version. The solution was targeted for .NET 1.1 and I’m not sure if that figure would be valid today, but if you’re really really bored you might try rolling your own to see what happens…
By the way, please vote here for MARS in SQLCLR, which would solve this problem:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=312087
For an extra bit of performance in the SQLCLR cursor you can use CommandBehavior.SequentialAccess in ExecuteReader(). This also gives much nicer streaming semantics for large data columns.
But how does the CLR read-only-once option compare to the set-based solution?
Adam: You are severely overestimating my abilities in .Net coding if you think that I am able to roll my own DataSet replacement. However, if you send me the code (or the bits), I’ll gladly give my test code another run or two.
I voted 4 for the MARS suggestion. I would have voted 5 if I had the idea that SQLCLR is or should be widely used, but I think it’ll always remain a useful option to have as the ideal solution for a small percentage of problems.
Remus: Good point. I tested with and without this option, and saw no statistically relevant difference (only 2 ms difference after 5 runs for each version).
R.S.Reitz: The set-based solution can start late, pause halfway for a cup of tea, and will still arrive early enough to have dinner ready long before any of the iterative versions arrive. It takes a mere 188 ms.
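For reference, that set-based solution is essentially a single aggregate query, something along these lines:
SELECT SUM(LineTotal) AS SubTotal
FROM   Sales.SalesOrderDetail;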
Nice post mate.
Cheers,
Venkatesan Prabu .J
http://venkattechnicalblog.blogspot.com/