There are two ways to test how your queries behave on huge amounts of data. The simple option is to actually use them on huge amounts of data – but where do you get that if you have no access to the production database, and how do you store it if you happen not to have a multi-terabyte storage array sitting in your basement? So here’s the second best option: you cheat.
Luckily, SQL Server has a feature that allows you to do just that. I must warn you that this feature is undocumented and unsupported. You should never under any circumstances use this feature on a production server, unless explicitly directed to by qualified Microsoft staff. Using it, sparingly, on a test box is okay. But as an undocumented feature, there is no guarantee that it will always work, or that it will continue to work in future versions.
With that disclaimer out of the way, let’s take a look at the “cheat” option. As you probably know, the Query Optimizer relies on statistics for its decisions. If those statistics are wrong, the decisions will probably be wrong. Here is where we find the leverage to cheat: if we can force SQL Server to use statistics that are representative of the production database rather than the test database, we will get the execution plan it would generate on production. The actual performance will probably still be lots faster, but at least we can look at the plan and use our understanding of execution plans to check that we get scans where we want scans, seeks where we want seeks, and all the right types of joins.
The undocumented feature that we can use here actually has a very simple and easy-to-remember syntax. You just run an UPDATE STATISTICS command, adding the options “WITH ROWCOUNT=xxxx, PAGECOUNT=xxxx” to force SQL Server to store the numbers you mention as the rowcount and pagecount. Clear the plan cache, then test your queries and check how they behave with the simulated number of rows and pages. I could give a much longer description and examples of the usage, but others have already done so – see for instance this post from the horse’s mouth (aka the Query Optimization Team), this post from Benjamin Nevarez, or if you are able to read German this excellent post from Uwe Ricken.
… and how to get rid of them
But what all those blog posts fail to mention is what happens later. Of course, if you just create a small test table, slap in a few sample rows, fake big statistics, check the execution plan and then drop the table, that is not an issue. But what if you have a test database that contains a lot of carefully crafted test data, and that is used for other tests as well? You may not want to run all the tests with those artificially bumped rowcounts!
In my case, I ran into this in the AdventureWorks sample database, my loyal companion for presentations and demos. I will present a pre-conference seminar on Execution Plans in Exeter (March 21), in Copenhagen (March 28), and in Zagreb (June 2). For my preparations, I wanted to force a hash spill warning, so I artificially lowered the rowcount for the SalesOrderDetail table to just 200. That worked just fine – this nifty feature can be used to mimic every rowcount, not just the big numbers.
Because I use this sample database for other purposes as well, I wanted to immediately reset the statistics to their normal value. I figured rebuilding all the statistics on the table with the FULLSCAN option would do the trick. Right? Wrong! The UPDATE STATISTICS did run, it took a while to scan all rows in the table (as requested) – but then it still retained the rowcount and pagecount values that I had forced earlier! And there is no visible indication at all – executing the UDPATE STATISTICS … WITH FULLSCAN statement simply runs for a while, then reports success.
It was actually mere coincidence that I later did some other tests on the same table, and just happened to notice that the estimated rowcount for this table was still 200. It then took me at least an hour of searching the internet and trying many different options (including using zero or a blank value for the rowcount and pagecount parameters, combining different other options of the UPDATE STATISTICS command, clearing or not clearing the plan cache, and a few other things I don’t even remember. None of them worked. I could of course count the actual number of rows and pages and use that, and that would have worked because my AdventureWorks database never changes – but for a test database where the data can actually change over time, this options would not work.
In the end, I finally found one a method that works. But it is really a sledgehammer approach, and I prefer not heaving to do this on large tables on a regular basis: ALTER TABLE Sales.SalesOrderDetail REBUILD. After running that statement, I found that the statistics on the SalesOrderDetail table had finally reverted to their normal behavior. Until the next time I need to fake a different number.
On a development server, using the WITH ROWCOUNT and WITH PAGECOUNT options of the UPDATE STATISTICS is a fine method to simulate large numbers of rows, or to simulate the effect of wrong statistics. But unfortunately, the numbers given stick for far longer that I like; this is not documented anywhere, and not easy to undo. Rebuilding the table and indexes appears to be the only solution.
If you have found other ways to return from fake row- and pagecounts to the normal statistics behavior, post a comment and share your knowledge!
If you want to restore the real values for rows and pages you can use the DBCC UPDATEUSAGE statement. Actually, DBCC UPDATEUSAGE can be used to correct pages and row count inaccuracies in the catalog views. For example you can try
DBCC UPDATEUSAGE(AdventureWorks2012, ‘dbo.Address’) WITH COUNT_ROWS
Actually the ROWCOUNT and PAGECOUNT options of the UPDATE STATISTICS statement do not change any statistics object but the number of rows and pages of a table. You can see this information with the sys.dm_db_partition_stats DMV.
Thank you for posting that correction, and for providing the better way to reset the counts, Ben!
Thomas Kesjer and I discussed statistics workarounds a few years back. I recall that he noted that when the true statistics (in stats_stream, formerly statblob) were sampled at 100%, the ROWCOUNT and PAGEGOUNT fake values were ignored, but not if sampling were less than 100%. TK might comment on this, or you could look into it as my memory is not as good as it used to be. I mentioned to Adam Mechanic a while back that I could decode the stats_stream/statblob binary, and in fact, write my artificial statistics histogram. I did this in SQL Server version 2000 and posted some details on SQL-Server-Performance. Of Microsoft then added a checksum value to the binary in version 2005.
I have asked various Microsoft people several times if they could provide the specific checksum function so I could continue my research and would strongly advice against apply artificial statistics to production systems. This was politely (or not) declined.
Adam took an interest in this, but decided it was easier to make an actual table(s) with similar structure and desired distributions, generate statistics on this, then apply another table.