This post is for T-SQL Tuesday 178, hosted by Deepthi Goguri. She invites us to write about a recent technical issue we resolved.
So let’s talk about the time I discovered that information in the Query Store might not be what it appears to be.
Unexpected delays
I was hired by a customer who had a very annoying issue with the daily data load of their data warehouse. The volume of data to be loaded was high, and they were already struggling to finish the load before business opened. But that was not their biggest issue. The biggest problem, the real pain point they hired me for, was that at unpredictable moments the load would run much longer than normal, pushing it well into business hours. They wanted me to find out what caused those irregular delays, and to find a way to stop them from happening.
Query Store
The Query Store is a great feature for situations like this, where you have to figure out what happened in the past. So during the intake and contract talks, I had already asked them to please enable the Query Store, which they did. By the time I started the actual work, there was already over a week of data in the Query Store. And in that week, they had indeed experienced two runs that were several hours slower than normal.
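(For anyone who does not have it enabled yet: turning it on is a single statement, roughly like the sketch below. The database name, storage size, and capture mode are placeholders to adapt to your own environment.)
ALTER DATABASE DWH                        -- placeholder database name
SET QUERY_STORE = ON
    (OPERATION_MODE      = READ_WRITE,
     MAX_STORAGE_SIZE_MB = 1024,          -- size it for your own workload
     QUERY_CAPTURE_MODE  = AUTO);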
I was very disappointed when I saw nothing of interest at all in the Query Store report “Queries with High Variance”, nor in “Top Resource Consuming Queries” when I sorted by elapsed time. Or rather, I saw lots of interesting things, but none that were even remotely related to the problem I was hired to solve.
I dove in deeper. They gave me the detailed log report from their scheduling tool, which showed the start and end time of each stored procedure in the daily load. Indeed, most stored procedures finished in seconds to minutes, but one of them took over two and a half hours. Whereas on other dates, that same stored procedure, loading the same amount of data, needed less than ten minutes.
I looked at the stored procedure, which contained multiple statements. I then wrote some queries on the various sys.query_store_* DMVs, to find data about executions of each of those statements. And that is where it got really interesting!
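For anyone who wants to do something similar: the sketch below is roughly the shape of the query I used. The procedure name is a placeholder; the views are the standard Query Store catalog views, and the result lists, for each statement in one stored procedure, the recorded execution intervals, when they started, and the average duration.
SELECT      qt.query_sql_text,
            rs.first_execution_time,
            rs.last_execution_time,
            rs.count_executions,
            rs.avg_duration / 1000000.0 AS avg_duration_sec   -- avg_duration is in microseconds
FROM        sys.query_store_query          AS q
INNER JOIN  sys.query_store_query_text     AS qt
   ON       qt.query_text_id = q.query_text_id
INNER JOIN  sys.query_store_plan           AS p
   ON       p.query_id       = q.query_id
INNER JOIN  sys.query_store_runtime_stats  AS rs
   ON       rs.plan_id       = p.plan_id
WHERE       q.object_id = OBJECT_ID(N'dbo.LoadStageTable')    -- hypothetical procedure name
ORDER BY    rs.first_execution_time;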
Mismatched times
The log report from the scheduling tool showed that this specific stored procedure had started at 0:20 (12:20 PM for my American friends), and then finished at 2:45. A run time of well over two hours, which is way too long for the load and manipulation steps of this 600-million-row table.
But in the Query Store, I saw that the very first query in the stored procedure had only started executing at 2:35! And after that, the three other steps followed, with the last step completing at 2:45. That last bit matched the log information from the scheduler. But the start time didn’t! Why did their scheduler record a start time of 0:20, when the Query Store recorded the first query starting at 2:35? What happened during those two and a quarter hours?
The reason
I won’t bore you with all the theories I explored, tested, and had to reject. I won’t tell you about the many hours I spent poring over the internal code of their scheduling tool, trying to find what it does in between logging the start of the stored procedure and actually executing it (narrator: it did absolutely nothing in between those two events, at least nothing that could ever last more than a few microseconds). I’ll just cut to the chase and tell you the reason.
One of the steps in their data load is to copy imported data from the load table to the stage table, combined with a few standard data transformations. The query that their tools generate for this step uses SELECT DISTINCT. Why? No clue. Safety first, I assume. But the data in the load table is guaranteed to never have duplicates, so this DISTINCT is not needed. And in this case, it hurt them in more ways than you might think!
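To give an idea of the pattern (all table and column names below are made up, and the real statement listed several dozen columns and more transformations), the generated statement looked roughly like this:
INSERT INTO stage.Sales (SaleID, SaleDate, Amount)      -- the real column list was much longer
SELECT DISTINCT SaleID,
                CAST(SaleDate AS date),                 -- a few standard transformations
                Amount * ExchangeRate
FROM            load.Sales;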
Even for a simple INSERT … SELECT, the Query Optimizer wants to estimate the number of rows. Which is easy … until you add DISTINCT. Now, the Query Optimizer needs to assess how many duplicates there are in each column. Which means that it needs statistics on each column. And indeed, when I checked the load table, it had auto-generated statistics on each and every one of its columns. Of which there were a lot!
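This is easy to check for yourself. A query along these lines (the table name is made up) lists the auto-created column statistics on a table:
SELECT      s.name          AS stats_name,
            s.auto_created,
            c.name          AS column_name
FROM        sys.stats          AS s
INNER JOIN  sys.stats_columns  AS sc
   ON       sc.object_id = s.object_id
  AND       sc.stats_id  = s.stats_id
INNER JOIN  sys.columns        AS c
   ON       c.object_id  = sc.object_id
  AND       c.column_id  = sc.column_id
WHERE       s.object_id  = OBJECT_ID(N'load.Sales')     -- hypothetical table name
  AND       s.auto_created = 1;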
Another “fun fact” is that this customer had, somewhere in the past, run into a performance issue that turned out to be caused by non-representative statistics. They found that forcing an update of those statistics with the FULLSCAN option solved their issue. And so they made the mistake that many companies and people make: they assumed that this would be a fix for everything. Ever since then, their policy was to set FULLSCAN as the persisted sample rate for all statistics.
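In practice, their policy amounted to running something like the statement below for every statistic (the names are placeholders; the option is available on recent versions of SQL Server). With PERSIST_SAMPLE_PERCENT = ON, every later automatic update of that statistic keeps using a full scan.
UPDATE STATISTICS load.Sales [_WA_Sys_00000003_0EA330E9]   -- placeholder statistic name
WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON;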
Order of events
The next part is a bit technical and under the covers, but it is needed to understand the weird phenomenon I had seen. So bear with me.
When SQL Server is asked to execute a stored procedure, it first generates execution plans for each and every statement in that stored procedure. For generating those execution plans, it uses statistics. If any of those statistics are considered out of date, then, assuming default database settings, it will first update the outdated statistics, and then generate the execution plan using the new statistics.
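If you want to see which statistics SQL Server considers candidates for such an update, sys.dm_db_stats_properties shows the modification counter per statistic. A sketch, with a made-up table name; keep in mind that on a large table the default threshold is roughly SQRT(1000 * rows) modifications:
SELECT      s.name,
            sp.rows,
            sp.rows_sampled,
            sp.modification_counter,
            sp.last_updated
FROM        sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE       s.object_id = OBJECT_ID(N'load.Sales')          -- hypothetical table name
ORDER BY    sp.modification_counter DESC;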
In this case, that little extra step of first updating the outdated statistics turned out to be not little at all. As mentioned, the stored procedure filled the staging table by reading from the load table and doing a few simple transformations. But also with that needless DISTINCT over a list of several dozen columns.
So, on some days, the statistics on the load table were still fine. The stored procedure compiled, and then it ran, and a few minutes later it was done. Fine.
And then, on other days, there had been too many modifications on the load table. The statistics on its columns were considered outdated. All of them. So, before compiling the execution plan, SQL Server first had to update these statistics. All of them, several dozen. On a 600-million-row table. With FULLSCAN.
And SQL Server is not able to update multiple statistics with a single scan. It has to scan the table once for each statistic it updates. Which, as mentioned, meant several dozen scans.
Mismatch explained
This (along with a few other things that exacerbated the issue, but that are beyond the scope of this post) explained why the same load procedure, on roughly the same amount of data, would some days run much longer than on other days. But it doesn’t explain the huge gap between the start time of the stored procedure as logged by their scheduling tool and the start time of the first query as recorded in Query Store. Or does it?
It is important to be aware of what the Query Store records. Queries, you might think. Right? But in reality, it records query executions.
Now think back to the order of events I just described. Compilation of the stored procedure takes place before it starts executing. Updating the statistics, when needed, is done as part of the compilation, and by default, the optimizer waits for the statistics updates to complete before it produces an execution plan. So that is the order of events: (1) call stored procedure; (2) start compilation; (3) update statistics; (4) finish compilation; (5) start actual execution.
If steps 1 through 4 take two hours, then execution of the first statement starts two hours after the stored procedure is called. Exactly the issue I had seen!
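If you want to watch this order of events happen live, one option is an Extended Events session on the auto_stats event. The sketch below is generic (the database name and file name are placeholders); while the procedure is still compiling, you will see the auto_stats events fire, long before the first statement shows up anywhere else:
CREATE EVENT SESSION TrackAutoStats ON SERVER
ADD EVENT sqlserver.auto_stats
    (WHERE (sqlserver.database_name = N'DWH'))            -- placeholder database name
ADD TARGET package0.event_file
    (SET filename = N'TrackAutoStats')                    -- placeholder file name
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
GO
ALTER EVENT SESSION TrackAutoStats ON SERVER STATE = START;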
The fix
After finding the root cause, I expected the fix to be easy. Just enable the AUTO_UPDATE_STATISTICS_ASYNC option on the database, so that the automatic update takes place in the background and doesn’t cause delays anymore. Yeah. It did achieve what I wanted, but it then turned out that several other queries started to run extremely slowly when they were compiled with outdated statistics. So we rolled back that change.
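For completeness, this was the attempted change and its rollback (the database name is a placeholder):
ALTER DATABASE DWH SET AUTO_UPDATE_STATISTICS_ASYNC ON;   -- the attempted fix
-- ... and the rollback, after other queries started to suffer from stale statistics:
ALTER DATABASE DWH SET AUTO_UPDATE_STATISTICS_ASYNC OFF;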
The next attempt was to get rid of the persisted FULLSCAN sample rate on all statistics. That reduced the time needed to update statistics dramatically. But it also brought back the problems that they had fixed in the past by using FULLSCAN. However, we did not roll back this change. Instead, we re-enabled FULLSCAN for only those tables and statistics where it was actually needed, rather than for all tables and all statistics. This turned out to be sufficient for good enough performance across the board.
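Getting rid of a persisted sample rate is done with another UPDATE STATISTICS. A sketch for a single statistic (the names and the sample percentage are placeholders; the PERSIST_SAMPLE_PERCENT = OFF part is what makes later automatic updates fall back to the default sample rate):
UPDATE STATISTICS load.Sales [_WA_Sys_00000003_0EA330E9]   -- placeholder names
WITH SAMPLE 10 PERCENT, PERSIST_SAMPLE_PERCENT = OFF;      -- the 10 percent is just an example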
Of course, I also want the customer to remove that useless DISTINCT from their load procedure, and then manually remove all those auto-created statistics on the load tables. If they are not there, then SQL Server won’t update them. And it won’t create them either when they are not needed. Since this requires a change in the code generation, this is now an item on their backlog that they have not gotten around to yet. But I am confident that this will reduce their data load time even more, and remove the last of the irregular delays.
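Once that change is made, the cleanup itself is simple. A sketch that generates the DROP STATISTICS statements for the auto-created statistics on one (made-up) load table:
SELECT  N'DROP STATISTICS load.Sales.' + QUOTENAME(s.name) + N';'
FROM    sys.stats AS s
WHERE   s.object_id    = OBJECT_ID(N'load.Sales')          -- hypothetical table name
  AND   s.auto_created = 1;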
Conclusion
The main reason I decided to describe this problem, rather than one of the many others I fight on a daily basis, is that I wanted to share this interesting observation: there are cases where the start time of a query as recorded in the Query Store might be (a lot) later than the time you (or a tool) executed the query, or the stored procedure.
The example in this post is about a stored procedure. But the same can happen when running an individual query. Compilation might take a long time because the query is complex. Or because the optimizer needs access to data that is locked. Or because statistics need to be updated or created first. If that happens, then the start time as recorded in the Query Store will not be when you executed the query, but when the execution plan is ready and it actually starts executing.
I have a suspicion that time waiting for a memory grant to become available is also not tracked in the Query Store, but I was unable to confirm or reject this theory. If you know more about this, then please leave a comment!
Thanks, Deepthi, for hosting this T-SQL Tuesday!
11 Comments
Great blog! QQ, were they using OPTION RECOMPILE for these tasks? BTW, 0:20 is 12:20 AM, not PM!! 😉
No, there was no OPTION RECOMPILE. But the queries ran only once a day, and there were huge data load tasks in between, so it frequently happened that the cached plan was no longer valid (or even already purged), and a new plan had to be compiled.
Sorry for messing up with the American time standard. I always get confused by it, especially around midnight. Guess I should have stuck to European notation.
Thank you! This was insightful and brilliant!
Hi Hugo,
I was curious about this statement:
“Even for a simple INSERT … SELECT, the Query Optimizer wants to estimate the number of rows. Which is easy … until you add DISTINCT. Now, the Query Optimizer needs to assess how many duplicates there are in each column. Which means that it needs statistics on each column.”
Perhaps I misinterpreted, but it sounded like you were saying auto-statistics would be created for every column in a table if one does a SELECT DISTINCT * from SomeTablename query? I tried to replicate this using the StackOverflow database, by looking at the statistics before and after running such a query. I found that automatically generated column statistics (name begins with WA_Sys_xxxxxxx) are not being generated. So I’m a bit confused.
Thank you
Hi Tom,
As you already found out (see comment below), this does indeed have to do with whether or not a primary key (or, probably, any sort of unique index – although I didn’t test) is included in the column list.
The staging tables at this customer are heaps and have no constraints at all. Not unusual for staging tables specifically.
I think I just answered my own question. If the table includes a primary key (as pretty much all tables should) and one does a SELECT DISTINCT * from TableName, apparently SQL Server is smart enough to know that the primary key must be unique so it short-circuits the need to collect statistics on the remaining columns in the table.
use StackOverflow;
go
exec dbo.DropIndexes;
go
— No column statistics generated by the following three queries:
select * from dbo.Badges;
select distinct * from dbo.Badges;
select [Name], UserId, [Date] from dbo.Badges;
/* This query does indeed result in automatically-generated statistics
on the [Name], UserID and [Date] columns: */
select distinct [Name], UserId, [Date] from dbo.Badges;
Sounds like a lot of fun (to read at least)! Great post
Hint: you can identify those types of queries at runtime by executing either sp_WhoIsActive (http://whoisactive.com) or Brent Ozar’s sp_BlitzWho. Both will show you no query plan in this case.
So when you are executing a query and get no plan even after more than a minute, it is very likely that SQL Server is creating or updating statistics in the background.
This can particularly happen when you create and fill a new but already large table. The first time somebody runs a query with new columns in the WHERE / JOIN, SQL Server needs to create statistics for those columns first, which may take a while and can’t be bypassed by an async stats update.
You could only forbid auto-creation of statistics, but then you would have to ensure yourself that every needed statistic exists – and that is really nothing you want to do.
PS: you can execute the two mentioned procedures in a SQL agent job every x minutes and write the results into a table (both have a parameter for this), so you can analyze it in the morning instead of taking a boring night shift.
Thanks for the good suggestions, Thomas!