In a previous blog post, I demonstrated just how much you can hurt your performance by encapsulating expressions and computations in a user-defined function (UDF). I focused on scalar functions that didn’t include any data access. In this post, I will complete the discussion on scalar UDFs by covering the effect of data access in a scalar UDF. Note that, like the previous post, this all applies to T-SQL user-defined functions only. SQL Server also supports CLR user-defined functions (written in a .Net language like C# or VB.Net); those are not in the scope of this blog post.
Data access
So far, I have been looking at an example of a UDF that only does a calculation. A very simple calculation, but this UDF served as a stand-in for other, probably more complex UDFs that manipulate their arguments without reading additional data. However, in a database system, these are not the most common UDFs – in my experience, the majority of the UDFs in any database will read some data from some tables. Are the UDFs that do data access as bad as those that don’t? Let’s find out. In the code below, I create a new UDF that doesn’t compute the triple value, but instead looks it up in a table that holds the triple values of all numbers between -100,000 and 100,000. This mimics what I think is a rather common data access pattern in a UDF: reading a single row from a lookup table, based on the primary key value.
CREATE TABLE dbo.Triples
(Value int NOT NULL PRIMARY KEY,
Triple int NOT NULL);
-- Triples of numbers 1 to 100,000
WITH Digits
AS (SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(d))
INSERT INTO dbo.Triples (Value, Triple)
SELECT 10000 * tt.d + 1000 * st.d
+ 100 * h.d + 10 * t.d + s.d + 1,
(10000 * tt.d + 1000 * st.d
+ 100 * h.d + 10 * t.d + s.d + 1) * 3
FROM Digits AS s, Digits AS t, Digits AS h,
Digits AS st, Digits AS tt;
-- Triples of numbers -1 to -100,000
INSERT INTO dbo.Triples (Value, Triple)
SELECT -Value, -Triple
FROM dbo.Triples;
-- And add the number 0
INSERT INTO dbo.Triples (Value, Triple)
VALUES (0, 0);
go
CREATE FUNCTION dbo.TripFromTbl(@Input int)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN;
DECLARE @Result int;
SELECT @Result = Triple
FROM dbo.Triples
WHERE Value = @Input;
RETURN @Result;
END;
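(A quick recap for readers who skipped the previous post: dbo.LargeTable is a ten-million-row test table. The sketch below shows its assumed shape, reconstructed from details used in this post – the key column name is mine, and the CHECK constraint on DataVal limits it to the ten distinct values 1 through 10.)
CREATE TABLE dbo.LargeTable
(KeyVal int NOT NULL PRIMARY KEY,
DataVal int NOT NULL CHECK (DataVal BETWEEN 1 AND 10));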
To compare the performance of this UDF with its inline equivalent, I would normally replace the UDF call with a correlated subquery. But my test query embeds the call to the UDF in an aggregate function, and SQL Server doesn’t allow a subquery as the argument to an aggregate function – so I have to make the code a bit more complex to work around that limitation. (My first instinct would have been to rewrite the query as a join, but that would also be an optimization, and in this case I wanted to stay as close as possible to the original and see what the optimizer does – the CTE below is the direct replacement of the SELECT dbo.TripFromTbl(DataVal) FROM dbo.LargeTable; query.)
SELECT MAX(dbo.TripFromTbl(DataVal)) AS MaxTriple
FROM dbo.LargeTable;
WITH TheTrips
AS (SELECT (SELECT t.Triple
FROM dbo.Triples AS t
WHERE t.Value = l.DataVal) AS TheTrip
FROM dbo.LargeTable AS l)
SELECT MAX(TheTrip) AS MaxTriple
FROM TheTrips;
Let’s first take a look at the execution plans. More lies? Yup! But in this case, the worst culprit is not the execution plan; we’ll see far worse lies in a bit. But let’s not get ahead of ourselves. I selected the two queries, hit the execute button, and after waiting 7½ minutes, I was able to make this screenshot of the execution plans for the queries above:
The plan for the query without UDF is pretty smart – though not optimal. The optimizer has reversed the order in which the tables are accessed. It will first read the Triples table, and if you drill down into the properties of the Clustered Index Seek operator, you will see that it has used its knowledge of the CHECK constraint on the DataVal column (the column used in the join) – only rows with Value >= 1 and <= 10 are read. In the parallel part of the plan, the large table is then read entirely and joined to the rows from the Triples table. A hash match join iterator is used, so that the table is read only once, at the cost of some CPU and memory overhead. The iterator uses right outer join semantics, to ensure that rows from LargeTable are not lost if a value is missing from the Triples table – we know this is not the case, but the optimizer doesn’t. The results are then aggregated (to find the maximum triple in each thread), and then those maximums are aggregated again to find the maximum across all threads.
This is not an optimal plan. The value coming from the Triples table will be the same for all rows with the same DataVal in LargeTable, so instead of reading all those rows and then aggregating them back to a single row per Triple value, it would suffice to check, for each value in the Triples table, whether that DataVal exists in LargeTable at all. But that is apparently not a transformation that the optimizer is currently able to make – yet? (Are you reading, Conor?)
The execution plan for the query with UDF is a lot simpler. It is, in fact, exactly the same as it was for the earlier versions of the UDF. And even though SQL Server cannot take advantage of parallelism for this query, it is still considered cheaper than the version without UDF: 44% of the total batch. Lies, of course – to see how bad the lies are in this case, I disabled the actual execution plan option and added the SET STATISTICS TIME ON statement, but in this case I also added a SET STATISTICS IO ON statement to see how much I/O the two alternatives use. I then clicked the Execute button and waited for another 7½ minutes.
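For reference, the measurement setup is nothing more than this (a sketch; the actual execution plan option is disabled, and both versions of the query shown earlier are executed after these SET statements):
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- ... both versions of the query, as shown earlier, go here ...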
The output of STATISTICS IO is clear. For the version without UDF, there are 21,394 logical reads on LargeTable (slightly more than its 21,088 pages; the parallelism causes a few pages to be read more than once), 2 reads on the Triples table, and two worktables are mentioned, but they have no logical reads. Pretty good. But the version with UDF again appears to win hands down: it uses 21,088 logical reads on LargeTable (obviously) – and nothing more! According to this output, the query with UDF manages to read all those triple values from the Triples table without ever accessing it! More lies? You bet! In this case, the lie is that the STATISTICS IO output never includes any I/O that is done within a UDF. So if I rewrote the UDF to do twenty scans of a billion-row table each time it is called, the STATISTICS IO output would still insist that that table wasn’t read at all – though the execution time of the UDF would tell a quite different story.
To really see the data access done by a UDF, you have to have a profiler trace running while executing the query that invokes it. If you repeat this query with an active profiler trace that includes the relevant events, you’ll get an enormous amount of output – so please don’t do this on a production box! In the collected output, you’ll see a block that starts with an SP:Starting event (the function is invoked), then has two pairs of SP:StmtStarting and SP:StmtCompleted events (the two executable statements in the function), and finally an SP:Completed event (the function has ended) – and this block is repeated ten million times! In the Reads column of SP:StmtCompleted, you can also see the number of logical reads: only 2 per execution of the function, since all it does is a lookup on the clustered index key in a small table. And the best news is that you don’t even have to do any of the math yourself, because when the query finally finishes, the SQL:StmtCompleted event will show you the total I/O in the relevant columns – in this case the Reads column, which shows a grand total of 20,021,088 reads. That’s 20 million for the UDF (2 reads in each of the 10 million executions) that STATISTICS IO doesn’t report, plus the 21,088 reads on LargeTable that STATISTICS IO does report.
(Note: If you need to do serious data collection for troubleshooting a potential UDF issue, I recommend including only the SQL:StmtCompleted event with the Reads column, and leaving out SP:StmtStarting, SP:StmtCompleted, SP:Starting, and SP:Completed. That makes the trace much more lightweight, and it still gives you the information you need. Only include the other events if you need to drill down to the level of individual statements within the UDF, and if you do, only run tests on very small numbers of rows.)
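(On recent versions of SQL Server, an Extended Events session is an even more lightweight alternative to a Profiler trace. A minimal sketch – the session name, target file name, and the choice to capture only statement completions are mine:)
CREATE EVENT SESSION UdfReads ON SERVER
ADD EVENT sqlserver.sql_statement_completed
ADD TARGET package0.event_file (SET filename = N'UdfReads');
ALTER EVENT SESSION UdfReads ON SERVER STATE = START;
-- The logical_reads field of sql_statement_completed plays the same
-- role as the Reads column of SQL:StmtCompleted.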
So, time for the ultimate measure (and the only measure that we can get a truthful report on without having to go outside Management Studio): the actual execution times. For the version without UDF, the CPU time is 7021 ms, and the elapsed time is 1406 ms. And the version with UDF took 488,470 ms CPU and 494,533 ms elapsed – a 70x increase in CPU, and even a 350x increase in elapsed time, from 1½ seconds to over 8 minutes! Do you now understand why database consultants have trouble suppressing their smile when their new customer turns out to have littered the database code with UDFs? It’s the ultimate low-hanging fruit!
Lies, damned lies, and estimates
The original version of the UDF, which didn’t do any data access, already profited enormously from a rewrite that capitalized on the fact that there are only 10 distinct values for the argument with which it is called. This version profits even more – rewrite the query to the one below, and the execution time is down to 3838 ms CPU and 3871 ms elapsed – still no parallelism anywhere in the plan, but nevertheless a great improvement.
SELECT MAX(dbo.TripFromTbl(DataVal)) AS MaxTriple
FROM (SELECT DISTINCT DataVal
FROM dbo.LargeTable) AS d;
The UDF does not contain any elements that would make it nondeterministic, and indeed, the IsDeterministic property of this UDF is 1. So determinism is not the reason why the optimizer doesn’t consider this transformation.
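(You can verify this with the standard OBJECTPROPERTY metadata function:)
SELECT OBJECTPROPERTY(OBJECT_ID(N'dbo.TripFromTbl'),
                      'IsDeterministic') AS IsDeterministic;
So what is the reason? This time, it is not a zero-cost assumption for the UDF – take a look at the simple query below and the associated estimated execution plan: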
SELECT dbo.TripFromTbl(3);
If you drill down into the properties of the iterator, you will see that the estimated subtree cost for the UDF is 0.0032831. Not entirely accurate – it is the same cost I get for a query that directly does the corresponding select from the Triples table, so the overhead of invoking the UDF is not taken into account; but I think we can safely assume that this overhead is minimal in relation to the I/O cost anyway.
Since the UDF is executed once from the main query, I would expect the estimated operator cost for the Compute Scalar in the query to be about the same number – but it isn’t. This cost shows up as only 0.0000001. According to the query plan, the iterator that invokes the UDF costs less than a hundredth of a percent of the UDF itself. Clearly, in this case the optimizer does know something about the cost of the UDF, but still refuses to take it into account: “Yes, I know that this UDF is not free, but for the purpose of optimization, I’ll assume it is anyway”. Why? I have no idea – but I did report it as a bug on the Connect website.
The remedy
In the previous post, I told you to avoid scalar UDFs that don’t do any data access. In this post, I showed that the performance hit of scalar UDFs that do include data access is even far worse, so you won’t be surprised when I tell you that you should avoid these as well – maybe even more so. The same exceptions apply: a SET statement without subquery, so that the UDF is processed only once; or maybe a SELECT statement on a small table in a part of your application that is not on a time-critical path.
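For example, this use of the UDF (the variable name is mine) is harmless, because the function is invoked only once, regardless of how many rows any table holds:
DECLARE @OneTriple int;
-- The UDF runs exactly once here, so its per-call overhead and its
-- 2 logical reads are negligible.
SET @OneTriple = dbo.TripFromTbl(3);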
If the logic in the UDF is relatively simple, you can often replace the call to the UDF with a simple subquery, possibly with a CASE expression somewhere to mimic an IF … ELSE in the UDF (see the sketch below). This is not possible if the UDF is called inside an aggregate function, as we saw at the start of this post. And if the UDF is complex, replacing it with a single subquery may also prove to be too hard, or it may result in a monster query that nobody will be able to maintain, or that, due to its sheer complexity, manages to throw off the optimizer. In cases like that, the best advice is to bite the bullet and do the work of completely rewriting the original query without using the UDF. Depending on the complexity of the problem, this may eventually result in breaking up the query into multiple steps, using temporary tables or table variables to store intermediate results. Not my first choice, but almost bound to perform better than any solution that calls a scalar user-defined function with data access.
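To illustrate the CASE technique with a hypothetical example (this variant of the UDF is invented for illustration only): a UDF that returns 0 for negative input and otherwise looks up the triple could be inlined as:
SELECT CASE WHEN l.DataVal < 0 THEN 0
            ELSE (SELECT t.Triple
                  FROM dbo.Triples AS t
                  WHERE t.Value = l.DataVal)
       END AS TheTrip
FROM dbo.LargeTable AS l;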
For the query I used in this blog post, my actual workaround would not be the complicated construction above, but a simple rewrite with a join:
SELECT MAX(t.Triple) AS MaxTriple
FROM dbo.LargeTable AS l
INNER JOIN dbo.Triples AS t
ON t.Value = l.DataVal;
This rewrite has several advantages over the original version. The most obvious is that it’s much easier to understand and maintain. A second advantage is that the execution plan does not have to take into account the possibility that values may be missing from the dbo.Triples table (and indeed, the only difference between the plan for this query and the plan for the first query without UDF is that the Hash Match join iterator has been changed from outer to inner join). This also gives the optimizer more options to find better plans – it doesn’t find one in this case, and both queries perform equally, but if you add a nonclustered index on the DataVal column of LargeTable, this query will benefit the most.
Or, if performance is really important, I would use a version that forces the optimizer to first do a DISTINCT on all the DataVal values before joining to the Triples table – less obvious to read and maintain, but in this specific case (because there are so few distinct DataVal values and so many rows in LargeTable) it is about 50% faster than the alternative:
SELECT MAX(t.Triple) AS MaxTriple
FROM (SELECT DISTINCT DataVal FROM dbo.LargeTable) AS l
INNER JOIN dbo.Triples AS t
ON t.Value = l.DataVal;
And a completely different way to tackle the performance impact of a scalar UDF with data access is to rewrite it as an inline table-valued function. As already promised in the first part of this series, I will cover this technique and discuss its benefits and drawbacks later.
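To give a rough idea of the direction (a sketch only – the function name below is invented, and the full discussion follows in that later post), the inline table-valued equivalent of dbo.TripFromTbl and its invocation could look like this:
CREATE FUNCTION dbo.TripFromTbl_Inline(@Input int)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(SELECT Triple
FROM dbo.Triples
WHERE Value = @Input);
go
-- Invoked with CROSS APPLY instead of as a scalar expression, so the
-- optimizer can expand the function body into the main query plan.
SELECT MAX(f.Triple) AS MaxTriple
FROM dbo.LargeTable AS l
CROSS APPLY dbo.TripFromTbl_Inline(l.DataVal) AS f;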
What’s next?
I have not yet decided on the next part. My original idea was to limit this series to T-SQL user-defined functions (hence the title). But a comment I received by email has made me rethink that plan; I am now considering discussing the effects of CLR user-defined functions as well (though the title of the blog post series would then be slightly incorrect).
If I do decide to cover CLR functions too, then the next part will be on the scalar variety of CLR functions (or, if you wish, the CLR variety of scalar functions) to wrap up the discussion on scalar functions. If I decide to leave out CLR, my next post will be about multi-statement table-valued functions. In that post, I will show that, while scalar functions may be bad, multi-statement table-valued functions are downright ugly!
1 Comment
Hi Hugo,
I realise the main point of this post is to show the poor performance characteristics of scalar UDFs that access data (and associated costing limitations) but you expanded the discussion to optimizer stuff, so I have some comments… 🙂
The small sizes of your example tables do not prompt a full effort from the optimizer. Faced with relatively small costs for obvious plans, it terminates its search early with Good Enough Plan Found, which is a good thing in general.
The point about trying different syntax (while being careful about semantics!) is a good one. The optimizer will never include all possible logical transformations, just a good set that produce benefits for a wide range of common queries (trading ultimate plan quality for compilation time).
On that note, a useful and natural way to express the query requirement is:
SELECT
MAX(t.Triple)
FROM dbo.Triples AS t
WHERE EXISTS
(
SELECT 1
FROM dbo.LargeTable AS lt
WHERE
lt.DataVal = t.Value
);
This produces a plan with the pre-aggregation you are looking for, but we can do even better by adding the obvious index on LargeTable (DataVal):
CREATE NONCLUSTERED INDEX nc1 ON dbo.LargeTable (DataVal);
That gives us a plan with no aggregation at all, just two seeks, a semi-join, and a stream aggregate. The estimated cost is 0.0305208, compared with 0.296198 for the manual rewrite using DISTINCT.
Though not immediately important in the specific case you show, we could also help the optimizer by creating a UNIQUE constraint on Triples (Triple) and enforcing a FOREIGN KEY relationship from LargeTable (DataVal) to Triples (Value).
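In DDL terms (the constraint names are illustrative), that would be something like:
ALTER TABLE dbo.Triples
ADD CONSTRAINT UQ_Triples_Triple UNIQUE (Triple);
ALTER TABLE dbo.LargeTable
ADD CONSTRAINT FK_LargeTable_Triples
FOREIGN KEY (DataVal) REFERENCES dbo.Triples (Value);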
I generally prefer APPLY to subqueries (a sub-SELECT in an outer SELECT clause). APPLY tends to optimize better in my experience, especially when useful FOREIGN KEY and NOT NULL constraints are present.
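For example (a sketch – OUTER APPLY matches the semantics of the correlated subquery version exactly; CROSS APPLY would give the same result here because MAX ignores NULLs anyway):
SELECT MAX(t.Triple) AS MaxTriple
FROM dbo.LargeTable AS l
OUTER APPLY (SELECT tr.Triple
             FROM dbo.Triples AS tr
             WHERE tr.Value = l.DataVal) AS t;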
Paul