I was unsure when I read this assignment. I don’t recall ever getting emotional over code. Sure, I can feel pretty smug when I have optimized a query to slash its execution time by 99%, or when I see that the program code to solve a complex issue actually is pretty clean. But in those cases, it’s my achievement that makes me proud, not the code.
And, yes, I can also get quite upset when I encounter code that spits in the face of all best practices, that completely bypasses all security rules, that crawls even slower than a snail in a tar pit, or that does one of the many other stupid things I have seen in my career. But in that case, it is the stupidity of whoever wrote the code that causes my uproar, not the code itself.
But I wanted to contribute anyway. So here is a recent example of code that probably would have made me feel a way if I had been the type of person that gets emotional over code. Or put differently, here is the story of how I gained performance by reducing readability and maintainability.
For the record, and to prevent confusion, I am not going to name actual customers, nor name the ERP system used, and the description I give is highly abstracted away from the original problem, and heavily simplified as well. I describe the basis of what the issue was with the code I encountered and how I fixed it, but without revealing any protected information.
A table with history
As you probably know, I am currently on sick leave. I don’t have the energy to do normal work. But when a customer calls for a short gig, that I can do in a few days, working just a few hours per day, then I can do that. And that’s what recently happened.
A customer called. They had trouble with a report. It had always been slow, but they had learned to accept that. As long as it finished in an hour or so, they could deal with it. After a recent upgrade of their ERP, however, it didn’t finish anymore. They had even tried an overnight run, which they cancelled the next day, after 17 hours, when there were still no results. Could I perhaps look at the query and advice how to make it faster?
The report was based on customer ledgers. Those were distributed among two tables, one for the header information, and one for the details (very similar to how sales order, invoices, etc are typically stored – just look at SalesOrderHeader and SalesOrderDetail in the AdventureWorks sample database to see an example).
These tables were also huge. Most financial systems support some way of consolidating history. Once a financial year is closed, all transaction details of that year are aggregated, and then removed in the database with one single transaction, at the start of the new financial year, for the opening balance.
But something being supported doesn’t automatically mean it’s also being used. I’ve seen multiple companies that simply never consolidate their closed financial years. They are afraid to throw away data that might perhaps some day be useful (narrator’s voice: it never is). And as a result, tables with transaction history, customer ledger history, and several similar tables, just continue to grow and grow and grow.
This can become especially painful when the database does not store the current balance of a customer (in the case of customer ledgers) or account (for general ledgers). From a pure design perspective, that makes sense: “thou shalt never store derived data”. The current balance can always be derived by simply aggregating all transactions, from the start of time until now. But “simply” aggregating might not actually be simple once the database contains transaction history going back twenty years. Now, computing the current balance becomes a huge performance hog.
Well, this particular ERP did indeed not store the current position of customers. It had to be computed by going over the full customer ledger history. And because this customer is one of those who do not consolidate closed years, that was a lot of history to go through. The number of rows in the customer ledger header table was an 8-digit number.
A well-designed query
When I looked at the query, I noticed how well-designed it was – when thinking from the customer’s perspective. All the logical steps to get to the report data were clearly visible, so there was a nice one on one translation from requirements to code.
The first step was to extract the relevant customer ledger data. A cutoff date was used, passed in by the front-end, because the report typically would show data up to the end of the previous month.
WITH RelevantLedgers AS (SELECT some columns FROM dbo.CustomerLedgerHeader AS clh INNER JOIN dbo.CustomerLedgerDetail AS cld ON cld.CustomerLedgerID = clh.CustomerLedgerID WHERE clh.TransactionDate <= @CutoffDate),
However, the report was not at the level of individual customer ledgers. The data had to be aggregated at the customer level. And one of the elements reports was, of course, the current open balance of each customer. This was computed in another CTE:
CustomerBalances AS (SELECT CustomerID, SUM (AmountDue - AmountPaid) AS CustomerBalance FROM RelevantLedgers GROUP BY CustomerID),
And that was not all. The report also had to show the total new claims for the “current” month (from the perspective of the passed in cutoff date, so in effect typically last month), and the month before. Here, we once more saw that the SQL developer had written a CTE for each of these requirement:
TotalsThisMonth AS (SELECT CustomerID, SUM (AmountDue) AS NewClaims FROM RelevantLedgers WHERE TransactionDate >= DATEADD (DAY, 1, EOMONTH (@CutoffDate, -1)) AND TransactionDate < DATEADD (DAY, 1, EOMONTH (@CutoffDate)) GROUP BY CustomerID), TotalsLastMonth AS (SELECT CustomerID, SUM (AmountDue) AS NewClaims FROM RelevantLedgers WHERE TransactionDate >= DATEADD (DAY, 1, EOMONTH (@CutoffDate, -2)) AND TransactionDate < DATEADD (DAY, 1, EOMONTH (@CutoffDate, -1)) GROUP BY CustomerID)
And then, for the final report, the data from all these CTEs was combined with the base customer data:
SELECT some columns FROM dbo.Customers AS c LEFT JOIN CustomerBalances AS cb ON cb.CustomerID = c.CustomerID LEFT JOIN TotalsThisMonth AS tm ON tm.CustomerID = c.CustomerID LEFT JOIN TotalsLastMonth AS lm ON lm.CustomerID = c.CustomerID;
All in all, a beautiful query. Sheer perfection. Well, apart from the little detail that it never finished.
The cost of beauty
I like it when code looks good. I much prefer beautiful, easy to understand, and easy to maintain code over messy constructions. Most of the time, I am even willing to sacrifice some performance when that results in code that is easier to maintain going forward, because in the long term that will likely save the customer more money than a few percent performance gain.
But there are limits to the amount of performance one should sacrifice for better looking code!
In this case, we had a CTE that very neatly abstracted the details of how to find relevant customer ledgers away from the rest of the query. And that CTE was then used three times, to get three different aggregations by customer. Those three different aggregations were then joined together.
The optimizer can do a lot of things. But it cannot do magic. When you ask it to do three aggregations over the same source data, but with different filters, then you will get an execution plan where the source data is read and aggregated three times. And if you are lucky, those three are then combined using Merge Join or Hash Match. If you’re unlucky, then the estimates could at one point be so far off (remember, the more complexity you have in your query, the harder life gets for the cardinality estimator!) that the optimizer even decides to put one of those aggregations on the inner side of a Nested Loops, causing it to execute perhaps even thousands of times!
And indeed, that was exactly what I saw in the execution plan of the problem query. Bad estimations, getting worse with each next join or aggregation; multiple branches that each repeated the same pattern of reading CustomerLedgerHeader and CustomerLedgerDetail, joining them, and aggregating that data; and indeed at one point where the estimated number of rows was very low (and probably incorrect, though I did not have an execution plan with run-time statistics [aka actual plan] to verify) and hence a Nested Loops was used.
I wanted SQL Server to do just a single pass over all the data. Read the tables once. Join them once. And aggregate them once. And that is possible in this case, provided you rewrite the query appropriately. The optimizer will not do this for you. It can do a lot, but not everything.
Here is the rewrite I suggested to my customer:
WITH RelevantLedgers AS (SELECT CustomerID, TransactionDate, AmountDue, AmountPaid FROM dbo.CustomerLedgerHeader AS clh INNER JOIN dbo.CustomerLedgerDetail AS cld ON cld.CustomerLedgerID = clh.CustomerLedgerID WHERE clh.TransactionDate <= @CutoffDate), CombinedAggregations AS (SELECT CustomerID, SUM (AmountDue - AmountPaid) AS CustomerBalance, SUM (CASE WHEN TransactionDate >= DATEADD (DAY, 1, EOMONTH (@CutoffDate, -1)) AND TransactionDate < DATEADD (DAY, 1, EOMONTH (@CutoffDate)) THEN AmountDue END) AS NewClaimsThisMonth, SUM (CASE WHEN TransactionDate >= DATEADD (DAY, 1, EOMONTH (@CutoffDate, -2)) AND TransactionDate < DATEADD (DAY, 1, EOMONTH (@CutoffDate, -1)) THEN AmountDue END) AS NewClaimsLastMonth FROM RelevantLedgers GROUP BY CustomerID) SELECT some columns FROM dbo.Customers AS c LEFT JOIN CombinedAggregations AS ca ON ca.CustomerID = c.CustomerID;
Here, we very explicitly specify just a single pass over the RelevantLedgers CTE. And we use what in Excel would be called a “conditional sum” to compute the new claims in just this month or in just the last month. This query now runs a lot faster!
(As mentioned above, the actual query was more complex than this. And there were more issues with it than just this one. But I do believe that the repeated passes over the customer ledgers were the main cause of slowness. With all other optimizations combined, the query still did not finish in the two hours I had left it running. After also adding this optimization, it was done in less than 7 minutes!)
Price to pay
But this performance gain does come at a cost. The customer has limited staff available. Nobody there is specifically specialized in SQL Server. For regular readers of my blog, the logic of the SUM(CASE …) expressions in the CTE might be simple to understand. For their staff, this is advanced stuff. If, one year from now, someone needs to implement a change that affects this report, they will have a hard time understanding this code. Much harder than they would have with the original query.
This also reflects in the name I gave to the CTE. “Combined aggregations”. It doesn’t get more generic than that. I could just as well have called it “MyCTE” or something similar. There is zero connection between this name and the business specification of the report. Whereas the CTEs in the original query followed logically from the report specifications. Just the names alone made the query self-documenting. No additional comments are needed when a CTE is called “TotalsLastMonth”.
This lack of clarity in the code can partly be alleviated by adding comments. But I will not deny that rewriting the code had, in this case, an adverse effect on future maintenance.
Often, good-looking and easy-to-read code performs quite well. That is a win-win situation. But we are not always that lucky.
If you can afford the performance hit, then I would always advocate for using code that is easier to read, easier to understand, and easier to maintain going forward. The code of more expensive maintenance in the future almost often outweighs the benefit of better performance.
But there are exceptions. A report that is still not finished after 17 hours. A query that executes a hundred times every time a user clicks a button, so that even a millisecond performance difference results in a full second lag on the client. Or a query that runs millions of times in a cloud environment, where the payment is based on resource usage. In those exceptional cases, we sometimes do have to sacrifice readability in order to gain performance. Documentation and comments can help alleviate the pain of the reduced readability. But they can’t take away the fact that there is a price to pay for such performance gains.