A recurring question in the newsgroups is: “How do I output unrelated rows side by side?”
I’ll use the AdventureWorks database as an example. Suppose that my manager wants a report listing, for each order, all products ordered and all reasons the customer has for ordering from AdventureWorks. Here’s a sample of how the report should look.
SalesOrderNumber | ProductID | Reason |
SO51318 | 858 | Other |
SO31518 | 870 | Price |
SO31518 | 872 | NULL |
SO31518 | 997 | NULL |
SO31519 | 870 | On Promotion |
SO31519 | 999 | Price |
Getting a list of all products ordered for each order is a trivial affair:
SELECT soh.SalesOrderNumber, sod.ProductID
FROM
INNER JOIN
ON sod.SalesOrderID = soh.SalesOrderID
— For testing, use only six SalesOrders
WHERE soh.SalesOrderID BETWEEN 51315 AND 51320
ORDER BY soh.SalesOrderNumber, sod.ProductID;
And getting a list of all reasons for an order is only slightly more complicated:
SELECT soh.SalesOrderNumber, sr.[Name] AS Reason
FROM
LEFT JOIN
ON sohsr.SalesOrderID = soh.SalesOrderID
LEFT JOIN
ON sr.SalesReasonID = sohsr.SalesReasonID
— For testing, use only six SalesOrders
WHERE soh.SalesOrderID BETWEEN 51315 AND 51320
ORDER BY soh.SalesOrderNumber, sr.[Name];
(Note the use of an OUTER JOIN is required, because some orders have no reasons associated with them.)
But how to combine these two lists into a single query? Of course, we all know that the correct answer is: “Don’t. That’s formatting, and formatting is best handled at the front end.” It’s even incredibly easy to do this in the front end. Just open two connections to SQL Server (or get really funky and use MARS on a single connection J), send the two queries above to the server and use the standard balanced line algorithm to process the incoming rows. Easy. And the client will still have time to run some idle cycles while waiting for the next row to come in over your network.
But the correct answer is not always the practical one. What if the manager needs his report NOW, but there’s no one with ASP of VB skills available? In that case, I will use the database for formatting, since I happen to be better in writing SQL than in writing VB. So I will have to combine these two queries to get a single report as output. And that brings me right back to the question HOW to do that.
A common mistake is to just combine the two queries into this single query:
SELECT soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason
FROM
INNER JOIN
ON sod.SalesOrderID = soh.SalesOrderID
LEFT JOIN
ON sohsr.SalesOrderID = soh.SalesOrderID
LEFT JOIN
ON sr.SalesReasonID = sohsr.SalesReasonID
— For testing, use only six SalesOrders
WHERE soh.SalesOrderID BETWEEN 51315 AND 51320
ORDER BY soh.SalesOrderNumber, sod.ProductID, sr.[Name];
But that doesn’t give me the results I need. For example, SalesOrder SO51319 is listed nine times instead of three. This is actually quite logical if you think about it – there are three products included in this order and three reasons, and since I didn’t specify any additional criteria in the JOIN clause, each of the products is combined to each of the reasons for a total of 3 * 3 = 9 rows.
If I want to get just the three rows, I’ll have to extend the JOIN clause so as to ensure that the “first” product gets joined to the “first” reason, the “second” product to the “second” reason, etc. But how do I (or rather: how does the database) know what the “first” product and the “first” reason are? The answer is that I must supply my own definition of “first”, ”second”, etc, then write the SQL to calculate these rankings. For this example, let’s use the order of the ORDER BY clause in the queries above to assign ranks. So the lowest numbered product and the alphabetically first reason get a rank of one, etc.
In SQL Server 2000, calculating a rank required the use of a subquery. SQL Server 2005 introduces the cool new RANK() function that I can use instead. I’ll get straight to the code, both for SQL Server 2005 and for SQL Server 2000, with some comments below.
— Syntax for SQL Server 2005 only – uses the new RANK() function
SELECT soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason
FROM (SELECT sod1.SalesOrderID, sod1.ProductID,
RANK() OVER (PARTITION BY sod1.SalesOrderID
ORDER BY sod1.ProductID) AS Rank
FROM
FULL JOIN (SELECT sohsr1.SalesOrderID, sr1.[Name],
RANK() OVER (PARTITION BY sohsr1.SalesOrderID
ORDER BY sr1.[Name]) AS Rank
FROM Sales.SalesOrderHeaderSalesReason AS sohsr1
INNER JOIN Sales.SalesReason AS sr1
ON sr1.SalesReasonID = sohsr1.SalesReasonID) AS sr
ON sr.SalesOrderID = sod.SalesOrderID
AND sr.Rank = sod.Rank
INNER JOIN
ON soh.SalesOrderID = COALESCE (sod.SalesOrderID, sr.SalesOrderID)
— For testing, use only six SalesOrders
WHERE soh.SalesOrderID BETWEEN 51315 AND 51320
ORDER BY soh.SalesOrderID, COALESCE(sr.Rank, sod.Rank);
— Syntax for SQL Server 7.0 and up
SELECT soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason
FROM (SELECT sod1.SalesOrderID, sod1.ProductID,
(SELECT COUNT()
FROM
WHERE sod2.SalesOrderID = sod1.SalesOrderID
AND sod2.ProductID <= sod1.ProductID) AS Rank
FROM
FULL JOIN (SELECT sohsr1.SalesOrderID, sr1.[Name],
(SELECT COUNT()
FROM
JOIN
ON sr2.SalesReasonID = sohsr2.SalesReasonID
WHERE sohsr2.SalesOrderID = sohsr1.SalesOrderID
AND sr2.[Name] <= sr1.[Name]) AS Rank
FROM
INNER JOIN Sales.SalesReason AS sr1
ON sr1.SalesReasonID = sohsr1.SalesReasonID) AS sr
ON sr.SalesOrderID = sod.SalesOrderID
AND sr.Rank = sod.Rank
INNER JOIN
ON soh.SalesOrderID = COALESCE (sod.SalesOrderID, sr.SalesOrderID)
— For testing, use only six SalesOrders
WHERE soh.SalesOrderID BETWEEN 51315 AND 51320
ORDER BY soh.SalesOrderID, COALESCE(sr.Rank, sod.Rank);
Both versions of the query use two derived tables. And each of those derived tables contains the output of one of the two queries I started with, plus an additional column to hold the rank (based on the ordering I defined). These two derived tables are the joined together on identical SalesOrderID and identical rank, using a FULL OUTER JOIN. Using a LEFT OUTER JOIN or RIGHT OUTER JOIN would not have been sufficient here, since there are orders with more products than reasons or even no reasons at all (e.g. SO15318, SO15320), but also orders with more reasons than products (e.g. SO51924 [not included in the test set!]). The result of this outer join is then combined (using INNER JOIN – that is safe since each order always has at least one product) with the SalesOrderHeader table to get the order number.
Okay, so we now have got a working solution. Let’s check how it performs, ‘kay? So I removed the WHERE clause from all my queries, added an INTO #Dummy clause to make sure I tested the speed of the server, not the speed of my front end (SSMS) formatting and displaying the data, and enclosed each of the queries in a loop to execute them five times (one execution with cold cache, after using CHECKPOINT, DBCC DROPCLEANBUFFERS, and DBCC FREEPROCCACHE; the other four with hot cache). Here are the rounded results when testing on my system, running SQL Server 2005 on Windows XP Professional. (Note that results on other systems will probably be different!)
- Just the two basic queries, assuming that the front end will have ample time to perform the balanced line processing while waiting for the next row to arrive – 3.5 seconds.
- The SQL Server 2005 version with the RANK() function – 4 seconds.
- The SQL Server 7.0 and later version without RANK() – 14 seconds.
Conclusion: Doing this kind of formatting client side is still preferred. The queries are easier to grasp and hence easier to maintain, and it’s the method that gives the best performance. But if you have to do it server side, make sure to use the new RANK() function if you’re already on SQL Server 2005, rather than the version with subqueries. . Not only are they easier to read, It performs MUCH better than the subqueries that are required for the older versions of SQL Server, and it’s a lot easier to read.
3 Comments. Leave new
Hi Hugo,
Very interesting. You might be interested in my article on several similar problems:
http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0401kuznetsov/index.html
Unfortunately, not all OLPA functions are implemented in SQL 2004 yet.
Hi Hugo,
One more thing: if there are duplicates, RANK() has gaps and ROW_NUMBER() is better suited for the task. i think duplicates are quite possible in your situation. For instance, I bought by mail bags for my vacuum cleaner more than once from one and the same company.
Hi Alexander,
You are absolutely right about using ROW_NUMBER rather than RANK if duplicates are possible. Thanks for the addition.
This also shows another bonus of the new ranking functions. The SQL Server 2000 version of my query would fail if duplicates were allowed in SalesOrderDetail and/or SalesOrderHeaderSalesReason, just like the SQL Server 2005 version with RANK(). But whereas the latter version is easily fixed with ROW_NUMBER, there’s no easy fix for the SQL Server 2000 version.
Thanks for your comments (and for the link to your great article)!