Relating unrelated rows

18 years ago

A recurring question in the newsgroups is: “How do I output unrelated rows side by side?”

I’ll use the AdventureWorks database as an example. Suppose that my manager wants a report listing, for each order, all products ordered and all reasons the customer has for ordering from AdventureWorks. Here’s a sample of how the report should look.

SalesOrderNumber	ProductID	Reason
SO51318	858	Other
SO31518	870	Price
SO31518	872	NULL
SO31518	997	NULL
SO31519	870	On Promotion
SO31519	999	Price

Getting a list of all products ordered for each order is a trivial affair:

SELECT soh.SalesOrderNumber, sod.ProductID

FROM Sales.SalesOrderHeader AS soh

INNER JOIN Sales.SalesOrderDetail AS sod

ON sod.SalesOrderID = soh.SalesOrderID

— For testing, use only six SalesOrders

WHERE soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY soh.SalesOrderNumber, sod.ProductID;

And getting a list of all reasons for an order is only slightly more complicated:

SELECT soh.SalesOrderNumber, sr.[Name] AS Reason

FROM Sales.SalesOrderHeader AS soh

LEFT JOIN Sales.SalesOrderHeaderSalesReason AS sohsr

ON sohsr.SalesOrderID = soh.SalesOrderID

LEFT JOIN Sales.SalesReason AS sr

ON sr.SalesReasonID = sohsr.SalesReasonID

— For testing, use only six SalesOrders

WHERE soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY soh.SalesOrderNumber, sr.[Name];

(Note the use of an OUTER JOIN is required, because some orders have no reasons associated with them.)

But how to combine these two lists into a single query? Of course, we all know that the correct answer is: “Don’t. That’s formatting, and formatting is best handled at the front end.” It’s even incredibly easy to do this in the front end. Just open two connections to SQL Server (or get really funky and use MARS on a single connection J), send the two queries above to the server and use the standard balanced line algorithm to process the incoming rows. Easy. And the client will still have time to run some idle cycles while waiting for the next row to come in over your network.

But the correct answer is not always the practical one. What if the manager needs his report NOW, but there’s no one with ASP of VB skills available? In that case, I will use the database for formatting, since I happen to be better in writing SQL than in writing VB. So I will have to combine these two queries to get a single report as output. And that brings me right back to the question HOW to do that.

A common mistake is to just combine the two queries into this single query:

SELECT soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason

FROM Sales.SalesOrderHeader AS soh

INNER JOIN Sales.SalesOrderDetail AS sod

ON sod.SalesOrderID = soh.SalesOrderID

LEFT JOIN Sales.SalesOrderHeaderSalesReason AS sohsr

ON sohsr.SalesOrderID = soh.SalesOrderID

LEFT JOIN Sales.SalesReason AS sr

ON sr.SalesReasonID = sohsr.SalesReasonID

— For testing, use only six SalesOrders

WHERE soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY soh.SalesOrderNumber, sod.ProductID, sr.[Name];

But that doesn’t give me the results I need. For example, SalesOrder SO51319 is listed nine times instead of three. This is actually quite logical if you think about it – there are three products included in this order and three reasons, and since I didn’t specify any additional criteria in the JOIN clause, each of the products is combined to each of the reasons for a total of 3 * 3 = 9 rows.

If I want to get just the three rows, I’ll have to extend the JOIN clause so as to ensure that the “first” product gets joined to the “first” reason, the “second” product to the “second” reason, etc. But how do I (or rather: how does the database) know what the “first” product and the “first” reason are? The answer is that I must supply my own definition of “first”, ”second”, etc, then write the SQL to calculate these rankings. For this example, let’s use the order of the ORDER BY clause in the queries above to assign ranks. So the lowest numbered product and the alphabetically first reason get a rank of one, etc.

In SQL Server 2000, calculating a rank required the use of a subquery. SQL Server 2005 introduces the cool new RANK() function that I can use instead. I’ll get straight to the code, both for SQL Server 2005 and for SQL Server 2000, with some comments below.

— Syntax for SQL Server 2005 only – uses the new RANK() function

SELECT soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason

FROM (SELECT sod1.SalesOrderID, sod1.ProductID,

RANK() OVER (PARTITION BY sod1.SalesOrderID

ORDER BY sod1.ProductID) AS Rank

FROM Sales.SalesOrderDetail AS sod1) AS sod

FULL JOIN (SELECT sohsr1.SalesOrderID, sr1.[Name],

RANK() OVER (PARTITION BY sohsr1.SalesOrderID

ORDER BY sr1.[Name]) AS Rank

FROM Sales.SalesOrderHeaderSalesReason AS sohsr1

INNER JOIN Sales.SalesReason AS sr1

ON sr1.SalesReasonID = sohsr1.SalesReasonID) AS sr

ON sr.SalesOrderID = sod.SalesOrderID

AND sr.Rank = sod.Rank

INNER JOIN Sales.SalesOrderHeader AS soh

ON soh.SalesOrderID = COALESCE (sod.SalesOrderID, sr.SalesOrderID)

— For testing, use only six SalesOrders

WHERE soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY soh.SalesOrderID, COALESCE(sr.Rank, sod.Rank);

— Syntax for SQL Server 7.0 and up

SELECT soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason

FROM (SELECT sod1.SalesOrderID, sod1.ProductID,

(SELECT COUNT()

                       FROM   Sales.SalesOrderDetail AS sod2

                       WHERE sod2.SalesOrderID = sod1.SalesOrderID

                       AND    sod2.ProductID <= sod1.ProductID) AS Rank

            FROM       Sales.SalesOrderDetail AS sod1) AS sod

FULL JOIN (SELECT     sohsr1.SalesOrderID, sr1.[Name],

(SELECT COUNT()

FROM Sales.SalesOrderHeaderSalesReason AS sohsr2

JOIN Sales.SalesReason AS sr2

ON sr2.SalesReasonID = sohsr2.SalesReasonID

WHERE sohsr2.SalesOrderID = sohsr1.SalesOrderID

AND sr2.[Name] <= sr1.[Name]) AS Rank

FROM Sales.SalesOrderHeaderSalesReason AS sohsr1

INNER JOIN Sales.SalesReason AS sr1

ON sr1.SalesReasonID = sohsr1.SalesReasonID) AS sr

ON sr.SalesOrderID = sod.SalesOrderID

AND sr.Rank = sod.Rank

INNER JOIN Sales.SalesOrderHeader AS soh

ON soh.SalesOrderID = COALESCE (sod.SalesOrderID, sr.SalesOrderID)

— For testing, use only six SalesOrders

WHERE soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY soh.SalesOrderID, COALESCE(sr.Rank, sod.Rank);

Both versions of the query use two derived tables. And each of those derived tables contains the output of one of the two queries I started with, plus an additional column to hold the rank (based on the ordering I defined). These two derived tables are the joined together on identical SalesOrderID and identical rank, using a FULL OUTER JOIN. Using a LEFT OUTER JOIN or RIGHT OUTER JOIN would not have been sufficient here, since there are orders with more products than reasons or even no reasons at all (e.g. SO15318, SO15320), but also orders with more reasons than products (e.g. SO51924 [not included in the test set!]). The result of this outer join is then combined (using INNER JOIN – that is safe since each order always has at least one product) with the SalesOrderHeader table to get the order number.

Okay, so we now have got a working solution. Let’s check how it performs, ‘kay? So I removed the WHERE clause from all my queries, added an INTO #Dummy clause to make sure I tested the speed of the server, not the speed of my front end (SSMS) formatting and displaying the data, and enclosed each of the queries in a loop to execute them five times (one execution with cold cache, after using CHECKPOINT, DBCC DROPCLEANBUFFERS, and DBCC FREEPROCCACHE; the other four with hot cache). Here are the rounded results when testing on my system, running SQL Server 2005 on Windows XP Professional. (Note that results on other systems will probably be different!)

Just the two basic queries, assuming that the front end will have ample time to perform the balanced line processing while waiting for the next row to arrive – 3.5 seconds.
The SQL Server 2005 version with the RANK() function – 4 seconds.
The SQL Server 7.0 and later version without RANK() – 14 seconds.

Conclusion: Doing this kind of formatting client side is still preferred. The queries are easier to grasp and hence easier to maintain, and it’s the method that gives the best performance. But if you have to do it server side, make sure to use the new RANK() function if you’re already on SQL Server 2005, rather than the version with subqueries. . Not only are they easier to read, It performs MUCH better than the subqueries that are required for the older versions of SQL Server, and it’s a lot easier to read.

Hugo Kornelis

Hugo Kornelis is an established SQL Server Community expert. He writes, blogs, speaks, tech edits, and researches, focusing mostly on SQL Server performance and execution plans. He was the technical editor for the third edition of Grant Fritchey's "SQL Server Execution Plans". In 2018 he started a project to document all behaviour of SQL Server execution plans at his website, "the SQL Server Execution Plan Reference" (https://sqlserverfast.com/epr) When not working for the community, Hugo is busy at his day job: freelance database developer/consultant. Hugo has over 20 years of experience on SQL Server in various roles. He has a strong database design background but has since specialized into query tuning and execution plans.

Snapshot isolation: A threat for integrity? (Part 1)

3 Comments. Leave new

Alexander Kuznetsov
July 30, 2006 01:17

Hi Hugo,

Very interesting. You might be interested in my article on several similar problems:

http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0401kuznetsov/index.html

Unfortunately, not all OLPA functions are implemented in SQL 2004 yet.

Loading...

Reply
Alexander Kuznetsov
July 31, 2006 00:16

Hi Hugo,

One more thing: if there are duplicates, RANK() has gaps and ROW_NUMBER() is better suited for the task. i think duplicates are quite possible in your situation. For instance, I bought by mail bags for my vacuum cleaner more than once from one and the same company.

Loading...

Reply
Hugo Kornelis
August 12, 2006 22:00

Hi Alexander,

You are absolutely right about using ROW_NUMBER rather than RANK if duplicates are possible. Thanks for the addition.

This also shows another bonus of the new ranking functions. The SQL Server 2000 version of my query would fail if duplicates were allowed in SalesOrderDetail and/or SalesOrderHeaderSalesReason, just like the SQL Server 2005 version with RANK(). But whereas the latter version is easily fixed with ROW_NUMBER, there’s no easy fix for the SQL Server 2000 version.

Thanks for your comments (and for the link to your great article)!

Loading...

Reply

Relating unrelated rows

Relating unrelated rows

Like this:

Related Posts

3 Comments. Leave new

Leave a ReplyCancel reply

Hugo Kornelis

Subscribe via Email

Meta

Archives

Relating unrelated rows

Relating unrelated rows

Share this:

Like this:

Related Posts

3 Comments. Leave new

Leave a ReplyCancel reply

Hugo Kornelis

Subscribe via Email

Meta

Archives