Relating unrelated rows

Relating unrelated rows

A recurring question in the newsgroups is: “How do I output unrelated rows side by side?”

 

I’ll use the AdventureWorks database as an example. Suppose that my manager wants a report listing, for each order, all products ordered and all reasons the customer has for ordering from AdventureWorks. Here’s a sample of how the report should look.

 

SalesOrderNumber

ProductID

Reason

SO51318

858

Other

SO31518

870

Price

SO31518

872

NULL

SO31518

997

NULL

SO31519

870

On Promotion

SO31519

999

Price

 

Getting a list of all products ordered for each order is a trivial affair:

 

SELECT      soh.SalesOrderNumber, sod.ProductID

FROM        Sales.SalesOrderHeader AS soh

INNER JOIN  Sales.SalesOrderDetail AS sod

      ON    sod.SalesOrderID = soh.SalesOrderID

— For testing, use only six SalesOrders

WHERE       soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY    soh.SalesOrderNumber, sod.ProductID;

 

And getting a list of all reasons for an order is only slightly more complicated:

 

SELECT      soh.SalesOrderNumber, sr.[Name] AS Reason

FROM        Sales.SalesOrderHeader AS soh

LEFT  JOIN  Sales.SalesOrderHeaderSalesReason AS sohsr

      ON    sohsr.SalesOrderID = soh.SalesOrderID

LEFT  JOIN  Sales.SalesReason AS sr

      ON    sr.SalesReasonID = sohsr.SalesReasonID

— For testing, use only six SalesOrders

WHERE       soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY    soh.SalesOrderNumber, sr.[Name];

 

(Note the use of an OUTER JOIN is required, because some orders have no reasons associated with them.)

 

But how to combine these two lists into a single query? Of course, we all know that the correct answer is: “Don’t. That’s formatting, and formatting is best handled at the front end.” It’s even incredibly easy to do this in the front end. Just open two connections to SQL Server (or get really funky and use MARS on a single connection J), send the two queries above to the server and use the standard balanced line algorithm to process the incoming rows. Easy. And the client will still have time to run some idle cycles while waiting for the next row to come in over your network.

 

But the correct answer is not always the practical one. What if the manager needs his report NOW, but there’s no one with ASP of VB skills available? In that case, I will use the database for formatting, since I happen to be better in writing SQL than in writing VB. So I will have to combine these two queries to get a single report as output. And that brings me right back to the question HOW to do that.

 

A common mistake is to just combine the two queries into this single query:

 

SELECT      soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason

FROM        Sales.SalesOrderHeader AS soh

INNER JOIN  Sales.SalesOrderDetail AS sod

      ON    sod.SalesOrderID = soh.SalesOrderID

LEFT  JOIN  Sales.SalesOrderHeaderSalesReason AS sohsr

      ON    sohsr.SalesOrderID = soh.SalesOrderID

LEFT  JOIN  Sales.SalesReason AS sr

      ON    sr.SalesReasonID = sohsr.SalesReasonID

— For testing, use only six SalesOrders

WHERE       soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY    soh.SalesOrderNumber, sod.ProductID, sr.[Name];

 

But that doesn’t give me the results I need. For example, SalesOrder SO51319 is listed nine times instead of three. This is actually quite logical if you think about it – there are three products included in this order and three reasons, and since I didn’t specify any additional criteria in the JOIN clause, each of the products is combined to each of the reasons for a total of 3 * 3 = 9 rows.

 

If I want to get just the three rows, I’ll have to extend the JOIN clause so as to ensure that the “first” product gets joined to the “first” reason, the “second” product to the “second” reason, etc. But how do I (or rather: how does the database) know what the “first” product and the “first” reason are? The answer is that I must supply my own definition of “first”, ”second”, etc, then write the SQL to calculate these rankings. For this example, let’s use the order of the ORDER BY clause in the queries above to assign ranks. So the lowest numbered product and the alphabetically first reason get a rank of one, etc.

 

In SQL Server 2000, calculating a rank required the use of a subquery. SQL Server 2005 introduces the cool new RANK() function that I can use instead. I’ll get straight to the code, both for SQL Server 2005 and for SQL Server 2000, with some comments below.

 

— Syntax for SQL Server 2005 only – uses the new RANK() function

SELECT      soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason

FROM       (SELECT     sod1.SalesOrderID, sod1.ProductID,

                       RANK() OVER (PARTITION BY sod1.SalesOrderID

                                    ORDER BY sod1.ProductID) AS Rank

            FROM       Sales.SalesOrderDetail AS sod1) AS sod

FULL  JOIN (SELECT     sohsr1.SalesOrderID, sr1.[Name],

                       RANK() OVER (PARTITION BY sohsr1.SalesOrderID

                                    ORDER BY sr1.[Name]) AS Rank

            FROM       Sales.SalesOrderHeaderSalesReason AS sohsr1

            INNER JOIN Sales.SalesReason AS sr1

                  ON   sr1.SalesReasonID = sohsr1.SalesReasonID) AS sr

      ON    sr.SalesOrderID = sod.SalesOrderID

      AND   sr.Rank = sod.Rank

INNER JOIN  Sales.SalesOrderHeader AS soh

      ON    soh.SalesOrderID = COALESCE (sod.SalesOrderID, sr.SalesOrderID)

— For testing, use only six SalesOrders

WHERE       soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY    soh.SalesOrderID, COALESCE(sr.Rank, sod.Rank);

 

— Syntax for SQL Server 7.0 and up

SELECT      soh.SalesOrderNumber, sod.ProductID, sr.[Name] AS Reason

FROM       (SELECT     sod1.SalesOrderID, sod1.ProductID,

                      (SELECT COUNT()

                       FROM   Sales.SalesOrderDetail AS sod2

                       WHERE  sod2.SalesOrderID = sod1.SalesOrderID

                       AND    sod2.ProductID <= sod1.ProductID) AS Rank

            FROM       Sales.SalesOrderDetail AS sod1) AS sod

FULL  JOIN (SELECT     sohsr1.SalesOrderID, sr1.[Name],

                      (SELECT COUNT()

                       FROM   Sales.SalesOrderHeaderSalesReason AS sohsr2

                       JOIN   Sales.SalesReason AS sr2

                         ON   sr2.SalesReasonID = sohsr2.SalesReasonID

                       WHERE  sohsr2.SalesOrderID = sohsr1.SalesOrderID

                       AND    sr2.[Name] <= sr1.[Name]) AS Rank

            FROM       Sales.SalesOrderHeaderSalesReason AS sohsr1

            INNER JOIN Sales.SalesReason AS sr1

                  ON   sr1.SalesReasonID = sohsr1.SalesReasonID) AS sr

      ON    sr.SalesOrderID = sod.SalesOrderID

      AND   sr.Rank = sod.Rank

INNER JOIN  Sales.SalesOrderHeader AS soh

      ON    soh.SalesOrderID = COALESCE (sod.SalesOrderID, sr.SalesOrderID)

— For testing, use only six SalesOrders

WHERE       soh.SalesOrderID BETWEEN 51315 AND 51320

ORDER BY    soh.SalesOrderID, COALESCE(sr.Rank, sod.Rank);

 

Both versions of the query use two derived tables. And each of those derived tables contains the output of one of the two queries I started with, plus an additional column to hold the rank (based on the ordering I defined). These two derived tables are the joined together on identical SalesOrderID and identical rank, using a FULL OUTER JOIN. Using a LEFT OUTER JOIN or RIGHT OUTER JOIN would not have been sufficient here, since there are orders with more products than reasons or even no reasons at all (e.g. SO15318, SO15320), but also orders with more reasons than products (e.g. SO51924 [not included in the test set!]). The result of this outer join is then combined (using INNER JOIN – that is safe since each order always has at least one product) with the SalesOrderHeader table to get the order number.

 

Okay, so we now have got a working solution. Let’s check how it performs, ‘kay? So I removed the WHERE clause from all my queries, added an INTO #Dummy clause to make sure I tested the speed of the server, not the speed of my front end (SSMS) formatting and displaying the data, and enclosed each of the queries in a loop to execute them five times (one execution with cold cache, after using CHECKPOINT, DBCC DROPCLEANBUFFERS, and DBCC FREEPROCCACHE; the other four with hot cache). Here are the rounded results when testing on my system, running SQL Server 2005 on Windows XP Professional. (Note that results on other systems will probably be different!)

 

  • Just the two basic queries, assuming that the front end will have ample time to perform the balanced line processing while waiting for the next row to arrive – 3.5 seconds.
  • The SQL Server 2005 version with the RANK() function – 4 seconds.
  • The SQL Server 7.0 and later version without RANK() – 14 seconds.

 

Conclusion: Doing this kind of formatting client side is still preferred. The queries are easier to grasp and hence easier to maintain, and it’s the method that gives the best performance. But if you have to do it server side, make sure to use the new RANK() function if you’re already on SQL Server 2005, rather than the version with subqueries. . Not only are they easier to read, It performs MUCH better than the subqueries that are required for the older versions of SQL Server, and it’s a lot easier to read.

Snapshot isolation: A threat for integrity? (Part 1)

Related Posts

No results found.

3 Comments. Leave new

  • Alexander Kuznetsov
    July 30, 2006 01:17

    Hi Hugo,

    Very interesting. You might be interested in my article on several similar problems:

    http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0401kuznetsov/index.html

    Unfortunately, not all OLPA functions are implemented in SQL 2004 yet.

    Reply
  • Alexander Kuznetsov
    July 31, 2006 00:16

    Hi Hugo,

    One more thing: if there are duplicates, RANK() has gaps and ROW_NUMBER() is better suited for the task. i think duplicates are quite possible in your situation. For instance, I bought by mail bags for my vacuum cleaner more than once from one and the same company.

    Reply
  • Hugo Kornelis
    August 12, 2006 22:00

    Hi Alexander,

    You are absolutely right about using ROW_NUMBER rather than RANK if duplicates are possible. Thanks for the addition.

    This also shows another bonus of the new ranking functions. The SQL Server 2000 version of my query would fail if duplicates were allowed in SalesOrderDetail and/or SalesOrderHeaderSalesReason, just like the SQL Server 2005 version with RANK(). But whereas the latter version is easily fixed with ROW_NUMBER, there’s no easy fix for the SQL Server 2000 version.

    Thanks for your comments (and for the link to your great article)!

    Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close