The many (too many?) reads of a many to many Merge Join

SQL Server’s optimizer can choose between four different physical operators for joining data. The Merge Join operator is usually the best choice if the input data is already sorted by the join columns. It processes both inputs at the same time, each time advancing the input that currently has the lower value of the join key. This algorithm, known as “balance line processing” back in the olden days of magnetic tape storage, always finds all matches with just a single pass over each of the inputs.

One to many

One of the most important properties of a Merge Join operator is Many to Many. Far too few people are aware of how important this property is. Ideally, the value of this property is False. In that case, the Merge Join runs in “one to many” mode, which is every bit as efficient as the introduction suggests. To see an example of this ideal case, run the query below in the AdventureWorks sample database:

SELECT     sod.ProductID,
           sod.OrderQty,
           soh.ShipDate
FROM       Sales.SalesOrderDetail AS sod
INNER JOIN Sales.SalesOrderHeader AS soh
      ON   soh.SalesOrderID = sod.SalesOrderID;

The execution plan for this query looks like this. Both input tables are scanned. That makes sense: there are no filters in the query, so we need all rows. Scanning is the optimal way to read all rows. A Merge Join operator combines the inputs. If you look at the properties of the Merge Join, you’ll see that the Many to Many property is False: this is a one to many join.
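
(As an aside: if you prefer not to click through the graphical plan, you can also read this property from the plan XML. The query below is a minimal sketch of how to do that for plans currently in the plan cache; it assumes, based on my own inspection of showplan XML rather than on documentation, that the property is exposed as a ManyToMany attribute on the Merge element.)

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT st.text                                                       AS query_text,
       qp.query_plan.value('(//Merge/@ManyToMany)[1]', 'varchar(5)') AS many_to_many
FROM   sys.dm_exec_cached_plans                    AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle)   AS st
WHERE  qp.query_plan.exist('//Merge') = 1;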

You can also see that the optimizer has switched the order of the two inputs. For a one to many Merge Join, the input without duplicates has to be the top input. In this case, the primary key constraint on SalesOrderID in the SalesOrderHeader table guarantees that there are no duplicate values, so this input had to be on top.

As stated above, the Merge Join algorithm always advances the input with the lower value in the join columns. If the values are equal, it always advances the bottom input, the one that can have duplicates. If the next row does indeed have the same SalesOrderID as the previous row, it will still match with the current row from the top input. The top input only advances when the bottom input has reached the next higher value. Because the top input has no duplicates, advancing it will also produce the next higher value from that input.

Many to many

The Merge Join operator has complete symmetry in the supported logical join operations. Whether outer, semi, or anti semi – both the left and the right versions work. This means that there is no reason to worry about “many to one” (as opposed to “one to many”): the optimizer would simply reverse the order of the inputs. When both inputs have guaranteed uniqueness (“one to one”), then either input can be at the top side and it will work. But what if we have no uniqueness guarantee in either of the inputs? The query below illustrates this.

SELECT     pv.BusinessEntityID,
           pv.StandardPrice,
           sod.ProductID,
           sod.SalesOrderID
FROM       Sales.SalesOrderDetail   AS sod
INNER JOIN Purchasing.ProductVendor AS pv
      ON   sod.ProductID = pv.ProductID;

The execution plan for this query looks quite similar to the one we saw before. However, in this case the Merge Join has a very significant difference, visible only in its properties: the Many to Many property is True. This execution plan is far less efficient than the previous example.

Before going into the reasons why a many to many join is less efficient, let’s first check what would happen if we were to use the same algorithm. At some point, it finds a match. Both inputs have the same ProductID value, so the bottom input advances to find the next ProductVendor row. If that row is for the same ProductID, it is another match with the current row from SalesOrderDetail, and the ProductVendor input is once more advanced. This continues until we hit a new ProductID in ProductVendor. At that point the next row from SalesOrderDetail is read. But what if that row has the same ProductID as the previous row? The bottom input has just advanced past all the matching rows, which are now needed again. But execution plan operators do not support a GetPrevious() method; only GetNext() exists. So how does the Merge Join produce the required join results in this case? The answer is that it stores the data it might need again in tempdb.

You can see this when you rerun the queries with SET STATISTICS IO ON. The query with a one to many Merge Join shows only the logical reads for the two input tables. However, the query with a many to many Merge Join has a third line of output, for a “Worktable”. This represents the scratch area that the Merge Join operator has to use because there is no other way to retrieve “earlier” rows when the same join column value is repeated.
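
(For completeness, this is the pattern I used to capture the output below; it is simply the many to many query shown above, wrapped in SET STATISTICS IO.)

SET STATISTICS IO ON;

SELECT     pv.BusinessEntityID,
           pv.StandardPrice,
           sod.ProductID,
           sod.SalesOrderID
FROM       Sales.SalesOrderDetail   AS sod
INNER JOIN Purchasing.ProductVendor AS pv
      ON   sod.ProductID = pv.ProductID;

SET STATISTICS IO OFF;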

Table 'Worktable'. Scan count 19, logical reads 18013, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ProductVendor'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 1, logical reads 244, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

But why so many?

What surprises me in the statistics io output of the last example is the number of reads for the worktable. Why 18,013 logical reads? Doesn’t that seem a bit excessive in comparison to the amount of data processed? And why 19 scans: where does that number come from?

One way to find out is to look at what happens when we change the data. For instance, how will these numbers change if we add one extra row to the ProductVendor table before running the query? Here is the code I used to try this:

BEGIN TRANSACTION;

INSERT INTO Purchasing.ProductVendor
           (ProductID,   BusinessEntityID, AverageLeadTime, StandardPrice,
            MinOrderQty, MaxOrderQty,      UnitMeasureCode, ModifiedDate)
VALUES     (906,         1492,             1,               0.01,
            1,           1,                N'PC',           CURRENT_TIMESTAMP);

SET STATISTICS IO ON;

SELECT     pv.BusinessEntityID,
           pv.StandardPrice,
           sod.ProductID,
           sod.SalesOrderID
FROM       Sales.SalesOrderDetail   AS sod
INNER JOIN Purchasing.ProductVendor AS pv
      ON   sod.ProductID = pv.ProductID;

ROLLBACK TRANSACTION;

I executed this code multiple times with different values for ProductID in the inserted row. The effects on the statistics io output are in the table below. (For the record, I actually ran many more tests, with different values, different tables, other queries, etc.; these are just the most interesting results.)

ProductID   Scans   Logical reads
906         19      18,013
907         20      18,245
908         19      18,015

So … adding one row to the ProductVendor table can leave the results unchanged, add just 2 logical reads, or add a scan and 232 logical reads. How does this make sense? It actually does if you also look at the existing data in the database. There currently are no rows for ProductID 906, one row for 907, and two rows for 908. So adding a row that is not a duplicate has no effect on the worktable; adding a row that causes a duplicate increases the scan count and causes a lot of extra logical reads; and adding a further duplicate for an already duplicated value only adds two logical reads. This gives me the information I need to understand how this works.
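
(If you want to check those counts yourself, a quick query such as the one below does the job. This is just my own verification sketch; note that ProductID 906 simply does not show up in the results, because it has no rows at all.)

SELECT   ProductID,
         COUNT(*) AS NumberOfVendors
FROM     Purchasing.ProductVendor
WHERE    ProductID IN (906, 907, 908)
GROUP BY ProductID;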

The explanation

Adding a non-duplicate does not affect the worktable. Until today I thought that a many to many Merge Join simply copies all rows from one input to the worktable, in case they are needed. Now I know I was wrong. Only actual duplicates in the bottom input cause worktable activity. Apparently, when a many to many Merge Join finds a new value in the bottom input, it peeks ahead at the value after that to see whether this new value is a duplicate or not. If it is, it copies rows from the top input to the worktable as it processes them. The data in the worktable can then be read for all other rows in the bottom input with the same value in the join columns. If it is not a duplicate, the worktable is not touched at all. I was able to confirm this by creating a query that does a many to many Merge Join where the bottom input could have duplicates but in reality didn’t. The statistics io output showed zero scans and zero logical reads for the worktable.
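
To give an idea of how such a test can look (a sketch, not necessarily the exact query I used): copy the distinct ProductID values into a temp table without any uniqueness constraint, so that the optimizer has to assume duplicates are possible, and then force a Merge Join with that table as the bottom input. The hints are for the test only; whether you get exactly this plan shape can depend on your version and statistics, so do check the actual plan.

SELECT DISTINCT ProductID
INTO   #UniqueProducts
FROM   Purchasing.ProductVendor;

SET STATISTICS IO ON;

-- FORCE ORDER keeps SalesOrderDetail as the top input and the temp table,
-- which could have duplicates but in reality does not, as the bottom input.
SELECT     sod.SalesOrderID,
           sod.ProductID
FROM       Sales.SalesOrderDetail AS sod
INNER JOIN #UniqueProducts        AS up
      ON   sod.ProductID = up.ProductID
OPTION     (MERGE JOIN, FORCE ORDER);

SET STATISTICS IO OFF;
DROP TABLE #UniqueProducts;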

Adding an extra row with an already duplicated value does impact worktable activity, but only by an additional two logical reads. The value was already a duplicate, so the rows from the top input are already in the worktable. By adding one extra row in the bottom input, I caused one extra pass over the data in the worktable. That takes two logical reads, which makes perfect sense for reading the 153 SalesOrderDetail rows with ProductID 908: they all fit on a single page, and the two reads are for the root page and the leaf page of the worktable’s clustered index.

Adding a row to create a duplicate has a much higher impact. When I added ProductID 907, the number of logical reads went up by 232. Two of those are produced when the duplicate row is processed, for reading the data from the worktable. That means that writing to the worktable incurs 230 logical reads. There are 113 SalesOrderDetail rows for ProductID 907, so it appears that the number of logical reads is equal to two reads per row added to the worktable, plus 4. Further experiments with other values confirm that this formula is correct. This leads me to the conclusion that the worktable used for a many to many Merge Join has the same internal structure as the worktable used by a Table Spool (see this blog post): it uses a clustered index that it has to traverse for every added row.
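
(To make the arithmetic explicit for ProductID 907: 113 rows × 2 logical reads + 4 = 230 logical reads to fill the worktable, plus the 2 logical reads to read it back when the second ProductVendor row for this product is processed, which matches the 232 extra reads observed.)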

The number of scans for the worktable is always equal to the number of values with duplicates in the bottom input. In the original ProductVendor table, 19 products that also appear in SalesOrderDetail have more than one vendor. How often these duplicated values appear does not impact the scan count. This seems counter-intuitive to me. I would expect to see one scan for every time the worktable serves a copy of the data. In reality the scan count goes up when the worktable is cleared out and filled with new data. I have no explanation for this. If anyone does: please use the comment section below to let me know!
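
(Again a quick verification sketch, not part of the original test script: the query below counts the products that appear in SalesOrderDetail and have more than one vendor, which is where the 19 comes from.)

SELECT COUNT(*) AS ProductsWithMultipleVendors
FROM   (SELECT   pv.ProductID
        FROM     Purchasing.ProductVendor AS pv
        WHERE    pv.ProductID IN (SELECT sod.ProductID
                                  FROM   Sales.SalesOrderDetail AS sod)
        GROUP BY pv.ProductID
        HAVING   COUNT(*) > 1) AS dup;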

Conclusion

Whenever I see something in an execution plan that I do not understand, I dive in and try to figure out what is happening. In this case, it was the high number of logical reads that triggered me to investigate. I now have a better understanding of how Merge Join works, and where these numbers come from.

This knowledge can come in handy. I already knew how important it is to have constraints, so that the optimizer has maximum information. Even when there are no actual duplicates in one of the inputs, if the optimizer thinks there might be duplicates it will use a many to many Merge Join. And though this does not write to tempdb unless there actually are duplicates, it still adds overhead to the algorithm. Plus, the additional estimated cost of writing to and reading from the worktable can even cause the optimizer to reject a many to many Merge Join and use a Hash Join instead, in cases where a one to many Merge Join would have been better.

In cases where you really do need a many to many Merge Join, the optimizer tries to put the input with the fewest duplicated values at the bottom, to reduce the activity in tempdb as far as possible. But the optimizer doesn’t always have a full understanding of your data. If you see a many to many Merge Join in an execution plan, check the statistics io to see how much tempdb activity it triggers, and verify whether you think the optimizer made the right choice in which input to put on top and which at the bottom. If you think it made a suboptimal choice, test your theory; using a FORCE ORDER hint on a test system is the easiest way to do that. If you are right, then try to find ways to get the optimizer to change its mind. Do not use a hint for this unless you really see no other alternative. Options such as creating or updating statistics, rewriting the query, or ensuring that all known data rules are represented as constraints and that these constraints are actually enabled and trusted should all be tried first.
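
To illustrate what such a test could look like (a sketch only; FORCE ORDER preserves the join order as written, so the first table in the FROM clause becomes the top input):

SET STATISTICS IO ON;

SELECT     pv.BusinessEntityID,
           pv.StandardPrice,
           sod.ProductID,
           sod.SalesOrderID
FROM       Sales.SalesOrderDetail   AS sod
INNER JOIN Purchasing.ProductVendor AS pv
      ON   sod.ProductID = pv.ProductID
OPTION     (MERGE JOIN, FORCE ORDER);

SET STATISTICS IO OFF;

Compare the worktable reads in the statistics io output of this version with those of the version that lists the tables in the opposite order.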


3 Comments

  • Hi Hugo,
    This is excellent, thanks for sharing.
    Does the Storage Engine create a clustered index for the worktable, or just an index structure (maybe nonclustered)?
    And can you share your view on the timing of the structure creation: is this on-disk structure created in the middle of execution, when the Merge Join starts executing?

  • Hugo Kornelis
    March 8, 2018 14:18

    Hi Neeraj,

    Thanks for your comment!

    I’ll try to answer your question but I am not sure that I actually understand what you are asking. I think you want to know when exactly a Merge Join creates a worktable and how it is structured?
    So in that case, the answer is: A “one to many” Merge Join will never create a worktable. A “many to many” Merge Join will always create a worktable. However, it will only write to it (and read from it) when there is an actual duplicate in the bottom input. So if the optimizer makes the join many to many but the bottom input does not really contain duplicates, you get a bit of overhead for creating the worktable, but no overhead of writing to and reading from it. When the bottom input actually does contain duplicates, you can get a lot of overhead (even if the top table happens to not have any duplicates).

    For the structure of the worktable: This is not documented, but based on my observations of the number of logical IOs reported by SET STATISTICS IO ON (both in tests with Merge Join as well as in tests with a Table Spool operator), I am 99% sure that the worktable is built as a clustered index on an internal column that behaves like an IDENTITY. No additional indexes are created for worktables, with one specific exception: when a worktable is created by an Index Spool operator, there is an additional (nonclustered) index.
    The clustered index (and the extra nonclustered index in the case of an Index Spool) is created when the worktable is created, and maintained for every row added to (or removed from) the worktable.

    I hope this answers your questions!

  • Thanks for the quick response.
