This is the third post in the plansplaining series. Each of these blog posts focuses on a sample execution plan that exposes an uncommon and interesting pattern, and details exactly how that plan works.
In the first post, I covered each individual step of each operator in great detail, to make sure that everyone understands exactly how operators work in the pull-based execution plans. In this post (and all future installments), I will leave out the details that I now assume to be known to my readers. If you did not read part 1 already, I suggest you start there.
Today I will look at a pattern that can be seen when working with partitioned tables. I will not use a standard demo database but create my own set of sample tables to get maximum control. Here is the code to create two partitioned tables and load them with some data. Execute this code in your own “playground” database if you want to follow along.
USE Playground; GO SET NOCOUNT ON; -- Create three partitioned demo tables on two partition functions CREATE PARTITION FUNCTION pf1(int) AS RANGE LEFT FOR VALUES (25, 50, 75); CREATE PARTITION FUNCTION pf2(int) AS RANGE RIGHT FOR VALUES (25, 50, 75); CREATE PARTITION SCHEME ps1 AS PARTITION pf1 ALL TO ([PRIMARY]); CREATE PARTITION SCHEME ps2 AS PARTITION pf2 ALL TO ([PRIMARY]); CREATE TABLE dbo.Part_1a (a int NOT NULL, b int NOT NULL, c int NOT NULL, PRIMARY KEY (a, b)) ON ps1(a); CREATE TABLE dbo.Part_1b (a int NOT NULL, b int NOT NULL, c int NOT NULL, PRIMARY KEY (a, b)) ON ps1(a); CREATE TABLE dbo.Part_2 (a int NOT NULL, b int NOT NULL, c int NOT NULL, PRIMARY KEY (a, b)) ON ps2(a); -- Put in some sample data DECLARE @i int = 1; WHILE @i <= 100000 BEGIN; INSERT INTO dbo.Part_1a(a, b, c) VALUES (@i / 1000, @i % 1000, @i % 100) INSERT INTO dbo.Part_1b(a, b, c) VALUES (@i / 1000, @i % 1000, (@i + 1) % 100); INSERT INTO dbo.Part_2(a, b, c) VALUES (@i / 1000, @i % 1000, (@i + 1) % 100); SET @i += 5; END; GO
This may take some time to run (about 2:45 minutes on my laptop). Since it’s a one-time deal, I did not bother to optimize this. Do make sure to run this with the Include Actual Execution Plan option disabled, or you will generate 60,000 execution plans (which will probably crash your SSMS instance).
Once the tables are created and the data is loaded, let’s run a simple query that joins two of the demo tables:
SELECT MAX(p1.c - p2.c) FROM dbo.Part_1a AS p1 INNER JOIN dbo.Part_2 AS p2 ON p2.a = p1.a AND p2.c > p1.c;
The execution plan for this query is very simple and holds no surprises. There is no WHERE clause and the tables are fairly large, so a Nested Loops join would be a bad idea. A Merge Join would be an option, but the equality part of the predicate is on column a only, which is not unique in both tables; so this would be a many to many Merge Join, with all the associated overhead in tempdb. A Hash Match join is a better choice in this case, and that is indeed what the optimizer picks:
Now let’s see what happens if we make one small change to the query. Instead of joining tables Part_1a and Part_2 (both partitioned but on different partition functions), we’ll now join Part_1a and Part_1b (both partitioned and using the same partition function). Here is the changed query:
SELECT MAX(p1.c - p2.c) FROM dbo.Part_1a AS p1 INNER JOIN dbo.Part_1b AS p2 ON p2.a = p1.a AND p2.c > p1.c;
With basically the exact same query you would probably expect to see the exact same execution plan. But that would not make for a good Plansplaining post, so fortunately we do actually get a different execution plan this time:
Comparing this execution plan to the one above, we see that the exact same pattern of a Hash Match join processing two Clustered Index Scans is used here. But in this case that pattern is on the inner side of a Nested Loops join, which means it executes once for each row in the outer input (which is a Constant Scan operator that we’ll examine in more detail shortly). The properties of the Hash Match operator show us exactly how often this section of the plan was executed:
Upon seeing this, you would be forgiven for thinking that the optimizer has gone absolutely bonkers. Why is the optimizer creating a plan that, by the looks of it, appears to do the same work four times? Granted, since the aggregation asks for a MAX only there will be no concerns over correctness of results in this case. But repeating the same task four times? Really? Why?
The good news is that there actually is a good reason for this. And that the execution plan is not actually doing the same work four times. However, this is not visible by looking at a picture of the graphical execution plan; we need to dig deep into the details to understand what’s going on.
The devil is in the details
Before diving into the apparently repeated section of the plan, let’s first examine the Nested Loops join operator and the Constant Scan on its outer input. The Constant Scan operator is by itself actually already a mystery here. Looking at the query, there is no obvious reason at all to have a Constant Scan in this plan, and yet here it is.
Whenever we see a non-obvious Constant Scan operator, out first stop should always be its properties, specifically the Output List and Values properties. These expose the data that the Constant Scan returns. Though these properties are exposed in the property popup, the formatting of the Values property makes it hard to parse. The full properties window has the same data in an easier to understand format:
The Output list shows that this Constant Scan will return rows with just a single column, called Expr1005. Note this column name, because we need to find which operators use this column in order to understand the role of this Constant Scan. The Values property shows that four rows will be returned (not surprising, since we already saw that the inner input of the Nested Loops operator executes four times, and that the value of Expr1005 will be 1 for the first row, 2 for the second, then 3, and finally 4. None of these values appear in the query so they make no sense yet. We really need to find where these values are used!
Looking at the Nested Loops operator, we see a few interesting things. We know that it receives four single-column rows from its outer input, and then executes its inner input for each of these rows. Let’s look at its properties, more specifically at the Output List and Outer References properties:
Let’s start at the bottom, with the Outer References property. We have already seen this property in previous plansplaining posts, but as a quick reminder: Outer References means that a value from the outer input is pushed into the inner input; the inner input then ensures that only matching rows are returned which is why there is no Predicate property on this join operator. In this case Expr1005, the column that is set 1, 2, 3, and 4 for each of the rows from the Constant Scan, is pushed into the inner input.
The Output List does not include Expr1005. Apparently, the Nested Loops operator in this case doesn’t actually join its two input sources; it merely uses the outer input to drive four executions of the inner input, and then returns data from the inner input only. The values returned from the Constant Scan are not returned by the Nested Loops operator, which also means that we now know that the Stream Aggregate doesn’t do anything different than in the first plan – it doesn’t receive Expr1005, so it cannot act upon it.
Clustered Index Scan
The next part of the execution plan to look at is the section with a Hash Match and two Clustered Index Scan operators. I am not discussing the Hash Match here, because there is nothing of interest on this operator. However, both of the Clustered Index Scan operators have a property that most people would not expect to see, ever, on a Clustered Index Scan (or, in fact, any Index Scan) operator: a Seek Predicates property! (I show a screenshot of the second Clustered Index Scan here, but they both have a very similar Seek Predicates property).
Normally, a Seek Predicates property is only found on (Clustered) Index Seek operators, and not on a (Clustered) Index Scan. An Index Scan is not designed to seek; it is designed to read a whole table. So how can this scan suddenly apply a Seek Predicates? And also, what exactly is the predicate here? A column called PtnId1002, that we never created in any of our tables, is compared to the Expr1005 column, the data coming from the Constant Scan and pushed into the Nested Loop’s inner input.
Luckily the optimizer loves to use mnemonic codes when introducing its own columns. The name PtnId1002 is a so-called “partition id”. Remember, all the tables used in this example are partitioned tables. And partitioned tables happen to be the only (as far as I know) context where you can actually see a Seek Predicates property on a scan operator. It is used to limit the scan to one or more selected partitions only. In this case a single partition. Which partition? Well, that is determined by Expr1005.
Remember, Expr1005 is set to 1 for the first row (and hence the first execution of this section of the execution plan), then to 2 for the second, and so on. So for the first execution, the Clustered Index Scan operators will read data from partition 1 only. This applies to both Clustered Index Scan operators. The Hash Match then combines this data, returning the joined results for partition 1, which Nested Loops then passes to Stream Aggregate. After that, the second row from Constant Scan is read, and the inner loop restarts, this time reading and joining data from the second partition of both tables. Once all four partitions are processed in this way, Stream Aggregate returns the maximum value it found and execution stops.
Moving back to the helicopter view, we now know that the section consisting of one Hash Match and two Clustered Index Scans only appears to be the same in the two plans. In reality, the first plan actually processed the entire tables in the Clustered Index Scan operators, causing a massive join between 20,000 rows from each input. The second plan used a Constant Scan to enumerate the partitions, a Nested Loops to repeat the section for each partition, and a surprising Seek Predicates property on each of the Clustered Index Scan operators to process only a single partition in each execution. So while the logic is indeed executed four times, each execution now only had to join 5,000 rows from each input. And though the scans execute four times, they do not scan the full table four times. Each execution scans only the requested partition. In the end, the amount of IO for the Clustered Index Scan operators in this plan is exactly the same as in the first plan.
This is an optimization pattern known as “join collocation”. The most obvious place where you can see the effect of this optimization is in the memory grant (visible as a property of the SELECT operator in each execution plan). The first query requests 6112 KB, needed for the Hash Match operator to store 20,000 rows in the in-memory hash table. Because the second query processes the same 20,000 rows in four independent chunks of 5,000 rows each, its memory grant is a mere 1536 KB. This memory will be reused for each of the four executions.
On my laptop, this is not very relevant. But imagine a busy system, with lots of users, plus a buffer pool, a plan cache, and other resources, all competing over the same precious memory. Now a 75% reduction in the memory footprint of a query suddenly becomes very valuable!
Why not both?
At this point you may be wondering why SQL Server doesn’t always apply this pattern? Why did we not get a similar “join collocation” plan for the first query? Or for every join ever done between two large tables?
The answer is simple. Join collocation needs a few conditions to be met before it can be used. The conditions are:
- Both tables in the join need to be partitioned.
- The join condition has to include equality of the partition columns in the two tables.
- Both tables in the join need to use the same partitioning function. Or rather, the partitions need to align exactly (if you use two different partition functions with the exact same definition, you will still get join collocation).
The combination of these three requirements enables the join collocation pattern. Because the tables are partitioned, the (Clustered) Index Scan operators can use a Seek Predicates to return only a subset of the data without overhead. Because the partitions align and the join is on equality of the partition columns, we know that data that needs to be joined from both sources is processed in the same execution of the inner input.
In the case of the first query, the partitions did not align completely: they use the same boundary values, but one uses RANGE LEFT and the other RANGE RIGHT. Now it is possible that a row that is in partition 1 of the first table may need to be joined to a row in partition 2 of the second table, but these rows would not be processed during the same execution so the results can be incorrect.
Details are always important. Just because two execution plan fragments look the same in the graphical execution plan, does not mean they are the same. You always need to look at all the details that are available in the property sheet.
In this post, it appeared at first sight as if the same logic was being executed multiple times. But by looking at the properties we were able to ascertain that these four executions each worked on a quarter of the data. We were also able, by looking at the properties of a Constant Scan and following the generated column through the execution plan, to make sense of an operator that at first appeared to make no sense at all.
In plansplaining part four, planned for April, we will look at an execution plan that uses Bitmap operators to optimize a star join. However, I do not want this series to be about only my interesting plans. Reader input is greatly encouraged! If you have ever run into a pattern in an execution plan that appeared to make no sense until you dug into the gory details, or even into a pattern that still makes no sense to you, let me know and I will consider it for inclusion in a future post. (Note that I WILL ask you for a copy of the actual execution plan, saved as a .sqlplan file, or for a repro script).