Welcome to part twenty-one of the plansplaining series, where we will continue our look at execution plans for graph queries. In the previous post, we looked at the internal structure of node and edge tables, and discovered that they have a few hidden columns. Now let’s look how those columns are used in graph queries.
As mentioned in my previous post, I’m using sample tables and data from Microsoft’s documentation for these posts. You can find the script to create and populate the tables here, or in my first post on SQL Graph.
Let’s start with a really basic query. Here is the query that you can execute if you want to find everyone living in Redmond:
SELECT Person.name FROM dbo.City, dbo.livesIn, dbo.Person WHERE MATCH(Person-(livesIn)->City) AND City.name = 'Redmond';
If you’ve never queried graph tables before, then this syntax might take some getting used to. As you see, there are no join conditions in the FROM clause, only table names. This is a syntax that is actually allowed for all queries and joins, but because it is both outdated and error-prone, it is typically not recommended to use this syntax when using normal join conditions. However, the special syntax for graph queries requires this type of join. All tables used in the query must be listed here, but their order is irrelevant.
On a logical level, the FROM defines a cross join (Cartesian product) between the three tables. And if you forget to add a WHERE clause to restrict the output, then you will indeed get the Cartesian product in your results.
The first line of the WHERE uses a special syntax for graph-related join conditions. This format always uses the MATCH function, with a graph-based join condition within its parentheses. The example here is very simple. You can think of the –( and )-> that surround the name of an edge table (likes, in this case) as indicating the specific relation that this edge has to store. So we are looking for “likes” edges, that represent a fact that a specific person lives in a specific city.
Note that I could also have written this in a different order, with the City node table first and the Person node table last, but then I would have to reverse the arrow direction to still keep the same meaning: “MATCH(City<-(livesIn)-Person)”. When I switch the two nodes tables without reversing the arrow direction, we change the query to now look for cities that live in a person.
I obviously want to know how SQL Server processes the data in the relevant tables to find the people who live in Redmond. So let’s take a look at the execution plan. Here is a screenshot of the execution plan with run-time statistics, and I have once more added the Node ID values in the screenshot for easier reference in the rest of the text.
Note that Node ID 5 is missing In an early stage of the execution plan, there is a Compute Scalar operator between Nested Loops 1 and RID Lookup 6, but since no actual computations are needed, this node gets removed in the final cleanup.
Execution starts as always with the top-left SELECT calling its child, Nested Loops 0, to request a row. That operator immediately calls its child, and so on, until control is passed to Table Scan 3.
Default indexes for graph tables
I’m generally not happy when I see a Table Scan in an execution plan. Not because it is a scan. Scans may be bad or may be great, depending on circumstances. The reason I dislike Table Scans is because they indicate that the underlying table is a heap, rather than a clustered index. As a general recommendation, I always advice people to create a clustered index on each and every table, and only use heaps when they really need to, and when they know exactly what kind of problems they can cause and how to work around them.
For the livesIn table, we didn’t specify any indexes when we created it. How could we? Remember, this is the statement we used to create the table:
CREATE TABLE dbo.livesIn AS EDGE;
We didn’t even specify a single column, so how could we have defined indexes?
But SQL Server did create an index for us. And if you try to find out the definition of that index, you’ll find that SQL Server actively tries to fool you. You can see the index in the object manager. You can right-click it and ask it to create a CREATE INDEX script. And if you do, you’ll get a CREATE INDEX script for a nonclustered unique index on the $edge_id column – remember, this is the computed column with the JSON representation of the unique edge identifier. However, if you right-click the index and click Properties, you will instead see graph_id listed as the indexed column. And this is in fact confirmed when you query sys.index_columns and sys.columns.
Long story short, when creating any type of graph table (both edge and node tables), SQL Server will always automatically create a unique nonclustered index on the hidden graph_id column. You can see this by querying the system views, or by checking the properties window, but when you script the index, both SSMS and ADS will claim that the index is on $edge_id or $node_id instead.
No other indexes are automatically created. When you specify your own columns in the CREATE TABLE statement, then you can also create indexes on them, both inline in the CREATE TABLE statement, or in separate CREATE INDEX statements. And you can manually create indexes on the visible generated columns ($node_id, $edge_id, $from_id, $to_id). But there is no way to create additional indexes on any of the hidden columns, nor to modify the one index that is automatically created. (You can, however, drop the automatically generated index if you want – but beware that you cannot re-create it, other than by dropping and re-creating the table).
NOTE: The above statements are not fully correct, and therefore very misleading. Please read this post for a full explanation.
For now, I’ll roll with the indexes that I have. Two of the node tables in the sample data script have a primary key declared, so they have clustered indexes. All other tables are heaps, and the only nonclustered indexes are those that SQL Server generated. Let’s return to the execution plan to finally see how SQL Server evaluates this simple graph query.
Table Scan 3
As mentioned, each Nested Loops operator calls its child for a row, until Table Scan 3 is reached. This operator then reads data from the edge table livesIn. But it doesn’t simply read and return all data!
In the screenshot of the properties popup on the right, you can see that this Table Scan has a Predicate property. This is a filter applied to rows, after they are read, but before they are returned to the parent operator. And the text of this property reads, simplified: “from_obj_id = 901578250 AND to_obj_id = 949578421”. The two hardcoded numbers in this expression are the object_id of the Person and City tables respectively.
It is important to understand that, by default, edge tables allow the storage of relationships between all node tables. So even though we know that it is nonsense in the real world, SQL Server allows us to store information about restaurants living in cities, or persons living in persons, or … well, you get the idea. The Predicate property on this Table Scan ensures that only rows are returned that represent a person living in a city, as specified in the MATCH pattern. There are no useful indexes to quickly find those rows, and that’s why the entire heap table has to be scanned.
The columns returned from this Table Scan, as specified in the Output List property, are the from_id and the to_id of the selected edge rows. They specify what specific person lives in what specific city. The from_obj_id and to_obj_id are not needed, we already know that they are 901578250 and 949578421 respectively.
The Actual Number of Rows Read property shows that five rows are returned in total. In this case, that happens to be the entire population of the table, because the sample data includes only “lives in” relationships between persons and cities.
Nested Loops 2 into Index Seek 4
The next steps are quite simple, and for shortness sake I have decided not to include screenshots of all properties I mention. It’s easy to run the code on your own system and check, if you want.
For each of the five rows returned by Table Scan 3, Nested Loops 2 initializes its bottom input and executes it. The Outer References property indicates that the to_id value that was returned from Table Scan 3 is pushed into this bottom input. As mentioned before, this to_id represents the city represented in the livesIn edge row.
On the bottom input of Nested Loops 2 is Index Seek 4. This Index Seek searches in the GRAPH_UNIQUE_INDEX index on the City node table, as you can see in its Object property. And the Seek Predicates property shows which rows it tries to find. Simplified, this property reads: “City.graph_id = livesIn.to_id”. So here, the to_id value that Nested Loops 2 pushed into this Index Seek is used to find the correct row in the index.
Because we’re reading from a nonclustered index, the operator can only return columns that are included in this index. City happens to be one of the node tables for which a primary key was declared, so it has a clustered index on its ID column. This means that the nonclustered index on graph_id includes two columns: graph_id itself, and ID. The Output List property shows that only ID is returned to the Nested Loops.
Nested Loops 2 combines the data returned from Index Seek 4 with the data it has from Table Scan 3, and then returns livesIn.from_id and City.ID to its parent, as can be seen in its Output List property.
Nested Loops 1 into Key Lookup 6
The next step is a very similar pattern. Nested Loops 2 passes rows with the two columns lives_in.from_id and City.ID to Nested Loops 1. That Nested Loops also has an Outer References property. It passes City.ID into its lower input, where Key Lookup 6 can then use it.
A Key Lookup reads a row from a clustered index based on its clustered index key. So it’s no surprise to see the Seek Predicates show that we’re looking for rows with “City.ID = City.ID”. Reading it like this makes it look like a tautology. However, internally, SQL Server knows that the first City.ID refers to data in the clustered index that the Key Lookup reads from (which is in this case the clustered index on the City table, as you can see in the Object property), whereas the second City.ID is the value pushed in from Nested Loops 1, that originated from Index Seek 4. So Key Lookup will find that entry in the index and return its data.
But that’s not all. There is an extra property on this Key Lookup, a Predicate property. Not many people know that this property is supported on the Key Lookup operator. And yet it is. The text of the Predicate here reads, once more in simplified form: “City.name = ‘Redmond’”.
So what the operator actually does is, it first uses the B-tree structure of the clustered index to find the row with City.ID equal to the City.ID value that was passed in. But it then doesn’t automatically return it. It checks the city name, and compares it to Redmond. If it’s a match, then the row is returned. But if the city has any other name, then the Key Lookup simply won’t return a row at all.
And you can indeed see this in the Number of Executions and Actual Number of Rows properties. There were 5 executions, because Nested Loops 1 received 5 rows on its top input. For each of those 5 rows, it initialized and executed the Key Lookup. So Key Lookup fetched the five requested rows from the City table. But then, after checking the city name, it only returned 2 rows, because only those 2 matched the Predicate.
You see no Output List property in the screenshot. If you check the full properties window, you’ll see an empty Output List there. That’s because the Key Lookup doesn’t need to return any data. The optimizer didn’t include the Key Lookup because it needs additional columns from the clustered index. It included it for the sole purpose of filtering on a column that is in the clustered index. That filter is pushed into the Key Lookup itself. Once a row passes the filter, it is sufficient to return an empty row to Nested Loops, to indicate a match. For rows that do not pass the filter, the end of data signal is returned instead.
Nested Loops 1 does an Inner Join operation, as you can see from its Logical Operation. So if it receives no matching rows from the Key Lookup, it will discard the data from its top input as well. When it does receive a row from Key Lookup (even when it’s an empty zero-column row), then it will return data to its parent. In this case, the data returned is only the livesIn.from_id column. The City.ID column is no longer needed in the rest of the execution plan.
Nested Loops 0 into Clustered Index Scan 7
You might be tempted to expect yet another occurrence of the same behaviour from the last remaining Nested Loops operator, the one with Node ID 0. But you would be wrong. If you check its properties, you’ll see no Outer References property in this case. Instead, there is now a Predicate property.
Instead of pushing a value from the top input into the lower input to make it return only the required rows, we now have a lower input that, on every execution, returns all rows that might be a match, and the Nested Loops tests to see which are and rejects the rest.
And that’s also why there is no seek on the lower input, but a scan. A Clustered Index Scan, in this case. For every row on the top input, 2 in this case, the bottom input is initialized and executed, returning all 5 rows from the Person table for each of those 2 executions.
The Output List of the Clustered Index Scan shows that it returns two columns: the hidden graph_id column, and the Name column. The Nested Loops then compares the graph_id received to the value it received from Nested Loops 1 in livesIn.from_id. If they are the same, there is a match, and the value in Person.City is returned to the client. Otherwise, the row returned from the bottom input is ignored.
Because of the automatically generated unique nonclustered index on graph_id in the Person table, we know that, after finding a match, there can be no further matches. However, the execution plan that is used does not use this knowledge. It will continue to read and check the rest of the rows from the bottom input after finding a match. There are tricks that the optimizer can use to prevent this, which would save a bit of work, but those tricks do introduce their own overhead. Based on the estimated amount of data to process, the optimizer has in this case decided that this execution plan is the cheaper of the options.
By looking at the various actions in the execution plan, we can now reconstruct the query. Why would we do that, when we already typed the query ourselves? Well, in this case it gives insight in how a condition such as “MATCH(Person-(livesIn)->City)” is translated internally.
The execution plan started by reading the livesIn nodes that connect a Person to a City, based on the from_obj_id and to_obj_id columns. It then joined to the City node table, using the internal columns graph_id and to_id, in order to filter on City.name = ‘Redmond’. And then it joined to Person in order to retrieve the Person.name column.
In query form, we could write this as follows:
SELECT Person.name FROM dbo.livesIn INNER JOIN dbo.City ON City.graph_id = livesIn.to_id AND City.name = 'Redmond' INNER JOIN dbo.Person ON Person.graph_id = livesIn.from_id WHERE livesIn.from_obj_id = OBJECT_ID ('dbo.Person') AND livesIn.to_obj_id = OBJECT_ID ('dbo.City');
Or, if we want to rewrite this in a more logical order, it would look like this query (which is exactly the same, just reordered):
SELECT Person.name FROM dbo.City INNER JOIN dbo.livesIn ON livesIn.to_id = City.graph_id AND livesIn.to_obj_id = OBJECT_ID ('dbo.City') INNER JOIN dbo.Person ON livesIn.from_id = Person.graph_id AND livesIn.from_obj_id = OBJECT_ID ('dbo.Person') WHERE City.name = 'Redmond';
This query will not run. You’ll get an error because you are not allowed to access the hidden internal columns directly. But if you could somehow get SQL Server to run this, it would return the same results as the original query, and most likely use the exact same execution plan.
However, there is yet another alternative way to rewrite the query, by using the internal columns that we DO have access to: the JSON representations of the graph_id, from_id, from_obj_id, to_id, and to_obj_id columns. And in that case, the query now looks like this:
SELECT Person.name FROM dbo.City INNER JOIN dbo.livesIn ON livesIn.$to_id = City.$node_id INNER JOIN dbo.Person ON livesIn.$from_id = Person.$node_id WHERE City.name = 'Redmond';
This query does indeed execute, and it returns the expected results. And I was relieved to see in the execution plan that it does not actually use the computed JSON columns in the execution plan in the way they are specified in the join condition. Once more, the actual joins are based on the internal columns, graph_id, to_id, to_obj_id, from_id, and from_obj_id.
The execution plan is mostly the same, but not entirely identical:
The right-hand side of this execution plan is identical to what we saw before. But in this case, the optimizer decided to implement the join to the Person table by first doing an Index Seek on the automatically generated GRAPH_UNIQUE_INDEX index to find the clustered index key of matching rows, followed by a Key Lookup to fetch the required data from the clustered index. Apparently, the slightly different way to phrase what is effectively the same query, affected SQL Server’s internal costing estimates, so that now a different choice was made.
The query I used as an example is the most basic form of graph query. The same syntax can be expanded to do more complex queries. For instance, here is how we could change the query to find not only the people who live in Redmond, but also all the restaurants they like:
SELECT Person.name, Restaurant.name FROM dbo.City, dbo.livesIn, dbo.Person, dbo.likes, dbo.Restaurant WHERE MATCH(Restaurant<-(likes)-Person-(livesIn)->City) AND City.name = 'Redmond';
As you see, you can have multiple connections in a single MATCH statement. They can be the same or different directions. A syntactically equivalent way to write the same query would be:
SELECT Person.name, Restaurant.name FROM dbo.City, dbo.livesIn, dbo.Person, dbo.likes, dbo.Restaurant WHERE MATCH(Restaurant<-(likes)-Person AND Person-(livesIn)->City) AND City.name = 'Redmond';
SELECT Person.name, Restaurant.name FROM dbo.City, dbo.livesIn, dbo.Person, dbo.likes, dbo.Restaurant WHERE MATCH(Person-(likes)->Restaurant AND Person-(livesIn)->City) AND City.name = 'Redmond';
All of those return the same result.
I have decided not to show the execution plan. You can request it yourself, if you are interested. It’s not very interesting, in my opinion. Basically, it does the same as in the execution plan of the first query, but now with an additional join to likes (based on likes.$from_id = livesIn.$from_id), and then to Restaurant (based on Restaurant.$node_id = likes.$to_id). And note that I have used the JSON columns in the previous sentence for easy understandability, but the operators in the execution plan in reality of course still work on the hidden columns.
The special syntax of the MATCH function, used to join node and edge tables in an almost visual representation, is internally translated to regular joins on the internal columns that we explored in the previous part of this series. That version of the SQL is then optimized into an execution plan. No special operators are used; just by using those internal columns, SQL Server can achieve the requested results with just standard operators.
In the next part of this series, I plan to look at the SHORTEST_PATH function, that can be used to make SQL Server search a graph iteratively or recursively to find the shortest possible path from one node to another. After that, I want to take a more in-depth look at the default indexes created for graph tables, their performance impact, and possible ways to improve performance for these queries.