SQL Graph indexing – I stand corrected

SQL Graph indexing – I stand corrected

I made a mistake in my last blog post. This mistake was pointed out to me, in private, by a Microsoft Product Manager, with a request to correct that blog post. Because his correction gave me a lot of additional information, I decided to even write a full post about it.

Graph tables and indexes

In my post, I pointed out that SQL Server automatically creates a unique nonclustered index on the (internal and hidden) column graph_id, that you can delete, but can’t modify. I added that users cannot specify indexes on that column, nor on the other internal and hidden columns that exist in edge tables (from_obj_id, from_id, to_obj_id, and to_id). These statements are factually correct, but misleading.

I then concluded from those facts that once you delete the automatically generated index, you can not re-create it. This is actually incorrect.

In reality, you have far more control over the indexes on graph tables than I assumed. Unfortunately, this is not at all clear from the existing documentation, which is why it is very easy to jump to the wrong conclusions – as I have proven!

Relevance of the unique index

For node tables, the automatically created unique index is very important. Without it, there would be ways for users (e.g.  by using BULK INSERT to insert data taken from an incorrectly populated file) to insert multiple nodes with the same graph_id in a node table. That would make all references to those nodes ambiguous. Basically, the graph could be considered corrupted. It makes sense that Microsoft automatically creates a unique index on graph_id in node tables, so that modifications that corrupt the graph are automatically rejected.

That said, I still maintain that there is no such practical need for the corresponding index on graph_id in edge tables. Nothing in a graph database ever refers to an edge by its graph_id, nor by its $edge_id (which is derived from the graph_id). A duplicate in this column would not corrupt the graph.

In SQL Server 2017, the first version that supports graph tables, it was not even possible to drop this auto-created index. At least not without first creating a new index to ensure that the graph_id (and/or $node_id / $edge_id) column is unique. In SQL Server 2019, this protection was dropped. It is now always possible to drop this index. However, for the reason mentioned above, I would personally not do this for a node table without creating a suitable replacement. For an edge table, I don’t really see any reason to have an index on graph_id or $edge_id, so here I believe it’s safe to drop the automatically created index.

Controlling the index

Once the index is created, there is really no way to change it. But that is true for all indexes. SQL Server supports an ALTER INDEX statement, but despite its many options, it can’t actually change the index type, indexed columns, or included columns. Indexes on graph tables are no exception.

However, what I did not realize when writing my last blog post, is that you can actually exert a lot of control over these indexes when creating the tables. For instance, here is how I could create a node table for persons and provide a more friendly name for the unique index, without any actual changes:

CREATE TABLE dbo.Person
    (ID   integer PRIMARY KEY,
     name varchar(100),
     INDEX ix_Person_Graph_Unique UNIQUE NONCLUSTERED ($node_id)) AS NODE;

And here is how I could define an edge table that elevates the unique index to primary key status, which by default creates a clustered unique index instead of a nonclustered one; without providing a friendly name:

CREATE TABLE dbo.friendOf (PRIMARY KEY ($edge_id)) AS EDGE;

Similarly, if we still want the index to serve as the primary key, but now with a nonclustered index and with a friendly name, we would change this to:

CREATE TABLE dbo.friendOf (CONSTRAINT PK_friendOf 
                               PRIMARY KEY NONCLUSTERED ($edge_id)) AS EDGE;

In all these cases, and many more that you can explore yourself, SQL Server recognizes that the CREATE TABLE statement already specifies an index (explicitly through the INDEX keyword, or implied through a PRIMARY KEY or UNIQUE constraint) on the $node_id or $edge_id column. That means that there is no need to add an extra index on the graph_id column. SQL Server only auto-creates this index if the table would otherwise allow duplicates in graph_id and/or $node_id/$edge_id.

But are they the same?

You may recall that the automatically generated index is always on the graph_id column. The examples above all specify an index on either the $node_id or the $edge_id columns. As I described here, these columns are computed columns, that represent the graph_id, combined with two constant values (the object_id of the table and the type, node or edge), in JSON format. So at first sight, the alternatives above do not seem equivalent at all! An index on an nvarchar(1000) computed column seems to be far less efficient than one on a simple bigint column.

However, there is some additional special magic going on under the covers. When a CREATE TABLE statement for a graph table includes a PRIMARY KEY, UNIQUE, or INDEX specification on $node_id or $edge_id, the engine internally automatically translates this, and instead creates the index on the graph_id column. Because graph_id is the only variable component in the computation of $node_id and $edge_id, this replacement index is functionally the same, but far more efficient. And, indeed, totally equivalent to the auto-created index that you would get if you don’t explicitly specify the index.

You can see that this is indeed the case by creating two node tables (or two edge tables) with the same schema, except one uses an explicitly created index on $node_id and the other doesn’t specify any index, so that the system auto-creates an index on graph_id. If you then query sys.index_columns, you will see that the two indexes are actually identical, except for the differences that you specifically requested.

CREATE TABLE dbo.Node1
    (ID int) AS NODE;
CREATE TABLE dbo.Node2
    (ID int,
     INDEX ix_Node2 UNIQUE ($node_id)) AS NODE;

SELECT     OBJECT_NAME (i.object_id) AS table_name,
           i.name                    AS index_name,
           ic.index_id,
           ic.key_ordinal,
           c.name                    AS column_name,
           c.graph_type_desc
FROM       sys.indexes       AS i
INNER JOIN sys.index_columns AS ic
   ON      ic.object_id = i.object_id
   AND     ic.index_id  = i.index_id
INNER JOIN sys.columns       AS c
   ON      c.object_id  = ic.object_id
   AND     c.column_id  = ic.column_id
WHERE      OBJECT_NAME (i.object_id) IN ('Node1', 'Node2');

Here are the results when I ran this on my system. (When you run it, the hex strings in the generated object names will of course be different, but the rest should be the same)

So even though the syntax used in the CREATE TABLE statement suggests that I created an index on $node_id, SQL Server internally translated this to the equivalent index on graph_id, and then created that index instead of, rather than in addition to, the default auto-created index.

Modifying the index

While it is preferred to create a table with the correct indexes right away, there are sometimes situations where this is not possible. The table already exists. There are performance issues. You investigate the problem queries. You check their execution plans. And then you realize that the indexing is far from optimal. You want to change the indexes.

As already mentioned, literally changing the indexes is not an option. That is not a limitation for graph-specific indexes, but for all indexes. There simply is no support for changing an existing index in SQL Server. The only way to effectively change an index, is to drop and recreate it, with the improved specification.

We have already seen that any index specified on $node_id or $edge_id in the CREATE TABLE statement is automatically translated to the equivalent (but far more efficient) index on graph_id. Does that same trick also work for the CREATE INDEX statement?

Yes, it does! In the example below, I first create an edge table without specifying any index, so that the default index will be auto-created. I query the system objects to see its name, and to double check that it is indeed on the graph_id column. I then drop this index (you’ll need to change the name if you follow along with the examples!), and then create a new index, this time as a clustered index, while using the $edge_id column to effectively create an index on graph_id, as the final query of the system objects shows:

CREATE TABLE dbo.Edge1 AS EDGE;

SELECT     OBJECT_NAME (i.object_id) AS table_name,
           i.name                    AS index_name,
           ic.index_id,
           ic.key_ordinal,
           c.name                    AS column_name,
           c.graph_type_desc
FROM       sys.indexes       AS i
INNER JOIN sys.index_columns AS ic
   ON      ic.object_id = i.object_id
   AND     ic.index_id  = i.index_id
INNER JOIN sys.columns       AS c
   ON      c.object_id  = ic.object_id
   AND     c.column_id  = ic.column_id
WHERE      OBJECT_NAME (i.object_id) = 'Edge1';

DROP INDEX GRAPH_UNIQUE_INDEX_ED450FBFB2B3489F8E4466CB451C2AAD ON dbo.Edge1;
CREATE UNIQUE CLUSTERED INDEX ix_Edge1 ON dbo.Edge1 ($edge_id);

SELECT     OBJECT_NAME (i.object_id) AS table_name,
           i.name                    AS index_name,
           ic.index_id,
           ic.key_ordinal,
           c.name                    AS column_name,
           c.graph_type_desc
FROM       sys.indexes       AS i
INNER JOIN sys.index_columns AS ic
   ON      ic.object_id = i.object_id
   AND     ic.index_id  = i.index_id
INNER JOIN sys.columns       AS c
   ON      c.object_id  = ic.object_id
   AND     c.column_id  = ic.column_id
WHERE      OBJECT_NAME (i.object_id) = 'Edge1';

The name of the index changed (of course). And the index_id changed from 2 to 1, because index_id 1 is always used for the clustered index. But otherwise, the results from the query on sys.index_columns is unchanged. We have effectively changed the index from a nonclustered unique index on graph_id with a randomly generated name, to an equivalent but named and clustered unique index on that same column.

If you try to run this same code on a SQL Server 2017 instance, you will get an error on the DROP INDEX statement, because this version does not allow you to have no unique indexes on the $node_id or $edge_id (or effectively the graph_id) column. However, if you swap the CREATE INDEX and DROP INDEX statements, then the code will run, because now there is already another unique index on $edge_id when you drop the auto-created index.

Additional indexes

We now know that we can specify indexes on $node_id and/or $edge_id, to effectively create them on graph_id. We used that above in combination with a DROP INDEX for the original index, to effectively modify the index. But we can also use the same syntax to just create multiple indexes on the same column. Here is a very contrived example:

CREATE TABLE dbo.Person
    (ID   integer PRIMARY KEY,
     name varchar(100),
     INDEX ix_Person_graph_id UNIQUE ($node_id),
     INDEX ix_Person_name_graph_id (name, $node_id)) AS NODE;

CREATE UNIQUE INDEX ix_Person_graph_id_include
ON dbo.Person ($node_id)
INCLUDE (name);

While this might be useful in very specific edge cases, I expect that under normal conditions, you would never want to do this. Just one index on the graph_id column will in almost all cases be enough.

However, one type of graph table might actually benefit a lot from additional indexes: edge tables. While the default index on graph_id is, in my opinion, worse than useless, indexes on one or more of the other internal columns (from_obj_id, from_id, to_obj_id, and to_id) could actually prove useful! But, like graph_id, those columns cannot be accessed. However, perhaps the same trick that allowed us to index the graph_id column by specifying $node_id or $edge_id works on the $from_id and $to_id computed columns in edge tables as well?

Let’s give it a try!

CREATE TABLE dbo.friendOf 
         (CONSTRAINT PK_friendOf 
                PRIMARY KEY ($edge_id),
          INDEX ix_friendOf_from ($from_id),
          INDEX ix_friendOf_to ($to_id)) AS EDGE;

SELECT     OBJECT_NAME (i.object_id) AS table_name,
           i.name                    AS index_name,
           ic.index_id,
           ic.key_ordinal,
           c.name                    AS column_name,
           c.graph_type_desc
FROM       sys.indexes       AS i
INNER JOIN sys.index_columns AS ic
   ON      ic.object_id = i.object_id
   AND     ic.index_id  = i.index_id
INNER JOIN sys.columns       AS c
   ON      c.object_id  = ic.object_id
   AND     c.column_id  = ic.column_id
WHERE      OBJECT_NAME (i.object_id) = 'friendOf';

Here is the output on my system:

As you can see, the two additional indexes were indeed internally converted. Instead of using the $from_id and $to_id columns as the indexed columns, as we specified in the CREATE TABLE statement, it uses from_obj_id and from_id, and to_obj_id and to_id respectively.

Best practices

Note that the example above is not necessarily a good way to index this table. The SQL Server Product Manager who mailed me said that he, when presenting on graph tables, usually recommends one index on ($from_id, $to_id) and another index on ($to_id, $from_id) as a generic best practice. However, this advice is not included in any official Microsoft documentation at this time. And I also don’t know whether this is actually the best way to index edge tables in all cases. I plan to take a deeper look at indexing myself in a future post. That said, I do absolutely agree that this generic best practice runs rings around the default indexing that you get when you just create an edge table without specifying indexes!

Another best practice that the Product Manager included in his feedback to me, was to always explicitly declare the unique index on $node_id and $edge_id, rather than have the system auto-create one. This allows you to give the index a friendly name, and it also allows you to specify whether it should be clustered or nonclustered. I totally agree with this advice!

(Although I would personally add one extra step – after creating an edge table with a manually specified index on $edge_id, I would immediately drop that index again, on SQL Server 2019 or above, since I have yet to find a single case where it would actually be useful.)

Conclusion

We cannot specify the hidden internal columns of graph tables when creating indexes. But we can still create indexes on those columns, by specifying the computed columns $node_id, $edge_id, $from_id, and $to_id. This weird (and, unfortunately, barely documented) behaviour allows us to index these hidden internal columns indirectly.

But we are still somewhat limited. An index specified on $from_id is always converted to an index on (from_obj_id, from_id). There is no way to reverse those columns, to create an index on (from_id, from_obj_id). Nor on from_id only, or from_obj_id only. Now I don’t know whether any of those would ever be useful. But if they are, then you can sadly not create them.

All in all, the indexing story for graph tables is far, far better than I thought when I wrote my previous blog post. You just need to know the weird behaviour of the syntax that let’s you create indexes on one column by specifying an index on another.

 

Plansplaining, part 21. SQL Graph (part 2)
T-SQL Tuesday 164 – Optimizing for readability or for performance

Related Posts

No results found.

2 Comments. Leave new

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close