Plansplaining, part 17. Temporal tables (part 2)

Plansplaining, part 17. Temporal tables (part 2)

Welcome to part seventeen of the plansplaining series, where I will continue my look at how temporal tables affect execution plans. The previous post focused on data modifications in temporal tables. Now let’s see what happens if we use temporal logic to our basic retrieval queries. This post focuses on the basics only; do keep in mind that as soon as you want to join data from multiple temporal queries you will run into some specific issues that you need to be aware of; this will be the subject for a later post.

Sample tables

For the code samples in this post, I keep building on the demo tables and sample data from the previous post. If you followed along and have not cleaned up, then you can just keep going. If you need to re-create them, you can use the code below. In that case, do make sure to pause for a bit where the comments indicate it, so that you can (if you wish) stick in the correct timestamps in the sample queries below to get relevant data returned.

CREATE TABLE dbo.Suppliers
    (SupplierID   int          NOT NULL,
     SupplierName varchar(50)  NOT NULL,
     ValidSince   datetime2(7) GENERATED ALWAYS AS ROW START NOT NULL,
     ValidUntil   datetime2(7) GENERATED ALWAYS AS ROW END NOT NULL,
     PERIOD FOR SYSTEM_TIME(ValidSince, ValidUntil),
     CONSTRAINT PK_Suppliers
         PRIMARY KEY (SupplierID))
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SuppliersHistory));

CREATE TABLE dbo.Products
    (ProductCode char(10)     NOT NULL,
     ProductName varchar(50)  NOT NULL,
     SupplierID  int          NULL,
     ValidSince  datetime2(7) GENERATED ALWAYS AS ROW START NOT NULL,
     ValidUntil  datetime2(7) GENERATED ALWAYS AS ROW END NOT NULL,
     PERIOD FOR SYSTEM_TIME(ValidSince, ValidUntil),
     CONSTRAINT PK_Products
         PRIMARY KEY (ProductCode),
     CONSTRAINT FK_Products_Suppliers
         FOREIGN KEY (SupplierID)
         REFERENCES dbo.Suppliers (SupplierID),
     INDEX ix_Products_SupplierID NONCLUSTERED (SupplierID))
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductsHistory));
GO

-- Insert some sample data
INSERT dbo.Suppliers (SupplierID,
                      SupplierName)
VALUES (1, 'Supplier 1'),
       (2, 'Supplier 2'),
       (3, 'Supplier 3'),
       (4, 'Supplier 4'),
       (5, 'Supplier 5'),
       (6, 'Supplier 6');

INSERT dbo.Products (ProductCode,
                     ProductName,
                     SupplierID)
VALUES ('Prod 1', 'Product number 1', 1),
       ('Prod 2', 'Product number 2', 2),
       ('Prod 3', 'Product number 3', 3),
       ('Prod 4', 'Product number 4', 4),
       ('Prod 5', 'Product number 5', 5),
       ('Prod 6', 'Product number 6', 1),
       ('Prod 7', 'Product number 7', 2),
       ('Prod 8', 'Product number 8', 3),
       ('Prod 9', 'Product number 9', 4);
GO

-- Pause here for a minute or so
-- Rename product 4
UPDATE dbo.Products
SET    ProductName = 'New name for ' + ProductName
WHERE  ProductCode = 'Prod 4';
GO

-- Pause here for a minute or so
-- Delete supplier 6
DELETE dbo.Suppliers
WHERE  SupplierName = 'Supplier 6';
GO

Note that the sample data above is severely limited. If you want to make the sample scenarios, queries, and results a bit more interesting, then feel free to simulate some more changes over time. But remember, the focus of this post is not on realistic use cases and credible sample data for temporal tables; I am at this time only interested in how temporal queries work under the cover, to help me better understand their potential performance impact.

Retrieving data with regular queries

When data is in temporal queries, you can keep using the normal queries to retrieve the data, and they will work as normal. Nothing special in the query, and nothing special in the execution plan.

SELECT p.ProductName,
       p.SupplierID
FROM   dbo.Products AS p
WHERE  p.SupplierID = 3;

The query above will simply grab the requested data from the Products table as if it were a normal table. The corresponding history table is not used at all. To understand why this happens it is important to recall that system-versioned temporal tables always store the currently valid version of the row in the table itself; all older, no longer valid versions are stored in the accompanying history table. So the query above, which does not use any specific temporal logic, means “I don’t care about the history, I want the data as it is now”. Due to how temporal tables are designed, SQL Server only has to query the actual table for this.

Since there is nothing special or interesting about this, let’s move on to queries that do use special logic in the query to retrieve older versions of the data.

Retrieving data with temporal queries

Specifically for temporal queries, you can use the “FOR SYSTEM_TIME” keyword to tell SQL Server you want to return older, historic data. This keyword goes immediately after the name of the table and is then followed by a specification of the exact moment or period in time that you want data for. This specification can take one of five forms, and below we’ll examine the corresponding execution plan for each of those forms.

FOR SYSTEM_TIME AS OF <date_time>

The AS OF specification is the form I expect to be used the most. This syntax allows you to view your data in the state it was at the specified moment in time. Imagine an employee in the finance department on the phone with a customer who wants to contest their invoice. For that employee, looking at the current prices of items won’t be useful. They need to see the prices as they were at the time the invoice was created. The AS OF specification for a temporal query allows exactly that.

DECLARE @AsOf datetime2 = '2021-04-03 15:45';
SELECT p.ProductName,
       p.SupplierID
FROM   dbo.Products FOR SYSTEM_TIME AS OF @AsOf AS p;

Many examples I see for temporal tables put a hardcoded date and time in the FOR SYSTEM_TIME specification. While that is allowed, it is in actual code much more useful to use parameters, variables (as in the demo code above), or even expressions. To make the code return data on your system, simply change the value assigned to the @AsOf variable to a date and time when there was data in your sample table. And when doing so, do keep in mind that both the datetime values stored in the ValidSince and ValidUntil columns of all temporal tables as well as the datetime values used in any FOR SYSTEM_TIME specification must always be specified in the UTC time zone, not in the local time zone of your server or your workstation.

Running the query above on my system produces an execution plan with run-time statistics as shown above. You will probably notice the large difference between estimated and actual number of rows produced by each operator; the reason for that is that @AsOf is a variable that cannot be sniffed. There are ways, such as putting the date/time hardcoded in the query, adding OPTION(RECOMPILE), or embedding the query in a stored procedure, that will allow SQL Server to get better estimates for the number of matching rows, but they can cause issues of their own. Fixing potential issues is beyond the scope of this post, I just want to make sure you are aware of the potential for bad estimates when running temporal queries.

Clustered Index Scan

To understand how exactly the execution plan returns only the rows that were valid on the specified date and time, we need to look at the properties of some operators. To the right you see the properties of the top right operator, the Clustered Index Scan on the Products table (which is of course the table with the currently valid data, some of which might have been valid already at the specified moment).

The temporal magic for this operator is in the Predicate property, where two conditions are verified. The ValidSince value in the column has to be less than or equal to the specified as-of date. That condition makes sense: if the ValidSince value is higher than the as-of date, the row was inserted or modified after the moment in time we want the data for and this row should not be included in the results.

The second condition is more surprising. Apparently, SQL Server also tests that the ValidUntil value is higher than the specified as-of date. This is weird for the table with the actual data, because this table stores data valid until “eternity”. That means there is no real need to test the ValidUntil date in the actual table. If the ValidSince is before the specified as-of value, the row was valid at that moment; the mere fact that the row is still in the actual table suffices to know it has not changed since and is hence still valid.

Even if the query is run with an as-of date in the future (which, for the record, is not an intended use case for temporal tables), we still expect it to return all current data. Temporal tables are not AI based models to predict the future. They track the past and the current situation, and the current situation is assumed to be valid “from now until eternity” until someone runs a data modification statement. So even when the query specifies an as-if moment in the future, we still want all actual rows to be returned. This condition is useless.

And yet it’s there. This even results in a weird behaviour (potentially even a bug?) when you specify the as-of moment as the exact maximum value that the ValidUntil column can store. What that value is depends on the precision specified for that column when creating the table. In my demo I used datetime2(7), so the maximum value is ‘9999-12-31 23:59:59.9999999’. A query with any as-of moment lower than that, up to and including ‘9999-12-31 23:59:59.9999998’, will return all current rows. But use ‘9999-12-31 23:59:59.9999999’ and now you suddenly get an empty result set, because this is the value stored in the ValidUntil column. The values are equal and the Predicate tests for higher than, not higher than or equal to.

In the case of my sample data, almost all rows in the actual table are still valid. Using a scan to find these is pretty effective. But that might not always be the case. If you have a large catalogue of products and a high rate of change, you might see this scan operator read through thousands, perhaps even millions of rows and not find more than a handful that qualify. We’ll discuss this and other performance considerations later.

Clustered Index Seek

The bottom right operator is in this execution plan a Clustered Index Seek, targeting the history table where all invalidated rows are stored. And here, too, we need to check the properties to see the specific logic for the temporal aspect of the query.

The first thing we notice in the properties of this Clustered Index Seek is the Predicate property, with the same test on ValidSince being less than or equal to the as-of specification. I already explained why this condition makes sense when we first encountered it on the actual table. Here the same condition is needed for the same reason.

There is no test on ValidUntil in the Predicate property, though. Not because it’s not needed. In fact, where this test is technically not needed for the actual table, it most definitely IS needed when accessing the history table. No, the reason it’s not in the Predicate property is because this operator is a seek operator, not a scan operator, and the condition on ValidUntil is in this case the Seek Predicates property.

People tend to be happy when they see seek operators in execution plans, because they incorrectly assume seeks are faster than scans. In this case, it is important to be aware that the seek is based on only the ValidUntil column. If the specified as-of value is fairly recent and most of the data in the history table has a ValidUntil that is older than this as-of value, then the seek will indeed be efficient in this case. However, if you use the temporal query to go far back into history, the seek will still need to process lots of rows.

Concatenation

For completeness sake, I’ll mention that the execution plan has one more operator: Concatenation. The function of this operator in an execution plan is equivalent to the function of UNION ALL in a query. So in this case, Concatenation first returns all rows from the first input, the Clustered Index Scan on the Products table; once that source reaches the end of its data it switches to the next source, the Clustered Index Seek on ProductsHistory, and returns all rows it can get from there as well.

Since the query does not specify an ORDER BY clause, the rows are returned to the screen in the order in which they are processed. This means that you will first receive all rows that have not been changed since the as-of moment, as they are still in the actual table; and because they come from a scan these rows are returned in order of ascending ProductCode. After that follow all rows from the history table, products that have been changed at least one since the as-of moment. And these are returned in order of ascending ValidUntil … in other words, they are ordered not by when they were changed but by whenever the next change for that product was.

I don’t think this specific order is ever useful, so I recommend always using an ORDER BY clause on this type of temporal table, unless you really don’t care about their order.

Additional filtering

The example above simply queried the entire table for data valid at a specific moment in time. That may sometimes be useful. At other times, you might be interested in a point in time for one specific product. Or for some specific products. Since this is SQL, that is easy to accomplish by just adding the appropriate WHERE clause, as in the next example.

DECLARE @AsOf        datetime2 = '2021-04-03T15:45:00',
        @ProductCode char(10)  = 'Prod 4';
SELECT p.ProductName,
       p.SupplierID
FROM   dbo.Products FOR SYSTEM_TIME AS OF @AsOf AS p
WHERE  p.ProductCode = @ProductCode;

This is of course a very simple example. In a more realistic example, where more columns are available in the Products table, one could also use a WHERE clause to find for example all yellow products. But for this post, this basic example with a filter on the primary key suffices.

Adding a WHERE clause to the query obviously changes the execution plan. The most striking change is that the Products table is now read using a Clustered Index Seek instead of the Clustered Index Scan we saw before. This change makes sense. The clustered index used here corresponds to the primary key, and that’s exactly the condition used in the WHERE clause. This part of the query is now very efficient. It uses its Seek Predicates property to pinpoint the row for the specified product, then if it exists verifies its ValidSince and ValidUntil dates to determine whether or not to return it.

The second input to the Concatenation is also a Clustered Index Seek. But before raising our glass to an efficient execution plan, let’s look at how its properties look now.

This might not be what you were hoping for! The Seek Predicates property is still unchanged. This seek operator will read through all rows that have a ValidUntil value larger than the specified as-of moment. And then while reading those rows, it will apply the Predicate that still has the familiar test on ValidSince, and now also has a test on the ProductCode.

This means that if your Products table has millions of rows, and each of those rows has had dozens of changes since the specified as-of time, then this seek operator will read through all the dozens of millions of history rows that meet the Seek Predicates just to find that single one that it needs for the specified ProductCode. (Or even none at all, if the current row was already valid at the as-of moment, because even though the upper branch already returned a row in that case, the execution plan will still search through the history table as well even though it’s impossible for two rows to be valid at the same as-of moment for the same ProductCode.

Some performance considerations

As pointed out in the explanations above, some of the operators used in these execution plans could be suboptimal, depending on the specified as-of date and the data distribution. But this is not the optimizer’s fault. In a way, it can be considered our fault. But only if we understand the process and our role in it.

When a query is submitted that uses FOR SYSTEM_TIME AS OF <date_time>, you can conceptually imagine the process as a two-stop process. (And though not explicitly documented that way, I suspect it is even actually implemented that way). The first step, probably implemented in the parser, is to replace a temporal condition with normal SQL on the base tables. Let’s for example look at the query from the last example:

SELECT p.ProductName,
       p.SupplierID
FROM   dbo.Products FOR SYSTEM_TIME AS OF @AsOf AS p
WHERE  p.ProductCode = @ProductCode;

After the replacement, the result will probably be as follows:

SELECT p.ProductName,
       p.SupplierID
FROM  (SELECT *
       FROM   dbo.Products
       WHERE  ValidSince <= @AsOf
       AND    ValidUntil >  @AsOf -- Not needed; bug?
       UNION ALL
       SELECT *
       FROM   dbo.ProductsHistory
       WHERE  ValidSince <= @AsOf
       AND    ValidUntil >  @AsOf) AS p
WHERE  p.ProductCode = @ProductCode;

Do note that it is quite likely that the actual implementation of this replacement does not work on the original query text, but on the “parse tree”, a symbolic representation of the query that is produced by the parser as input for the optimizer. However, the parse tree is close enough to the original query that you can understand what happens by looking at the equivalent change in T-SQL.

The second step is then the Query Optimizer running its normal optimization process to find the best access patterns for the query, based on available indexes, known data distribution in the statistics, and if applicable sniffed value for the @AsOf parameter. In a query as simple as this one, it’s not hard to predict that the final execution plan will be somewhat equivalent to this query:

SELECT ProductName,
       SupplierID
FROM   dbo.Products
WHERE  ValidSince  <= @AsOf
AND    ValidUntil  >  @AsOf -- Not needed; bug?
AND    ProductCode =  @ProductCode
UNION ALL
SELECT ProductName,
       SupplierID
FROM   dbo.ProductsHistory
WHERE  ValidSince  <= @AsOf
AND    ValidUntil  >  @AsOf
AND    ProductCode =  @ProductCode;

For choosing data access patterns, the collection of available indexes is rather limited. I created the Products table with a primary key, resulting in a clustered index on ProductCode, and a nonclustered index on SupplierID. For the example above, a seek on the clustered index is the obvious choice. Remove the filter on ProductCode (as in the first example in this post) and a clustered index scan remains as the only viable option given the indexes available. If there had been an index on ValidSince, the optimizer would have considered a seek on that index as an alternative option, and either chosen or rejected it based on estimated cost.

On the ProductsHistory table, I created no indexes at all. I supplied a name for the history table when creating the Products table and let SQL Server handle the rest. You might expect that SQL Server would in such a case create similar indexes on the history table as it does on the actual table. You would be wrong. If you let SQL Server create the history table, as I have done, the table will be created with just one index: a clustered index on (ValidUntil, ValidSince). This index is a good choice for avoiding page splits if there are high volume inserts in this table. Every row inserted will have ValidUntil equal to the current date and time (or rather, the date and time the transaction started), so rows will go to the end of the table and not in the middle where they would cause page splits.

But as you have seen in the examples in this post, that default indexing choice might not be best for retrieving the data. Luckily, you are not bound to this choice. You can add extra indexes to the history table after it is created, and you are also free to change or even drop the clustered index that was created by default. Or alternatively, you can manually create the history table, indexed exactly the way you want it, instead of having SQL Server create it for you.

Which indexes are best depends really on your data, your rate of change, and your typical query patterns. Looking at the queries in this example, the first query might or might not have better performance if the order of columns in the clustered index is reversed. The second query would always give much better performance if there is an index with ProductCode as the leading column, either with or without ValidSince and/or ValidUntil added, and either as the clustered or as a nonclustered index.

Choosing indexes is never easy. But now at least you know that a query that uses FOR SYSTEM_TIME AS OF <date_time> is effectively translated into a query that access both the current and the history table, with inequality conditions on the ValidSince and ValidUntil columns in addition to whatever other conditions are in the original query. The optimizer can then give you a great execution plan if you have planned ahead and created great indexes. Which ones those are depends on too many factors to cover here.

One final notice: when designing and creating indexes for a history table, do keep in mind that they should never be defined as unique. Any data that is unique in the actual table might occur multiple times in the history table, if the row was modified a few times. Don’t think that adding ValidSince and ValidUntil to the index columns can make the index unique again. If a row changes multiple times within a single transaction, multiple copies of it can be added to the history table. And they will all be timestamped with the start time of the transaction, not with the actual wall clock time of the change. So you can get multiple rows all with ValidSince and ValidUntil equal to the same timestamp, and any index created as a unique index would cause that transaction to fail.

Conclusion

By using special syntax in the query, we can retrieve data from a temporal table that represents the situation at a certain point in the past. This is not black magic. It’s simply a syntax construction that, under the covers, gets translated to a query on both the “actual” table (with the current data) and the “history” table (with older versions of all rows) that uses a predicate on the start and end timestamps in those tables to get the correct version of each row.

As with all queries, the optimizer tries to generate an execution plan that finds the data efficiently, but it is bound by the indexes it has available. The default index that is used when you let SQL Server generate the history table of a temporal table is optimized to minimize the overhead of adding old data to the history table when rows are changed or removed. For fast retrieval, other indexes might be a better choice. Only you can determine the ideal set of indexes for your workload, because only you know that workload and only you know which queries need to run fast and for which queries slower performance is acceptable.

Do not immediately jump into your systems to review the index structure! There are more ways to query a temporal table than we showed here. I will cover the other temporal constructions in the next blog in this series. Reading that might change your choices on the ideal indexing strategy for your temporal tables.

As always, please do let me know if there is any execution plan related topic you want to see explained in a future plansplaining post and I’ll add it to my list!

T-SQL Tuesday #136 – Float does NOT suck!
T-SQL Tuesday #137 – Use notebooks every day? No way!

Related Posts

3 Comments. Leave new

  • Thanks for some fresh enlightenment, Hugo!

    As we’ve journeyed further with system-versioned tables we we’ve noted some of the same performance in history tables. In some cases we are migrating older audit tables with lots of valid history data so we have to pay closer attention to the history performance. Thanks for highlighting the details. I’m certain everyone will see the value.

    A couple of typos I see. In one case ProductCode is spelled ProducrtCode

    Also in “If you let SQL Server create the history table, as I have done, the index will be created with just one index:” I think you meant “…the TABLE will be created with…”

    Reply
    • Hugo Kornelis
      April 7, 2021 15:22

      Thanks for the comment, Bryant!
      If you use the other temporal query constructions too, you’ll definitely want to check the next post in this series where I plan to go over all remaining options for FOR SYSTEM_TIME. After that I plan to focus on queries that join historic data from multiple sources.

      Also thanks for pointing out those nasty typos. I have immediately edited my post to correct them!

      Reply
  • […] Hugo Kornelis digs into execution plans when retrieving data from temporal tables: […]

    Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Menu
%d bloggers like this:

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close