When a query is slow, the cause is often inefficient access to the data. So our tuning work very frequently comes down to figuring out how data is read, and then massaging our queries or database structures to get SQL Server to access the data in a more efficient way.
So we look at scans, seeks, and lookups. We know that scans are good when we access most of the data, or, in the case of an ordered scan, when we want to prevent having to sort the data. We know that seeks are preferred when there is a filter in the query. And we know that lookups represent a good tradeoff between better performance and too many indexes, but only if the filter is highly selective.
All of the above is true. And all of it is highly generalized. And hence, often, not true enough to be actually useful.
SQL Server has, over the years, grown to support a staggering number of different storage structures. We all know about heaps, and about clustered and nonclustered B-tree indexes. But there is so much more. Columnstore indexes. Memory-optimized indexes. Specialized index structures to support specific data types, such as XML, JSON, or vectors. And more.
At a high level, the generic descriptions of scans, seeks, and lookups always apply. But you will only really understand the impact of each of these operators on a specific storage object if you know the type of storage structure … and know the details of how that storage structure is implemented.
This is the first of a series of blog posts where I plan to describe all storage structures that SQL Server currently supports, and explain how that impacts scans, seeks, and lookups.
For some of you, this may already be familiar. But a good refresher never hurts! For others, it may all be new. Hopefully it will make you understand better what goes on in the execution plans of your badly behaving queries!
On-disk rowstore
Until the release of SQL Server 2012, there was only one way to store tables. It did not have a name back then. That method, which is still the default, is now called on-disk rowstore data, to distinguish it from columnstore indexes and memory-optimized tables.
On-disk rowstore data uses three different structure types. Two to store the actual table data: heaps and clustered indexes. And one, nonclustered indexes, to store all supporting indexes. All three of these store data in 8K pages. The number of rows per page depends on how many bytes each row takes.
(Even in SQL Server 2008, there were already a few extra types of storage structures for specialized purposes: XML indexes, spatial indexes, and full-text indexes. These will be described in a later blog post.)
Heap
For every on-disk table, its data is stored in either a clustered index, or a heap. A clustered index can be explicitly created, or SQL Server can automatically create one for you when you define a PRIMARY KEY or UNIQUE constraint. If none of these methods is used, in other words, if a table has no clustered index, then that table is automatically a heap, also called a heap table.
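To illustrate the difference, here is a small sketch. All table and column names below are made up for this example:

```sql
-- No PRIMARY KEY, no UNIQUE constraint, no clustered index:
-- this table is stored as a heap.
CREATE TABLE dbo.AppLog
    (LogDate datetime2(3)  NOT NULL,
     Message nvarchar(500) NOT NULL);

-- The PRIMARY KEY constraint creates a clustered index by default,
-- so this table is stored as a clustered index.
CREATE TABLE dbo.Customers
    (CustomerID int           NOT NULL PRIMARY KEY,
     FullName   nvarchar(100) NOT NULL);
```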
A good visualization of a heap is to take a book. Use a very sharp knife to cut the pages from the cover. Carefully hold the stack of individual pages in your hands. Stand in the middle of the room. Then throw the stack of paper as high as possible in the air, and wait until all the pages have fallen to the floor. You are now looking at a heap of pages from the book.
Generally speaking, that heap of pages has a lot of similarity to heaps in SQL Server. For quickly finding data, this structure is absolutely useless. While there are some cases where a heap is a good choice, they are very rare. When I review a data design, I will only accept a heap if the developer who proposes the design can give me a very good reason why that is the right choice, and why it is really better in this case than a clustered index.
When data is modified, we might run into a situation where the new data no longer fits in the available space on its page. In that case, the data for that row is moved to another page, where there still is enough space available. However, there may be other structures that store the location of the row. Microsoft did not want to take the performance hit of searching the row in all those structures. So instead, it replaced the row data in the original location with a forwarding pointer – a pointer to its new location. Note that, if the row later has to move again, it will not form a chain of forwarding pointers. In such a case, the forwarding pointer in the original location will be modified to point to the new location. However, once a forwarding pointer exists, it will never go away, not even when the data is later modified and would fit in its original location again. The only way to remove forwarding pointers is to rebuild the table.
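You can check how many forwarded records a heap currently has with sys.dm_db_index_physical_stats in DETAILED mode, and then rebuild the table to get rid of them. The table name below is just an example:

```sql
-- Count forwarded records in the (hypothetical) dbo.AppLog heap.
-- The forwarded_record_count column is only populated in DETAILED mode.
SELECT forwarded_record_count
FROM   sys.dm_db_index_physical_stats
           (DB_ID(), OBJECT_ID(N'dbo.AppLog'), 0, NULL, 'DETAILED')
WHERE  index_id = 0;        -- index_id 0 = the heap itself

-- Rebuilding the table is the only way to remove forwarding pointers.
ALTER TABLE dbo.AppLog REBUILD;
```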
Because a heap has no structure at all, it is not possible to seek for data. That leaves two options. Reading through all pages, using a Table Scan. Or looking up a specific row, based on its location (found in a nonclustered index), using an RID Lookup.
A Table Scan accesses the data in a heap through the Index Allocation Map. This will read the pages in allocation order, which is close to the physical order on disk. In the age of spinning disk, this was a small alleviation of the pain of heaps. With SSDs, even that small solace is gone.
In practice, though, even that small benefit of scanning a heap was hardly ever actually achieved, unless a table was fully read only. The reason is that, to protect against incorrect results caused by concurrent updates, SQL Server needs to follow each forwarding pointer as soon as it encounters it in a scan. That is then an out-of-order read, which ruins the fast sequential access of a never-updated heap.
And please, do not think that a table with “just a few modifications” is hardly affected. The effect of even a relatively low modification rate can be immense! I already wrote about this in 2006. In a table of 20,000 rows, I updated 5,000 randomly chosen rows. So 75% of the data was unchanged. Yet, even this low amount of updates already caused each page in the table to be read, on average, 10 times! This makes the scan highly ineffective.
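The setup of that old experiment can be roughly sketched as follows. All names and details here are illustrative, not the original 2006 code:

```sql
-- Illustrative reconstruction; not the original experiment code.
CREATE TABLE dbo.HeapDemo
    (ID      int          NOT NULL,
     Payload varchar(500) NOT NULL);

-- ... load 20,000 rows with short Payload values, then grow roughly
-- 25% of the rows so that they no longer fit and have to move:
UPDATE dbo.HeapDemo
SET    Payload = REPLICATE('x', 500)
WHERE  ID % 4 = 0;

-- With STATISTICS IO on, scanning the heap now shows far more logical
-- reads than the number of pages, because of the forwarding pointers.
SET STATISTICS IO ON;
SELECT COUNT(*) FROM dbo.HeapDemo;
```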
An RID Lookup finds the data based on the “Row IDentifier” (called Bmknnnn – the mnemonic refers to the pre-2005 term “bookmark”). This is a hidden column that stores the file number (in case of a multi-file database), page number, and slot number (position on the page) of the row. In an ideal case, that means that just one single page needs to be fetched to get the requested data. However, if the row has ever been moved because of an update, then this fetch will only return the forwarding pointer, and another fetch is required to find the location of the actual data.
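If you are curious, you can inspect the RID of each row with the undocumented %%physloc%% virtual column and its formatter function (do not rely on this in production code; the table name is, again, a made-up example):

```sql
-- sys.fn_PhysLocFormatter is undocumented; output looks like (file:page:slot).
SELECT sys.fn_PhysLocFormatter(%%physloc%%) AS RID,
       *
FROM   dbo.AppLog;          -- a hypothetical heap table
```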
Modifications in a heap are generally rather cheap. The only somewhat expensive scenario is when an updated row needs to be moved, because in that case, the engine also needs to create or modify the forwarding pointer in its original location.
Heaps can be a good choice for, for instance, staging tables in a data warehouse, or for logging tables that are normally write only, and will only be read in rare situations.
Clustered index
SQL Server allows a maximum of one clustered index on each table. The reason for this maximum is that a clustered index is not an additional structure. It defines how the table data will be stored. And we can only store it in one way.
A good analogy for a clustered index is that same book that I mentioned before, but now before you took your sharp knife and ruined it. So with the pages still in order, bound in its cover. There is a logical order to the pages, as defined by the page number. And that order is then made physical by binding the pages in that order. That makes it easy to search for a specific page, if you know the page number you need.
In SQL Server, the pages are not actually kept in physical order. So that is where the analogy with a book falls flat. Every page contains pointers to the previous page and next page, and the index can be traversed in logical order by following those pointers. The actual storage might be anywhere on the disk, though!
To speed up searches for specific index entries, SQL Server builds an index structure, called a B-tree, on top of this chain of linked pages. This B-tree always has one root page, which breaks down the full range of values in the key columns into a number of smaller ranges. For instance, for a character column, the ranges might be A-H, I-P, and Q-Z. Each of those ranges has a pointer to the data for that range.
Below the root page are zero, one, or more intermediate levels. These index pages have the same structure, and break down their range into even smaller ranges. For instance, the intermediate page for the A-H range might have entries for A-Ano, Anp-Ber, Bes-Chr, etc.
At the lowest level are the leaf pages. These are the pages already mentioned above. They store the individual rows, with all their data. And they are linked in index order by their previous page and next page pointers.
A clustered index must be unique. That does not mean that you need to declare it as such. If you create a clustered index that is not declared as unique, SQL Server will make duplicate entries unique by adding an internal 4-byte column, the so-called uniqueifier.
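For example, if you create a non-unique clustered index like the one below, SQL Server silently adds that uniqueifier to rows with duplicate key values. The names are invented for this sketch:

```sql
CREATE TABLE dbo.Contacts
    (LastName  nvarchar(50) NOT NULL,
     FirstName nvarchar(50) NOT NULL);

-- Not declared UNIQUE: duplicate LastName values are allowed, so
-- SQL Server adds a hidden 4-byte uniqueifier to tell them apart.
CREATE CLUSTERED INDEX cix_Contacts
ON dbo.Contacts (LastName);
```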
When we want all or most of the data, we can scan the index with a Clustered Index Scan. But we can also look for specific rows, using a Clustered Index Seek or Key Lookup.
A Clustered Index Scan can read the data in three ways: ordered forward, ordered backward, or unordered. The first starts by navigating through the root page and intermediate pages to the (logically) first page, and then keeps following the next page pointers. This is guaranteed to return rows in index order. A backward ordered scan is similar, but starts at the last page and follows the previous page pointers, to return data in the reverse order of the index definition. The obvious benefit of an ordered scan is, of course, that the data is in order. That can avoid the need for an expensive Sort operator in the execution plan, if that order is a requirement.
Finally, an unordered scan is similar to a Table Scan: it uses the Index Allocation Map to find all pages allocated to the clustered index. Because there are no forwarding pointers, every page is read exactly once. Including the root and intermediate pages, which are not actually needed. But in a typical B-tree index, these represent just a tiny fraction of the pages. While an unordered scan is, for larger tables, a tiny bit cheaper than an ordered scan (if the order of the data is irrelevant), it has a big problem: it does not guarantee correct results if there are concurrent modifications. That is why SQL Server will only actually use an unordered scan if the query locks the entire table, if the table is read-only, or if you explicitly allow incorrect results by using NOLOCK or the READ UNCOMMITTED transaction isolation level.
A Clustered Index Seek looks for one or more rows, based on values of the key column(s). It starts at the root page, and finds the correct range. This is repeated for each intermediate page, until a leaf page is found and scanned to find the (first) requested row. An equality search on all columns (including the uniqueifier, if applicable) can return at most one row; this is called a singleton lookup. Any inequality search, as well as any search on a subset of the index columns, can return more than one row. This is called a range seek or partial scan. In this case, the root and intermediate pages are used to find the first matching row, and then the remaining matching rows are found using the next page pointers, just as in a Clustered Index Scan.
A Key Lookup is actually just a Clustered Index Seek doing a singleton lookup. It is called a Key Lookup when the values of the key columns are found in a nonclustered index on the same table. But it does exactly the same.
Because the data in a clustered index is always (logically) ordered, modifications can be more expensive. A high volume of inserts at the end of the index can cause an insert hot spot. Conversely, inserts in random locations of the index, as well as updates that cause the row to grow, can cause page splits: when there is no room on the page for the new row, the data is distributed across two pages, and previous page and next page pointers have to be modified, and a pointer to the new page must be inserted in the appropriate intermediate or root page. And updates that modify key values cause a row to move: it will be deleted from its original location, and inserted at its new location (with of course the risk of a page split).
Generally speaking, having a clustered index on a table is almost always better than storing the data as a heap. While there are many considerations on “the perfect clustered index”, just having any clustered index already beats having no clustered index. That is why, unless you override the default options, SQL Server will automatically create a clustered index when you declare a PRIMARY KEY constraint on a table that does not have a clustered index yet. (A UNIQUE constraint, by contrast, creates a nonclustered index by default.)
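If you do want the primary key stored as a nonclustered index, so that a different column can be the clustered index key, you can override the default. The names below are illustrative:

```sql
-- PRIMARY KEY stored nonclustered; the clustered index goes on OrderDate.
CREATE TABLE dbo.Orders
    (OrderID   int          NOT NULL PRIMARY KEY NONCLUSTERED,
     OrderDate date         NOT NULL,
     Amount    decimal(9,2) NOT NULL);

CREATE CLUSTERED INDEX cix_Orders
ON dbo.Orders (OrderDate);
```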
Nonclustered index
Any on-disk table in SQL Server can have an almost unlimited number of nonclustered indexes.
In the book equivalent, a nonclustered index is like a keyword index in the back of a book. If you want to find information about a specific topic, you use the alphabetic order of the index to quickly find the keyword. To find the actual information, you then use the listed page numbers to find the pages you need to read.
Nonclustered indexes are stored the same way as clustered indexes. So we once more have a B-tree structure with root, intermediate, and leaf pages. And next page and previous page pointers to connect the leaf pages in their logical order. The only difference between the clustered index and a nonclustered index is what exactly is stored in those leaf pages.
Where a clustered index stores the table data (all columns), a nonclustered index only stores a subset of the columns. On each leaf page, you will find: the key columns of that index, the key columns of the clustered index (if any), the RID (if no clustered index), and any columns that were added to the index with the INCLUDE keyword.
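As a sketch, the index below (table and column names invented for this example) stores LastName as its key column on every level, and adds FirstName and Email, plus the clustered index key of the table, to its leaf pages only:

```sql
CREATE NONCLUSTERED INDEX ix_Customers_LastName
ON dbo.Customers (LastName)         -- key column: in root, intermediate, and leaf pages
INCLUDE (FirstName, Email);         -- included columns: in the leaf pages only
```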
Because the structure of a nonclustered index is identical to that of a clustered index, we can also access it in the same ways. We can read all or most of the data with an Index Scan, or we can find specific rows with an Index Seek. Lookups are not supported on nonclustered indexes.
An Index Scan is logically the same as a Clustered Index Scan. But because there are fewer columns on the leaf pages, more rows fit on each page, which in turn means that scanning the index requires fewer pages to be read. That can be a huge benefit if the index covers the query (i.e. if all required columns are found on the leaf pages).
An Index Seek is effectively the same as a Clustered Index Seek. Here, the reduced size of the leaf pages hardly ever results in any performance benefit. But it still comes with the same downside of not having all columns available. This might necessitate an Index Seek in an execution plan to be combined with a Key Lookup or RID Lookup to find the remaining required columns.
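To illustrate, assume a hypothetical dbo.Customers table with a nonclustered index keyed on LastName that INCLUDEs Email but not Phone (all names invented for this sketch):

```sql
-- Covered: LastName and Email are both on the leaf pages,
-- so an Index Seek alone is enough.
SELECT Email
FROM   dbo.Customers
WHERE  LastName = N'Kornelis';

-- Not covered: Phone is not in the index, so the Index Seek has to be
-- combined with a Key Lookup (or an RID Lookup, on a heap) to fetch it.
SELECT Phone
FROM   dbo.Customers
WHERE  LastName = N'Kornelis';
```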
The considerations above for modifications in a clustered index apply to nonclustered indexes as well. However, due to the typically smaller size of these indexes, page splits and hot spots tend to be less of a problem in most cases.
On the other hand, every additional nonclustered index introduces extra work for the engine when data is modified. Inserted and deleted rows need to be added to or removed from every nonclustered index as well. Updated rows require an update in any nonclustered index that has one or more affected columns in its leaf pages. And those updates can cause the index entry to move if the update modifies one of the indexed columns.
Adding nonclustered indexes is always a cost-benefit trade-off. Indexes can greatly improve performance of queries that search on or order by indexed columns, especially if the index covers the query. But too many nonclustered indexes slow down your modifications, increase database size and hence backup and restore time and index maintenance effort, and increase compilation cost. The art of performance tuning is, for a large part, the art of finding a small enough set of indexes that benefit a large enough selection of queries. Where a long list of INCLUDEd columns can increase the odds of finding an index that covers the query, it also increases the size of that index and the number of modifications that affect it.
Conclusion
On-disk rowstore indexes are the default and most commonly used storage structure in SQL Server. The general recommendation is to give every table a clustered index and to use heaps only in very specific cases, and then to add a small enough set of nonclustered indexes to help as many of your relevant queries as possible, without slowing down modifications too much.
Execution plans should ideally use scan operators when all or most of the data in a table has to be read, or when the data has to be read in the order of an index. For queries that filter on specific columns, seek operators on a supporting index can be great. If the filter is selective enough, then a lookup after a seek is often a more acceptable price than an endless list of INCLUDEd columns.
In my next blog post, I plan to look at columnstore indexes, which were introduced in SQL Server 2012 and have been improved many times since. These are typically associated with large tables in data warehouses.



