Greg Linwood, a fellow SQL Server MVP, has started a series of articles in which he attempts to prove that having a clustered index on each table is not a good practice. However, he has not taken the effects of fragmentation into account, so I decided to run some tests for myself. One of those tests had rather upsetting results.
First a bit of background, for those of you who have never read a book by Kalen Delaney. A table without a clustered index (also known as a “heap”) can be read in two ways: using a bookmark lookup or a table scan.
A bookmark lookup is used when a nonclustered index can first be used to narrow down the number of rows to read. Each entry in the nonclustered index holds a pointer to the row in the heap; this pointer is called the RID (short for Row ID), and consists of a file number, a page number, and a slot number. Using the RID, the storage engine can directly fetch the required page and read the row.
A table scan is used when there is no nonclustered index or when statistics indicate that the nonclustered index is not selective enough. A table scan is driven by the IAM (Index Allocation Map), basically a bitmap that indicates which pages in the database are allocated for this table. Table scans tend to be slow, because each page in the table has to be read and each row processed. This is offset somewhat by the fact that the pages are read in order of page number, so if there is no file fragmentation, movement of the disk heads will be minimal and SQL Server can use the highly efficient read-ahead technology rather than the usual random I/O. Or at least, that’s what I (and many others) always assumed.
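To see the two access methods side by side in an execution plan, here is a minimal sketch. Note that dbo.HeapDemo and ix_HeapDemo_ID are hypothetical names used only for this illustration; they are not part of the test script below.
-- Hypothetical demo heap with a nonclustered index on ID.
CREATE TABLE dbo.HeapDemo
(ID int NOT NULL,
Payload char(200) NOT NULL);
CREATE NONCLUSTERED INDEX ix_HeapDemo_ID ON dbo.HeapDemo(ID);
go
-- Fill it with a few thousand rows so the optimizer has something to work with.
INSERT INTO dbo.HeapDemo (ID, Payload)
SELECT TOP (5000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 'x'
FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b;
go
SET SHOWPLAN_TEXT ON;
go
-- Selective predicate on the indexed column: expect an Index Seek
-- followed by a RID Lookup (bookmark lookup) into the heap.
SELECT ID, Payload FROM dbo.HeapDemo WHERE ID = 42;
-- No usable index on Payload: expect an IAM-driven Table Scan.
SELECT ID, Payload FROM dbo.HeapDemo WHERE Payload = 'y';
go
SET SHOWPLAN_TEXT OFF;
go
DROP TABLE dbo.HeapDemo;
go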
Before moving on to the issue of fragmentation, let’s first see this in action:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.Persons') AND type = N'U')
DROP TABLE dbo.Persons;
go
CREATE TABLE dbo.Persons
(PersonID int NOT NULL IDENTITY,
FName varchar(20) NOT NULL,
LName varchar(30) NOT NULL,
Email varchar(7000) NOT NULL);
go
-- Insert approx. 20,000 rows
INSERT INTO dbo.Persons (FName, LName, Email)
SELECT LEFT(FirstName, 20), LEFT(LastName, 30), EmailAddress
FROM AdventureWorks.Person.Contact;
go
-- Dummy table scan to force creation of statistics
DECLARE @Dummy int;
SELECT @Dummy = COUNT(*)
FROM dbo.Persons
WHERE Email LIKE 'brenda20@adventure-works.com%';
go
-- Completely clear out the cache
CHECKPOINT;
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
DBCC FREEPROCCACHE WITH NO_INFOMSGS;
go
-- Do table scan and show pages read, then show #pages in table.
SET STATISTICS IO ON;
SELECT *
FROM dbo.Persons
WHERE Email LIKE 'brenda20@adventure-works.com%';
SET STATISTICS IO OFF;
DBCC CHECKTABLE('dbo.Persons');
go
In case you are wondering about the “dummy table scan” – on my first tests, I got only logical reads, which was odd considering I had just forced the cache to be cleaned. Fellow MVP Steve Kass found the cause: the first time the Email column is used in a search condition, statistics have to be computed, which is done by scanning the table. The dummy scan ensures that the statistics are created before I clear out the cache.
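To verify that the dummy scan did indeed create an auto-generated statistics object on the Email column, a query like the following sketch can be used; the auto-created statistics will have a system-generated name starting with _WA_Sys_, which varies per database.
-- Check for (auto-created) statistics on dbo.Persons.
SELECT s.name AS stats_name,
       c.name AS column_name,
       s.auto_created,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
  ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
  ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID(N'dbo.Persons');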
On my SQL Server 2005 server, I got this output (edited for readability):
PersonID FName LName Email
----------- -------------------- ------------------------------ ----------------------------
14575 Brenda Lopez brenda20@adventure-works.com
Table 'Persons'. Scan count 1, logical reads 149, physical reads 0, read-ahead reads 165, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
DBCC results for 'Persons'.
There are 19972 rows in 149 pages for object "Persons".
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
There are 149 pages used by the table, and 149 pages are read. This is as expected. I must admit that I don’t know why there are 16 extra read-ahead reads, though.
Updates of rows with variable-length columns are one of the main causes of fragmentation in a heap. SQL Server will try to perform an update in place. This is always possible if the new data occupies less or the same number of bytes as the old data, but if the new data needs more room and there’s not enough space left in the page, the row has to move. SQL Server finds a page with enough free space or allocates a new page, then stores the new data of the row in that page. The data at the original location is replaced with a so-called “forwarding pointer”: a RID that points to the new location of the row. Because of this forwarding pointer, there is no need to change RID pointers to the row in existing nonclustered indexes; when trying to read this row, they can simply follow the forwarding pointer to its new location. And at the new location, a “back pointer” to the original location is added, so that if the row has to move again, the original forwarding pointer can be updated rather than forming an endless chain of forwarding pointers.
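Once a heap contains forwarded rows, their number can be checked with sys.dm_db_index_physical_stats; this is a handy way to verify the effect of the update script that follows. A quick sketch (forwarded_record_count is only filled in when the SAMPLED or DETAILED mode is used, and index_id 0 denotes the heap):
-- Count forwarded records in the dbo.Persons heap.
SELECT index_type_desc,
       page_count,
       record_count,
       forwarded_record_count
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID(N'dbo.Persons'), 0, NULL, 'DETAILED');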
Run this code now to simulate how a table might get fragmented over time as its contents are regularly updated. Note that this code does 5,000 random updates; one out of four rows on average will be updated, many rows will never be touched, and a few might even be updated twice or more.
-- Simulate update activity to create fragmentation.
-- Temporarily add index to improve performance.
CREATE UNIQUE NONCLUSTERED INDEX ix_PersonID ON dbo.Persons(PersonID);
-- Do 5,000 random updates to email column, increasing its length.
DECLARE @cnt int;
SET @cnt = 1;
WHILE @cnt <= 5000
BEGIN;
SET @cnt = @cnt + 1;
UPDATE dbo.Persons
SET Email = LEFT(Email + Email, 7000)
WHERE PersonID = CAST(RAND() * 20000 AS int);
END;
-- Drop the index, we no longer need it.
DROP INDEX dbo.Persons.ix_PersonID;
go
There now should be plenty of forwarding pointers in the table. This means that many bookmark lookups will now take two reads instead of just one. But how are table scans affected by these forwarding pointers? I see two possible methods:
1. Follow the IAM from start to finish. Ignore forwarding pointers. Process all rows, both with and without back pointers. Processing is still completely sequential.
2. Follow the IAM from start to finish. When encountering a forwarding pointer, immediately fetch the page that the pointer points to and process that row. When encountering a row with a back pointer, ignore it (it is already or will be processed when the corresponding forwarding pointer is read).
If the table is read WITH (NOLOCK) or in TRANSACTION ISOLATION LEVEL READ UNCOMMITTED, the second method has to be used. Otherwise, a row that is moved during the scan might be included twice or not at all (Paul Randal explains this in a bit more detail in his post about heaps). But in all other cases, method 1 can safely be used since the table lock issued by the table scan effectively prevents any modification while the scan is running.
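For reference, a READ UNCOMMITTED scan (the case where following the forwarding pointers is genuinely required) would look like the sketch below; the test that follows uses the default READ COMMITTED isolation level.
-- For reference only: the same scan under READ UNCOMMITTED,
-- where following forwarding pointers is actually required.
SET STATISTICS IO ON;
SELECT *
FROM dbo.Persons WITH (NOLOCK)
WHERE Email LIKE 'brenda20@adventure-works.com%';
SET STATISTICS IO OFF;
go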
So let’s see what method SQL Server picks:
-- Do table scan and show pages read, then show #pages in table.
CHECKPOINT;
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
SET STATISTICS IO ON;
SELECT *
FROM dbo.Persons
WHERE Email LIKE 'brenda20@adventure-works.com%';
SET STATISTICS IO OFF;
DBCC CHECKTABLE('dbo.Persons');
go
The relevant part of the output on my system (note that the numbers vary slightly between runs, due to the use of RAND() in the update script):
Table 'Persons'. Scan count 1, logical reads 1892, physical reads 0, read-ahead reads 222, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
DBCC results for 'Persons'.
There are 19972 rows in 174 pages for object "Persons".
The number of pages occupied by the table has grown a fair bit, but that was to be expected after doubling 5,000 email addresses. The troubling part of the output is the number of logical reads. Apparently, the average page in the table is read more than ten times! This shows that, even though the default READ COMMITTED isolation level makes it unnecessary, SQL Server follows method 2 and chases every forwarding pointer it encounters during the scan.
Reading the same page over and over again will of course significantly impact performance. This gets even worse when the size of the table exceeds the available server memory, so that not all pages fit in the data cache. Since the order in which pages are accessed is essentially random (for instance, the first page holds forwarding pointers to pages 6632, 91, 93055, 17, 494, and 220568; the second page holds forwarding pointers to pages 902, 14, 6632, 1905408, and 3; etc.), a requested page may well just have been evicted from the cache, so that it has to be read from disk once more. Because of the forwarding pointers (and an apparent oversight from Microsoft), a table scan on a heap can incur an amount of physical I/O that is a multiple of the actual size of the table. And there is no easy way to fix or prevent this, since heaps (unlike clustered indexes) cannot be defragmented. The only option is to periodically copy all data over to a new table, as that is the only way to get rid of the fragmentation.
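For completeness, here is a sketch of that copy-and-replace approach. The name dbo.Persons_Rebuilt is made up for this illustration, and a real rebuild would also have to recreate indexes, constraints, and permissions, and would need a maintenance window.
-- Rebuild the heap by copying all rows to a new table and swapping names.
SELECT PersonID, FName, LName, Email
INTO dbo.Persons_Rebuilt
FROM dbo.Persons;
go
DROP TABLE dbo.Persons;
EXEC sp_rename 'dbo.Persons_Rebuilt', 'Persons';
go
-- The page count should now match the actual amount of data again.
DBCC CHECKTABLE('dbo.Persons');
go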
Bottom line: Beware of table scans, and beware of the effects of fragmentation on a heap. Even though I agree with Greg that some tables don’t need a clustered index, those are the exception. So if you choose to use rules of thumb, make sure that having a clustered index for each table is one of them. It might not be optimal, but it’s a lot better than not having a clustered index for any table!
9 Comments
Hi Hugo,
Very interesting. I think you are making a good point, but in fact there is a way to get rid of fragmentation for a heap table:
-- populate with 10000 rows
SELECT Number, CAST('s' AS VARCHAR(890)) AS v
INTO Test
FROM Numbers
go
UPDATE TEST SET v = REPLICATE('z', 890)
go
UPDATE TEST SET v = 'shrunk'
-- now it is horribly fragmented
go
DBCC CHECKTABLE('dbo.Test')
go
/*
DBCC results for 'Test'.
There are 10000 rows in 1272 pages for object 'Test'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
*/
go
CREATE UNIQUE CLUSTERED INDEX T1 ON Test(Number)
go
DROP INDEX Test.T1
go
DBCC CHECKTABLE('dbo.Test')
go
DROP TABLE Test
/*
DBCC results for 'Test'.
There are 10000 rows in 29 pages for object 'Test'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
*/
Hi Alex,
You are right, creating and dropping a clustered index will indeed remove all fragmentation from a heap. I had not thought about that.
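Applied to the Persons table from my post, that trick would look roughly like the sketch below (ix_defrag is just a name I made up here).
-- Remove the fragmentation by building a clustered index on the heap
-- and then dropping it again. Both steps rewrite the table, and any
-- nonclustered indexes would be rebuilt during both steps as well.
CREATE CLUSTERED INDEX ix_defrag ON dbo.Persons(PersonID);
DROP INDEX dbo.Persons.ix_defrag;
go
-- The page count and the number of forwarded records should now be back to normal.
DBCC CHECKTABLE('dbo.Persons');
go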
Of course, the overhead of creating and removing a clustered index on a big table is enormous – all data has to be moved, and all nonclustered indexes have to be redone as well – and all this twice!! Unless your database has a very long maintenance window, there’s no way to do this without significantly impacting performance.
Best, Hugo
Hi Hugo
Thanks for the response to my blog – it’s good to get some public discussion going in a forum others can contribute to or learn from. I’d also like to point out that I hadn’t "forgotten" to discuss fragmentation – I simply hadn’t got around to that topic yet. Here are some thoughts in response to this thread, as spilled out on a lazy Sunday afternoon…
First, the overhead of creating and removing a CIX is certainly more than simply rebuilding a CIX, but it’s not as enormous or impossible as you’ve stated. The main overhead with the approach described by Alex above isn’t the burden of rebuilding the associated NCIXs. Whilst there certainly is overhead associated with this, by far the bigger burden is usually the sort operation that’s required to sort the unordered heap storage into the order of the CIX key(s). Depending on the size of the heap, this can be very significant. Rebuilding NCIXs is usually relatively less effort, simply due to their much smaller size. The ratio of importance between the CIX sort effort & NCIX rebuilds is a factor of how many NCIXs the table has, so there are no golden rules, but in my experience by far the bigger factor is usually the CIX sort.
Second, CIXs can be created online with SQL 2005 EE, so I have no idea how you came to the conclusion that building a CIX on a Heap will significantly impact performance. Whilst there is a performance overhead associated with indexing online, it shouldn’t significantly impact performance. Sure, not everybody can afford EE, but most who run systems that truly need to be available 24×7 should be able to.
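For reference, the online build referred to here would look roughly like the sketch below (Enterprise Edition, SQL Server 2005 and later; the index name is made up).
-- Build a clustered index on the heap while keeping the table available
-- (Enterprise Edition only; still resource-intensive on large tables).
CREATE UNIQUE CLUSTERED INDEX ix_Persons_PersonID
ON dbo.Persons (PersonID)
WITH (ONLINE = ON);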
Third & most important of all, you only need to defrag tables if your queries are scanning those tables. If you provide good indexes in the first place, your queries shouldn’t need to scan your tables other than smaller range scans. Larger range scans should be serviced by dedicated NCIXs. Assuming you have decent indexes & your queries aren’t scanning your tables, table defrags shouldn’t be required very frequently. I was actually planning to write about this in my next few blog entries & will do so more fully when time permits.
Fourth, whilst the points you’ve made about forwarding pointers are theoretically correct, the example you’ve provided really is an unrealistic cook-up. Your script performs en masse widening updates in a way that doesn’t happen in any real world systems that I’ve seen. Can you provide any real-world examples of where such behaviour actually occurs? In nearly all real world systems, the occasional widening updates can be easily managed via fill-factor to provide reasonably long maintenance windows. Whilst you do describe what can potentially happen if you deliberately set out to abuse the technology, I can’t imagine many systems suffering these hypothetical effects. When you consider that fragmentation in Heaps can be easily managed with fill-factor, defrag’d with the approach provided by Alex or rebuilt online, your points about fragmentation effects from Heaps aren’t significantly important. I’m always surprised about how hysterical people seem to get about table fragmentation when the real performance killer is index fragmentation.
Lastly, CIX bookmark lookups are definitely a major performance killer in real-world systems & this is important to take into consideration if you subscribe to the view that every table should have a CIX. Sure, the best answer to avoiding bookmark lookups is to provide good indexes in the first place, but bookmark lookups worsen the penalty for those who don’t provide the right indexes (& designing good indexes is an advanced skill). You stated in your post "A bookmark lookup is used when a nonclustered index can first be used to narrow down the number of rows to read", but you didn’t mention the far worse scenarios where bookmark lookups themselves are used to narrow the number of rows to read. When filters are evaluated across bookmark lookups, the penalty is huge, but it would be significantly lessened if no CIX existed to force the bookmark lookups. This is a very common performance killer scenario – the CIX bookmark lookups usually multiply the I/O (& locking) involved by a factor of 2 to 3 times what would be needed if the table wasn’t implemented on a CIX.
I think it’s also interesting to take into consideration that SQL Server is the only major DBMS platform which doesn’t currently provide some form of "direct-to-block" I/O method, providing only bookmark or RowID lookups instead. In my opinion, this is a substantial oversight from MS which has contributed to SQL Server falling significantly behind in TPC-C to other platforms which do provide these more efficient lookup mechanisms. Note that Oracle achieved 1.6M tpm in TPC-C on HALF the number of CPUs to SQL Server’s 1.2M tpm. Note also that Oracle were using their more efficient hash cluster technology in achieving this rather than their equivalent technology to SQL Server’s CIX (Index Organised Tables). I reckon much of that extra CPU used in SQL Server’s TPC-C result was thrown away processing (reading & locking) inefficient bookmark lookups. If you’ve got a better idea about what’s going on there, I’m all ears.
IMO, Microsoft needs to work on adding more sophisticated lookup methods to SQL Server. We’ve been working with the current tools for too long now & they haven’t changed substantially since 1998. I’m definitely a big fan of SQL Server, but in this area I believe MS needs to get their act together with some new block / page access technology & stop encouraging excessive over-use of their inefficient Clustered Index technology.
Regards,
Greg Linwood
What about a multi-column clustered index? My experience has been exceptional, in that M$ has definitely had their stuff in gear. Best practices indicate that almost never do we have a single-column clustered index "key". At least that’s my experience and my $0.02. I have not run the metrics yet, but I am very impressed with in-place production examples on large-volume DBs in both the SQL 2000 and 2005 versions. Of course, my production systems rarely exceed 20M records.
And we all know that in SQL Server 2000 there is no REBUILD WITH (ONLINE=ON) option for rebuilding indexes. Why do you think most people are rebuilding indexes at 2AM Sunday morning?
Denis
Excellent information. I am doing some work with the dynamic management views, and you can really see this with the dm_db_index_operational_stats object. After running your first script, I ran:
select cast(object_name(ddios.object_id) as varchar(30)) as objectName,
ddios.forwarded_fetch_count
from sys.dm_db_index_operational_stats(db_id(), object_id('dbo.Persons'), null, null) as ddios
And it returned:
objectName forwarded_fetch_count
Persons 0
Then I ran the fragmenting script, and reran and got:
objectName forwarded_fetch_count
Persons 3204
What is most disturbing though, is after running a simple:
Select *
from dbo.Persons
objectName forwarded_fetch_count
Persons 4676
And again:
objectName forwarded_fetch_count
Persons 6148
That’s 1472 extra operations, just for a simple select * from a 19972 row table!
Hi
We are getting up to 99% fragmentation from bulk inserts alone (24X7) on the nonclustered indexes of a table with a clustered index. Is there any way to reduce it?
Greg,
Regarding defragmenting a heap by creating and dropping a clustered index:
The entire table must be rewritten twice: once for the sort and once for the new table. Consider an offline bulk copy out and in instead.
No LOB types are permitted for an online rebuild. So no MAX types, XML, and so on. Quite a restriction.
Nonclustered indexes must be maintained, requiring yet more time and space, including the extra mapping table.
If nonclustered indexes are dropped first, then in what sense is the operation online?
Online index builds almost always result in greater remaining fragmentation than offline builds.
Building a non-unique clustered index will result in uniqueifiers being added to any nonclustered indexes temporarily.
Building a unique clustered index will abort if a user inserts a duplicate value which the online process has not processed first.
Online index building is not free. In real-life production systems with a heavy OLTP load, online is not practical.
These are some of the reasons that "building a clustered index on a heap" will significantly affect performance. There may be others.
Your point about avoiding scans on heaps is a good one, as is the remark about widening updates. However, real-life systems are not always as well designed as we might like. Missing nonclustered indexes, 100% fill factors, widening updates and so on are all things that will happen. A lot of this is down to old code which was written in a hurry, is complex and hard to maintain, and frankly people are too scared to touch it. In situations like these, adding a clustered index to the heap permanently will make the problems go away. Bear in mind that it has been shown over and over again that heaps are inferior when it comes to INSERT, UPDATE and DELETE performance – the lifeblood of OLTP.
As for a real-life example of widening updates, please feel free to pop across the ditch to see how my workplace uses transactional replication, and how often widening updates occur. This is on a system serving 60M web page impressions per day. Setting a low fill factor to work around the weakness reduces data density – less real data fits in the available pages in the buffer pool for example.
The heap versus clustered index debate was over a long time ago. Clustered indexes by default, heaps in very rare circumstances, following extensive testing and justification. Sure, both have their weaknesses, but the list is very much longer on one side. Attempting to dismiss the technical inadequacies by saying they are not important or significant is churlish, borderline arrogant. There will be systems out there where these issues are very important indeed, on a daily basis.
The point about overhead on key lookups versus RID lookups is interesting, but ultimately flawed. The extra operations will always be on data already in memory, so the only overhead we need concern ourselves with is CPU. This is cheap and getting cheaper. For the very, very small number of instances where CPU is a problem, there must be so many key lookups that it seems the clustered index must be on the wrong column. Range seeks on the clustered index are what it is good at for very large tables. On smaller tables, clustering on an identity column is pretty good practice, overall.
In your comparison with Oracle – I’m afraid it left me comparing the list prices and considering the value of benchmarks. I note that you fail to mention the much larger page size used by Oracle in those tests.
Apologies for the long reply. You’re a smart guy Greg, with lots of good insights, knowledge and experience, but this holy war on clustered indexes does you no good at all I’m sad to say.
Cheers,
Paul
thank you very much kaydesim..