Enabling compression on your database can save you a lot of space – but when you have a lot of varchar(max) or nvarchar(max) data, you may find the savings to be limited. This is because only data stored on the data and index pages is compressed, and data for the (max) data types is generally stored on other, special-purpose pages – either text/image pages or row overflow data pages. (See Understanding Pages and Extents in Books Online. That topic is from the SQL Server 2008 R2 Books Online and has apparently been removed from newer editions, but it is still valid in SQL Server 2012.)
So why does SQL Server not compress the data that, perhaps, would benefit most from compression? Here’s the answer.
SQL Server currently supports two compression methods for data in the database (backup compression is out of scope for this post).
- Row compression: This is a simple algorithm to save storage space for individual rows. It has two elements. The first is a more efficient way to store the per-row metadata, saving a few bytes per row regardless of layout and content. The second is storing almost all data types, even those that have a fixed length, as variable length. This mainly benefits the larger numerical types (e.g. a bigint with a value of 1,000 is stored in two bytes instead of eight – only values that actually need all eight bytes do not gain from this, and will instead take up more space because the actual length has to be stored somewhere) and fixed-length string types with lots of trailing spaces. For Unicode data, the SCSU algorithm is used, which saves 15% to 50% depending on the actual content of the column. (According to Wikipedia, the SCSU standard has gained very little adoption because it is not as effective as other compression schemes.)
See Row Compression Implementation and Unicode Compression Implementation in Books Online.
- Page compression: When enabled, page compression is done after row compression. As the name implies, it’s done on a per-page basis. It consists of two steps:
- Prefix compression. Within each column, an anchor value is chosen that shares as long a prefix as possible with the other values on the page; together these anchor values form the “anchor record”. All column values then only indicate how many characters of the anchor value they share as a prefix. So for example, if we have a first name column with the values Roger / Hugo / Hugh, the anchor value could be Hugh, and the data values would be stored as {0}Roger / {3}o / {4}. (Here, {3} is stored as a single byte, and {3}o means: the first three characters of Hugh, followed by an o.)
- Dictionary compression. Across the entire page, values that are now stored with the same bit pattern are replaced with a single reference to a dictionary entry. Let’s assume that the same page I used above also has a Lastname column, with the values Plowman / Kornelis / Ploo. Here, Plowman would be the anchor value, and the data after prefix compression would be {7} / {0}Kornelis / {3}o. The dictionary encoding would then see that there is a {3}o in the population of the Firstname column and a {3}o in the population of the Lastname column. It would place {3}o as the first entry in the dictionary and replace both {3}o values with the reference [1].
See Page Compression Implementation in Books Online.
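For completeness, this is what enabling either option looks like on an existing table, and how you can estimate the savings up front with sp_estimate_data_compression_savings. The table name (dbo.Orders) is made up; substitute your own, and keep in mind that both ALTER TABLE statements rebuild the table, so plan for that on large tables.

```sql
-- Estimate how much space page compression would save before committing to it.
-- (dbo.Orders is a made-up example table.)
EXEC sp_estimate_data_compression_savings
    @schema_name      = N'dbo',
    @object_name      = N'Orders',
    @index_id         = NULL,    -- all indexes
    @partition_number = NULL,    -- all partitions
    @data_compression = N'PAGE';

-- Enable row compression ...
ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = ROW);
-- ... or page compression (which includes row compression).
ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = PAGE);
```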
All elements of page compression save space by eliminating repeated data between different column values, so they will only work when multiple values are stored on a page. For all LOB pages, the reverse is the case: a single value spans multiple pages. So by definition, page compression can never yield any benefits.
For row compression, the more efficient storage of per-row metadata naturally only affects pages that have per-row metadata stored – data and index pages, but not LOB pages. And the conversion of fixed length to variable length data types also doesn’t affect LOB pages, since these can only be used for varying length data.
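If you want to verify this yourself, a quick experiment along these lines (the table and the test data are made up) shows that the LOB page count reported by sys.dm_db_partition_stats hardly moves when page compression is enabled – only the in-row pages, which here hold just the id and a pointer to the LOB data, are affected.

```sql
-- Made-up demo table with data that is guaranteed to go to LOB pages.
CREATE TABLE dbo.CompressionDemo
(
    id      int IDENTITY(1,1) PRIMARY KEY,
    payload nvarchar(max) NOT NULL
);

-- 100 rows of roughly 110,000 characters each (~220 KB per value, so off-row).
INSERT INTO dbo.CompressionDemo (payload)
SELECT REPLICATE(CONVERT(nvarchar(max), N'Some repetitive text. '), 5000)
FROM   (SELECT TOP (100) 1 AS n FROM sys.all_columns) AS x;

-- Page counts before compression.
SELECT in_row_data_page_count, lob_used_page_count
FROM   sys.dm_db_partition_stats
WHERE  object_id = OBJECT_ID(N'dbo.CompressionDemo');

ALTER TABLE dbo.CompressionDemo REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Page counts after compression: the LOB page count stays virtually the same.
SELECT in_row_data_page_count, lob_used_page_count
FROM   sys.dm_db_partition_stats
WHERE  object_id = OBJECT_ID(N'dbo.CompressionDemo');

DROP TABLE dbo.CompressionDemo;
```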
Based on the above, it is obvious why SQL Server does not compress varchar and varbinary data stored on LOB pages – there would be zero benefit from any of the implemented compression methods. But how about Unicode compression for nvarchar(max) and overflowing nvarchar(nnn) data? Wouldn’t that save some space?
To answer that, I now have to go into speculation mode. And I see two possible theories:
- Because the SCSU standard saves less space than other algorithms, the SQL Server team deliberately made this choice in order to encourage people to compress these large values in the client before sending them to the server, thereby reducing not only storage space (by more than SCSU would have yielded), but also network traffic. The downside of this is that cool features such as Full-Text Search and Semantic Search don’t understand data that was compressed at the client – at least not without a lot of extra effort.
- Since all compression algorithms work on a per-page basis, they had a choice between either first breaking the LOB data into pages and then compressing (which makes no sense, as the rest of the page would remain empty and the amount of space actually used remains the same), or creating a separate algorithm for LOB data to first compress it and then split it over multiple pages. That would of course have cost a lot of extra engineering hours, and if my understanding of SCSU is correct, it would also have a big adverse side effect on operations that affect only a part of an nvarchar(max) value (like SUBSTRING or the .WRITE method of the UPDATE statement). That is because SCSU works by traversing the entire string from left to right and can’t handle operating on only a subset of the string.
Bottom line: When you have to store large values and you want to save on storage size, your best course of action is probably to compress and decompress the values on the client side. But do beware of the consequences this has for Full-Text Search and Semantic Search!
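To sketch what that pattern looks like on the database side (the table and column names are made up): the client compresses the text, for example with GZip, and the server simply stores the resulting bytes in a varbinary(max) column. In versions released after this post was written (SQL Server 2016 and later), the built-in COMPRESS() and DECOMPRESS() functions, which also use GZip, can stand in for the client-side step if you want to try the idea out.

```sql
-- Made-up table: the compressed bytes are produced by the client and stored as-is.
CREATE TABLE dbo.Documents
(
    id        int IDENTITY(1,1) PRIMARY KEY,
    body_gzip varbinary(max) NOT NULL   -- GZip-compressed text
);

-- On SQL Server 2016+, COMPRESS()/DECOMPRESS() can simulate the client-side step.
DECLARE @text nvarchar(max) =
    REPLICATE(CONVERT(nvarchar(max), N'Lots of text... '), 10000);

INSERT INTO dbo.Documents (body_gzip) VALUES (COMPRESS(@text));

SELECT CONVERT(nvarchar(max), DECOMPRESS(body_gzip)) AS body
FROM   dbo.Documents;
```

The stored bytes are of course opaque to the server, which is exactly the Full-Text Search and Semantic Search trade-off mentioned above.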
Final note: I didn’t spend as much time on this blog post as I normally do. That’s because it actually started as a reply to a question on an internet forum, but while writing that reply I realized it was long enough to be promoted to a blog post.
12 Comments
Thanks!
I think a much simpler answer is: "because the SQL Server team didn’t implement the feature due to it not meeting some internal priority bar." It absolutely could be done — there’s no major technical barrier — and it probably would be very useful for a lot of people. But they just didn’t do it.
@Adam: Of course it can be done. But can it be done without having a big adverse effect on performance?
All the algorithms I know for effective compression of a single stream of bytes work on the basis of either a dictionary that is explicitly stored as part of the compressed data (e.g. Huffman), or a dictionary that is kept in memory only and can be reconstructed from the compressed data (e.g. Lempel–Ziv–Welch). Both categories have problems with SUBSTRING and chunked updates (the .WRITE method of the UPDATE statement).
The first category always needs to read the dictionary in addition to the pages where the required data for a SUBSTRING lies, so more IO is needed for a SUBSTRING. It might even be necessary to read and decompress everything to the left of the substring, because I don’t think it’s possible to predict where exactly the 80,000th character of a compressed varchar(max) starts. For .WRITE, the same problem exists. And after too many .WRITE updates, the compression ratio will drop unless the dictionary is recalculated and the entire string is de- and then recompressed. Plus, what happens if I use .WRITE to replace a few characters, but the replacement data doesn’t compress as well as the original data?
In the second category, it is even completely impossible to decompress a substring without first processing all the preceding bytes. And changing a chunk changes the dictionary, so any change would force the engine to de- and recompress everything that follows the changed part. Using .WRITE to replace a few characters near the start of a 750MB string can become very costly that way!
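For readers who have never used them, here is a minimal (made-up) example of the two partial-value operations this discussion is about; the concern above is that both would become expensive if the LOB value were compressed as a single stream.

```sql
-- Made-up demo: a 1,000,000-character nvarchar(max) value.
CREATE TABLE #demo (id int PRIMARY KEY, doc nvarchar(max) NOT NULL);
INSERT INTO #demo (id, doc)
VALUES (1, REPLICATE(CONVERT(nvarchar(max), N'abcdefghij'), 100000));

-- Read a small part of a large value: 20 characters starting at position 80,000.
SELECT SUBSTRING(doc, 80000, 20) AS part
FROM   #demo
WHERE  id = 1;

-- Chunked update: replace 5 characters at (zero-based) offset 80,000 with N'HELLO'.
UPDATE #demo
SET    doc.WRITE(N'HELLO', 80000, 5)
WHERE  id = 1;

DROP TABLE #demo;
```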
Rather than compressing a 750MB string as a single stream, it would probably be better to compress it in separate 1MB chunks, which would mean that SUBSTRING & .WRITE would only have to decompress/recompress a smaller section of the data.
@Peter excellent idea. Combined with a fast algorithm, e.g. LZ4 (http://code.google.com/p/lz4/), this can yield substantial I/O savings.
@Hugo I think compression is not for everybody. If you have an app that often updates a 750MB row of data, then of course it’s best to avoid compression. But SQL should support compression in scenarios where .WRITEs are used rarely or not used at all.
So, we have cloud backup in SP1 CU2, may I expect LOB compression in SP1 CU3? 🙂
Agreed with Andrew. The .write method is not especially common; I’m not sure I’ve seen it used more than one or two times. Make it a limitation of the feature and 95+% of people storing LOB data will be perfectly happy.
The fact is that some scenarios would be conveniently addressed by being able to turn on a much stronger compression scheme that also works for LOB data. The disk space, buffer pool efficiency and IO savings can be worth the (enormous) CPU cost.
SQL Server doesn’t have this feature today, but I hope they someday get around to doing it. Maybe they are waiting until they have variable page sizes implemented so they can compress 1MB chunks at once.
With a page size of >1MB one could use the same compression algorithm that is already used for backups. Backups would then not even need to compress again.
Small-chunk compression is the foundation of Column Store Indexing, which we know is very efficient and provides dramatic IO reduction and thus performance increases (even without Batch Mode Execution). And like Adam I too have hardly EVER seen user apps doing LOB updating in the wild.
And what about the extra time/metadata (and storage for that) required to track/process these separate "1MB chunks", especially if your updates routinely span those chunks (very likely, since 750MB was warranted in the 1st place)?
[…] and page-level compression only works if your data fits on a page rather than falling into LOB. Hugo Kornelis covers why that is. So that’s a weakness, which means I need a bulleted list […]
Like @Adam, I also agree with @Andrew, who said: “But SQL should support compression in scenarios where .WRITEs are used rarely or not used at all.”. Very few people will be adversely affected by the no-.WRITE() limitation, and those who are affected currently can’t compress that data anyway, so they are no worse off and might not even notice.
Regarding your statements
“it would also have a big adverse side effect on operations that affect only a part of an nvarchar(max) value (like SUBSTRING or the .WRITE method of the UPDATE statement). That is because SCSU works by traversing the entire string from left to right and can’t handle operating on only a subset of the string.”
and (in a comment):
“It might even be necessary to read and decompress everything to the left of the substring, because I don’t think it’s possible to predict where exactly the 80,000th character of a compressed varchar(max) starts.”
1) any issues caused by SCSU compression needing to start at the beginning and read everything until the desired spot would still be issues even for SCSU compressed, non-MAX NVARCHAR data, right?
2) it’s not even possible to predict where the 10th character is in a double-byte character set (for VARCHAR data) or when using an “_SC” (i.e. Supplementary Character support) collation (for NVARCHAR data), even assuming non-MAX data. Because in both cases, a “character” can be encoded with two code units. So, while you can tell, at least in the NVARCHAR “_SC” collation case, whether or not the 10th code unit is the 2nd code unit of a surrogate pair, you have no way of knowing if any of the previous “characters” used two code units or not. The 10th code unit (in terms of SUBSTRING) could be the 10th BMP “character”, or it could be the 2nd code unit of the 5th supplementary character, or it could be the 1st code unit of the 5th supplementary character if there was one BMP character to the left of it, and so on.
Hence, “needing to read everything to the left of the substring” can’t be the objection because it also applies to the non-MAX data that can be compressed.