Today's post is based on a bunch of questions I've had around the various compression features in SQL Server 2008.

Does turning on data compression or backup compression compress the transaction log files on disk?

No. The transaction log is not compressed in any way under any circumstances. Rows from tables and indexes that have compression enabled will be logged in their row compressed forms, even if page compression is enabled.

Does data compression compress LOB columns?

No. There is no native compression support for any LOB columns (n/text, image, n/varchar(max), varbinary(max), XML), whether stored in-row or out-of-row. There's also no native compression support for FILESTREAM data.

Does log shipping use compression to compress the logs being shipped?

Log shipping does not ship transaction logs - it ships log *backups*. If backup compression is enabled for the instance hosting the log shipping primary database, or the log shipping job is changed to enable backup compression, then the log backups will be compressed and less data will be sent over the wire to the log shipping secondary(s).

Is backup compression the same as log stream compression with database mirroring?

No. Backup compression compresses backups (see my previous blog post here). Log stream compression with database mirroring compresses transaction log records before sending them between the principal and the mirror (see my previous blog post here).

Should I just turn on backup compression at the instance level?

Not necessarily. It depends whether the majority of database on the instance will benefit from backup compression. Backup compression (and any compression algorithm) uses CPU whether a decent compression ratio is achieved or not. Check what compression ratio is achieved first and then enable backup compression if its worth it. Otherwise, just enable it for individual databases.

Does data compression use the same algorithm as backup compression?

No. Backup compression uses a proprietary block-based compression algorithm that is part of Windows. Data compression uses up to 3 algorithms, depending on the level of compression configured. Row compression just makes all non-LOB columns into variable-length columns. Page compression does row compression, then common-prefix compression for each table columns, then common-value dictionary compression for each page. Details can be found at the following BOL sections: Row Compression Implementation  and Page Compression Implementation.

Hope this helps!

On Tuesday we had a look around our customer's data centers here in Austria - hidden away in the labryinthine bowels of a very large building in Vienna. Typical data centre with a halon fire extinguishing system but exceptional in its neatness and organization. The star of the show was their new HP Superdome - 32 dual-core Itaniums with 1/2 a terabyte of memory. Very nice - providing a lot of headroom for their workload to grow (currently at four hundred thousand SQL statements per *second* - spelled out to show there's no accidentally added zeros).

One of the systems they showed us is responsible for doing backups. They stripe the backups across 12 devices using multiple network cards and can manage to backup 2 terabytes in two hours after tweaking the BLOCKSIZE, BUFFERCOUNT, and MAXTRANSFERSIZE! Now comes the cool(er) part - with backup compression on SQL Server 2008 they've benchmarked backing up 2 terabytes in 36 minutes! That's a pretty awesome number and makes for some excellent disaster recovery times.

(Details printed with permission)

During early SQL Server 2008 development, the Storage Engine team introduced additional minimally-logged functionality. And then it was removed again for RTM. Now it seems that they've bowed under customer pressure and have made the functionality available again (see Sunil's announcement here).

Using trace flag 610 in the RTM build, you enable the potential for minimal-logging when:

  • Bulk loading into an empty clustered index, with no nonclustered indexes
  • Bulk loading into a non-empty heap, with no nonclustered indexes

Sunil's previous blog post here gives more info on the required syntax - very useful!

The sparse columns feature in SQL Server 2008 is generating lots of interest from people looking to deploy extensible schemas. I've seen a few questions from people that are confused by some of the info in Books Online, particularly about adding and removing sparse columns from a table.

There's a section in BOL that states that sparse columns are added and removed from existing tables by creating a new copy of each row *on the same page* and then deleting the old row, and that this can fail when the row size is around 4009 bytes (1/2 the max row size when sparse columns exist). I've been trying to repro this behavior as it seemed a nonsensical design to have used (and the design was done after I'd left Microsoft so I've never seen the underlying code) - and I couldn't. I finally got around to discussing this with the dev team last week and had it confirmed that Books Online is indeed incorrect - there is no such issue with sparse columns. I've been told that BOL will be corrected.

One other issue that's come up is whether sparse columns work with row overflow (i.e. rows greater than the size of a page, where one or more variable-length columns are pushed into off-row storage). The answer is yes, it works just the same as when the column isn't sparse.

Hope this helps some of you.

The feature article on SQL Server 2008: Tracking Changes in Your Enterprise Database I wrote for the November issue of TechNet Magazine is live on the web. It also includes a 5 minute long screencast I recorded where I demo the change data capture feature.

You can get to the article at http://technet.microsoft.com/en-us/magazine/cc987538.aspx. The topics covered are:

  • The need for tracking changes
  • Tracking changes in SQL Server 2005
  • Change tracking in SQL Server 2008
  • Change data capture in SQL Server 2008

It's written around 2-300 level and presents a good overview of both features, as well as a comparison between them. It's written for DBAs and ITPros so does not go into depth on how to program with either feature.

Checkout my previous TechNet Magazine articles and Q&A columns at http://www.sqlskills.com/blogs/paul/category/TechNet-Magazine.aspx.

Enjoy!

Just found out that the 25-page FILESTREAM whitepaper I wrote recently for the SQL team has been published on MSDN.

You can get it at http://msdn.microsoft.com/en-us/library/cc949109.aspx.

Enjoy!

Here's the table of contents.

  • Introduction
  • Choices for BLOB Storage
  • Overview of FILESTREAM
    • Dual Programming Model Access to BLOB Data
    • When to Use FILESTREAM
  • Configuring Windows for FILESTREAM
    • Hardware Selection and Configuration
    • Physical Storage Layout
    • RAID Level Choice
    • Drive Interface Choice
    • NTFS Configuration
      • Optimizing NTFS Performance
      • Cluster Size
      • Managing Fragmentation
      • Compression
      • Space Management
      • Security
    • Antivirus Considerations
    • Enabling FILESTREAM in Windows
  • Configuring SQL Server for FILESTREAM
    • Security Considerations
    • Enabling FILESTREAM in SQL Server
    • Creating a Database Enabled for FILESTREAM
    • Creating a Table for Storing FILESTREAM Data
    • Configuring FILESTREAM Garbage Collection
    • Partitioning Considerations
    • Load Balancing of FILESTREAM Data
    • Feature Combinations and Restrictions
  • Performance Tuning and Benchmarking Considerations
  • Data Migration Considerations
  • FILESTREAM Usage Best Practices
  • Conclusion

At the last Connections conference in April, the conference organizers tapped a bunch of speakers for interviews, articles, and other content to put into a free "newsletter" called MyDevConnections, and now it's finally available. It covers all the Connections conferences, so isn't just limited to SQL Server. As far as SQL is concerned, Kimberly and I wrote an article about our favorite SQL Server 2008 features (data compression and filtered indexes, respectively), and Ross Mistry wrote an article on hardening a SQL Server 2005 installation. There's also an extract from a Women in Technology interview that Kimberly took part in. Overall there's 84 pages of content in the PDF, with some adverts for the Connections shows (obviously).

Check it out at http://www.devconnections.com/mydevconnections/S08_DevOnlineMag_Web.pdf

This is the second of the two sessions I recently recorded with Richard and Greg on RunAs Radio (the first one on being an "involuntary DBA" is here). I've just finished the final edited version of a whitepaper for Microsoft on the FILESTREAM feature of SQL Server 2008 and this session goes into details of why you'd want to use it and how to setup a system for optimal FILESTREAM performance. The whitepaper should be available before PASS in November, in the meantime, checkout the show!

The show is 35 minutes long and you can download it at http://www.runasradio.com/default.aspx?showNum=74.

Enjoy!

I just heard today that the first whitepaper I've written for Microsoft has been published!

The abstract is:

SQL Server Replication: Providing High-Availability using Database Mirroring

This white paper describes how to use database mirroring to increase the availability of the replication stream in a transactional environment. It covers setting up replication in a mirrored environment, the effect of mirroring partnership state changes, and the effect of mirroring failovers on replication. In addition, it describes how to use LSN-based initialization to recover from the failover of a mirrored subscriber database.

Although brief overviews are given of both replication and database mirroring, it is easier to understand this white paper if the reader has some experience with one or both of these technologies, and has at least a rudimentary knowledge of database concepts such as transactions.

You can download it from http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/ReplicationAndDBM.docx and I'll put a link in our whitepapers page.

Enjoy!

(Quickie post #1 while it's Kimberly's turn to lecture this morning...)

I had a comment on my post dealing with suspect databases, saying that the 2008 behavior is much better. I hadn't tried this so it was (really good) news to me.

So, trying the same steps from that blog post on 2008 - everything's the same up to the point where the server has been restarted and the database is suspect. Now, if I try to detach the corrupt database, I get the following:

EXEC sp_detach_db 'DemoSuspect';
GO

Msg 3707, Level 16, State 2, Line 1
Cannot detach a suspect or recovery pending database. It must be repaired or dropped.

VERY COOL!!!! This should *really* cut down the number of times people accidentally get themselves into trouble by detaching the database.

For the sake of completeness, if I really want to detach the suspect database, I can do it in 2008 if I put the database into EMERGENCY mode first (doesn't mean I would though!)

When I started blogging, way back in 2006 :-), the third post I made on the old Storage Engine blog was about rebuilding the msdb database in 2005 (see here). This no longer works in 2008 (fellow MVP Tibor Karaszi explains why here), and in fact the information in Books Online about how to rebuild any of the 2008 system databases is incorrect. It's not something I've tried yet but people have already needed to do it (including Tibor!). So what to do?

Step in Bob Ward, a Principal Escalation Engineer with PSS, and a very good friend of mine. He's just researched and published a comprehensive blog post giving the procedure for rebuilding the system databases in 2008 using setup.exe. Checkout his great post here.

Here's hoping that you never need to do it!

Over the last few weeks I've had lots of questions about FILESTREAM performance and how to coax NTFS into scaling nicely. I just finished writing a 30-page whitepaper about FILESTREAM for the SQL Server team, which should be published before PASS in November (I'll blog the link when I have it). Although my whitepaper isn't strictly about performance, there is a long section about setting up your system to get high-performance from FILESTREAM. What I want to do in this blog post is give a bullet list of things to do that will help you get good performance. All of these are explained in more detail in the whitepaper.

Here you go, in no particular order:

  • Make sure you're storing the right-sized data in the right way. Jim Gray (et al) published a research paper a couple of years ago based called To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?. To summarize the findings, BLOBs smaller than 256-KB should be stored in a database, and 1-MB or larger should be stored in the file-system. For those in-between, "it depends" - my favorite answer. The upshot of this is that you won't get good performance if you store lots of small BLOBs in FILESTREAM.
  • Use an appropriate RAID level for the NTFS volume that will host the FILESTREAM data container (the NTFS directory structure corresponding to the FILESTREAM filegroup in the database). Don't use RAID-5, for instance, for a write-intensive workload.
  • Use an appropriate disk technology. SCSI will be usually be faster than SATA/IDE, but more expensive. This is because SCSI drives usually have higher rotational speeds, so lower latency and seek times.
  • Whichever disk technology you choose, if SATA, ensure it supports NCQ, and if SCSI, ensure it supports CTQ. Both of these allow the drives to process multiple, interleaved IOs concurrently.
  • Separate the data containers. Separate them from each other, and separate them from other database data and log files. This avoids contention for the disk heads.
  • Defrag the NTFS volume if needed before setting up FILESTREAM, and periodically to maintain good scan performance
  • Turn off 8.3 name generation on the NTFS volume. This is an order-N algorithm that has to check that the new name generated doesn't collide with any existing names in the directory. This slows insert and update performance down *a lot*. Do this using the command line fsutil utility.
  • Turn off tracking of last access time using fsutil.
  • Set the NTFS cluster size appropriately. For BLOBs 1-MB or large, use a cluster size of 64-KB. This will help to reduce fragmentation.
  • A partial update of FILESTREAM data creates a new file. Batch lots of small updates into one large update to reduce churn.
  • When streaming the data back to the client, use an ~60-KB SMB buffer size (or multiples thereof). This is so that the buffers don't get overly fragmented as TCP/IP buffer are 64-KB.

Hope this helps!

Moving databases around is pretty common, as is moving databases between servers running different Editions of SQL Server, especially during a disaster recovery situation. You may not know this, but in SQL Server 2005, if you had partitioning anywhere in a database, you could only attach/restore that database using an Enterprise or Developer instance. I say 'you may not know this' because partitioning isn't really widely used (from what I can gather from talking to customers). The upshot of this is that SQL Server 2005 will refuse to attach/restore a database with partitioning in - even if you're in a disaster recovery situation and the only server you have available has Standard Edition.

In SQL Server 2008, a lot more people are going to bump into this issue. The list of features that are Enterprise only, and will prevent attach/restore using a lower edition has expanded to include 3 new features that WILL be much more commonly used than partitioning. The four features that are in this category are:

  • Data compression
  • Partitioning
  • Transparent data encryption
  • Change data capture

All of these require elevated permissions to enable EXCEPT data compression, which only requires ALTER permission on a table. This means someone with table-owner privileges could enable compression without the DBA knowing, and suddenly the database can't be attached/restored to, say, Standard Edition.

If you're a DBA and have just taken over a database, there's now an easy way to tell whether the database contains these features. A new DMV has been added - sys.dm_db_persisted_sku_features - that will report which of these four features are present in a database. Let's check it out.

Using a 2008 Enterprise instance:

CREATE DATABASE EnterpriseOnly;
GO
USE EnterpriseOnly;
GO

CREATE TABLE compressed (c1 INT) WITH (DATA_COMPRESSION = ROW);
GO

SELECT * FROM sys.dm_db_persisted_sku_features;
GO

feature_name    feature_id
--------------  -----------
Compression     100

Now let's try backing up and restoring the database on a different edition:

BACKUP DATABASE EnterpriseOnly TO DISK = 'C:\SQLskills\EnterpriseOnly.bck';
GO

And on a 2008 Express instance:

RESTORE DATABASE EnterpriseOnly FROM DISK = 'C:\SQLskills\EnterpriseOnly.bck'
   
WITH MOVE 'EnterpriseOnly' TO 'C:\SQLskills\EnterpriseOnly.mdf',
   
MOVE 'EnterpriseOnly_log' TO 'C:\SQLskills\EnterpriseOnly_log.ldf'
GO

Processed 160 pages for database 'EnterpriseOnly', file 'EnterpriseOnly' on file 1.
Processed 5 pages for database 'EnterpriseOnly', file 'EnterpriseOnly_log' on file 1.
Msg 3167, Level 16, State 1, Line 1
RESTORE could not start database 'EnterpriseOnly'.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.
Msg 909, Level 21, State 1, Line 1
Database 'EnterpriseOnly' cannot be started in this edition of SQL Server because part or all of object 'compressed' is enabled with data compression or vardecimal storage format. Data compression and vardecimal storage format are only supported on SQL Server Enterprise Edition.
Msg 933, Level 21, State 1, Line 1
Database 'EnterpriseOnly' cannot be started because some of the database functionality is not available in the current edition of SQL Server.

Now, it's cool that it tells you exactly why the database couldn't be restored, but did you notice the first two lines of output? The database is fully restored BEFORE the operation fails! This makes perfect sense, as the database needs to be fully recovered before the server can tell whether any of the four features are still enabled or not. However, in a disaster recovery situation, waiting many hours for a database to restore only to be told that it can't be restored on this instance would be even more disastrous.

To summarize, you should always know what's happening in databases you manage. If portability of your databases to a lower Edition is important, make sure that none of these features can be enabled without you knowing about it.

While we were at TechEd in June, Kimberly and I participated in an hour-long discussion panel (Leveraging SQL Server Technologies to Build a Solid High-Availability Strategy) that was video-taped by the TechEd Online folks. It's now been edited and is available for download/viewing. We cover everything from requirements analysis to technology details in SQL Server 2008. The other panel members were Satya Jayanty, Allan Hirt, Kevin Farlee, and Amit Bansal.

You can browse the various online videos at http://technet.microsoft.com/en-us/events/teched/cc561184.aspx or go straight to the panel discussion video here.

Enjoy!

Well, almost... Kimberly and I have agreed to co-author the upcoming SQL Server 2008 Internals book with our good friend Kalen Delaney. This is the 2008 evolution of Kalen's Inside SQL Server series of books and will be published by MS Press around February 2009. The other authors are (also our good friends) Adam Machanic and Conor Cunningham.

Kimberly's going to write the chapter on Index Internals and I'm going to write the chapter on DBCC Internals. This is very exciting as these are our respective favorite subjects, as you probably already know - these won't be short chapters :-)

It's going to be a busy rest of the year - phew!

PS Next year we have plans to write a book ourselves - watch this space...

People have been complaining that I've stopped blogging so much - vacation folks, vacation! Today I've got a few class and conference posts to get through and then I'll get back to the technical posts.

We're doing a 2.5 day public class based on the SQL Server 2008 material we developed earlier this year. This will be part of a larger conference being hosted by Dev Connections in San Francisco, USA. Our workshop will run October 6th through 8th.

You can register and get more details at http://www.devconnections.com/SFWorkshops/default.asp?s=127.

Here's the abstract:

SQL Server 2008 offers an impressive array of capabilities for professional developers that build upon key innovations introduced in SQL Server 2005. The use of many of these will have manageability and infrastructure implications for a database—and hence the DBA! There are also enhancements to existing high-availability technologies, plus a variety of significant new tools to aid in managing performance, scalability, administration, and troubleshooting. This workshop helps you understand how to exploit the new toolset and how to manage a database that makes use of the new features in SQL Server 2008. The multi-day format of this event allows us to explore each feature in more detail, with more in-depth demonstrations and labs.

Topics covered include:

  • Availability Enhancements
    • Database Mirroring
    • Backup Compression
    • Peer-to-Peer Replication
  • Security Enhancements
    • Transparent Data Encryption
    • Extensible Key Management
    • All Actions Audited
  • Policy-Based Management and Multi-Server Administration
  • Troubleshooting and Throttling
    • Resource Governor
    • Extended Events
  • New Development Technologies
    • Spatial Indexes
    • Sparse Columns
    • Filtered Indexes and Statistics
    • Change Tracking and Change Data Capture
    • FILESTREAM
  • Performance Data Collection
  • Scalability Enhancements
    • Data Compression
    • Partition-Level Lock Escalation

This workshop runs Oct 6 (9am - 4pm), Oct 7 (9am - 4pm), Oct 8 (9am - 12pm).

We're sitting here in St. Pete Beach in Florida visiting some of Kimberly's family and having some sun-kissed R&R before heading back up to Seattle on Wednesday, and I thought I'd get the next post in my sparse columns mini-series out. Before I start though, Kimberly just posted the resources for the Accidental DBA class we taught at SQL Connections last week and in Iceland at the end of March - see here.

In my first post on sparse columns (see here) I introduced the concepts and explained why sparse columns are a useful feature. In this post I'm going to use the example I gave - a document repository with 50 document types and 20 unique attributes per document-type. Yes, it's a contrived example, but scale it up be a factor of 100+ (think Sharepoint Server) and methods like normalization no longer apply.

I'm using a CTP-6 VPC on a Lenovo T60P laptop with 4GB, a dual-core 2.2GHz CPU, and the VPC is running off a 6200RPM drive. Your mileage may vary for run-times of the example scripts. The VPC is the one we gave out at SQL Connections and Iceland, and you can download the scripts and a VPC (maybe only CTP-5) from the Microsoft JumpStart site (see here for details).

The first test I'll do is just creating the schema necessary to store the 1000+ columns of information in the test scenario. I'll do one with sparse columns and one without:

-- Create two tables, one with 1000 columns and one with 1000 columns but 997 sparse.
CREATE TABLE TableWithoutSparseColumns (
   DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
   c0004 INT NULL, c0005 INT NULL, c0006 INT NULL, c0007 INT NULL, c0008 INT NULL, c0009 INT NULL,
   ...
   c0994 INT NULL, c0995 INT NULL, c0996 INT NULL, c0997 INT NULL, c0998 INT NULL, c0999 INT NULL,
   c1000 INT NULL);
GO

CREATE TABLE TableWithSparseColumns (
   DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
   c0004 INT SPARSE NULL, c0005 INT SPARSE NULL, c0006 INT SPARSE NULL, c0007 INT SPARSE NULL,
   ...
   c0996 INT SPARSE NULL, c0997 INT SPARSE NULL, c0998 INT SPARSE NULL, c0999 INT SPARSE NULL,
   c1000 INT SPARSE NULL);
GO

I won't list all the column names for the sake of brevity. Next I'll insert some values into each table (the same values in each table):

-- Insert a few rows in each
INSERT INTO TableWithSparseColumns (DocName, Doctype) VALUES ('aaaa', 1);
INSERT INTO TableWithSparseColumns (DocName, Doctype, c0945) VALUES ('bbbb', 2, 46);
INSERT INTO TableWithSparseColumns (DocName, Doctype, c0334) VALUES ('cccc', 3, 44);
INSERT INTO TableWithSparseColumns (DocName, Doctype, c0233, c0234) VALUES ('dddd', 4, 12, 34);
INSERT INTO TableWithSparseColumns (DocName, Doctype, c0233, c0234,c0235,c0236) VALUES ('eeee', 4, 12, 34, 46, 66);
GO

INSERT INTO TableWithoutSparseColumns (DocName, Doctype) VALUES ('aaaa', 1);
INSERT INTO TableWithoutSparseColumns (DocName, Doctype, c0945) VALUES ('bbbb', 2, 46);
INSERT INTO TableWithoutSparseColumns (DocName, Doctype, c0334) VALUES ('cccc', 3, 44);
INSERT INTO TableWithoutSparseColumns (DocName, Doctype, c0233, c0234) VALUES ('dddd', 4, 12, 34);
INSERT INTO TableWithoutSparseColumns (DocName, Doctype, c0233, c0234,c0235,c0236) VALUES ('eeee', 4, 12, 34, 46, 66);
GO

Now let's see how big each table is:

-- Now lets see how big the rows are
SELECT [avg_record_size_in_bytes], [page_count] FROM sys.dm_db_index_physical_stats (
   DB_ID ('SparseColumnsTest'), OBJECT_ID ('TableWithoutSparseColumns'), NULL, NULL, 'DETAILED');

SELECT [avg_record_size_in_bytes], [page_count] FROM sys.dm_db_index_physical_stats (
   
DB_ID ('SparseColumnsTest'), OBJECT_ID ('TableWithSparseColumns'), NULL, NULL, 'DETAILED');
GO

avg_record_size_in_bytes page_count
------------------------ --------------------
4135                     5

(1 row(s) affected)

avg_record_size_in_bytes page_count
------------------------ --------------------
40.6                     1

(1 row(s) affected)

Ok - so that's not a huge difference in page count (because we've only got 5 rows), but it's a *massive* difference in average record size. Scaled up to hundreds of thousands or millions of records, the space savings will be astronomical!

Now let's try selecting the data back using results-to-grid mode and a simple SELECT * statement. It takes 20 seconds to return - solely because the client still has to retrieve the metadata for 1000+ columns. Even though the columns are still defined as SPARSE, they show up in a SELECT * resultset, and that makes extracting out the non-NULL values pretty difficult...

Time for another new feature - column sets. There's a new column type available for use with sparse columns - an XML COLUMN_SET. This is a column that is only materialized when selected, and will return all the non-NULL sparse columns in a row as an XML BLOB. It will also change the behavior of a SELECT * operation - removing all the sparse columns from the resultset and replacing them with itself, representing all the non-NULL sparse columns. Redefining our TableWithSparseColumns to have an XML COLUMN_SET column called SparseColumns (using the syntax 'SparseColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS'), and re-inserting the same values then gives the following results for a SELECT * operation:

DocID   DocName   DocType   SparseColumns
------- --------- --------- ---------------
1       aaaa      1         NULL
2       bbbb      2         <c0945>46</c0945>
3       cccc      3         <c0334>44</c0334>
4       dddd      4         <c0233>12</c0233><c0234>34</c0234>
5       eeee      4         <c0233>12</c0233><c0234>34</c0234><c0235>46</c0235><c0236>66</c0236>

Pretty cool - and it returns virtually instantaneously (obviously scaling up to hundreds of thousands or millions of rows would take longer due to the time necessary to read the pages into the buffer pool). One downside is that the XML blob only returns the column name and value - not the datatype - but if your application can cope with that then not having to wade through hundreds (or thousands by RTM) of NULL columns values is great.

Next time I'll discuss the internals of how sparse columns are stored.

It's been quite a while since I wrote a blog post on a flight but I happened to be playing with a CTP-6 VPC on the way down to Orlando and thought I'd do the first in a few posts on a cool feature of SQL Server 2008 - Sparse Columns.
 
One problem in database schema design is how to store heterogenous data types with many properties efficiently. Here's an example (contrived) scenario - consider a document repository that can store up to 50 different kinds of documents, with a a few common properties (like document type, document name, last modification time) and 20 totally different attributes for each column type. The document repository needs to store the common fields, plus the per-document-type attributes for each document.

[Edit: I had several comments - thank you - that this example is too contrived and that normalization could give some benefit to it - ok- the example is *very* simple to illustrate the concepts involved. Now imagine the same example with thousands of document types and thousands of user-defined attributes per document - normalization is no longer applicable. This is the Sharepoint Server scenario that drives this feature.]
 
What are some of the options for this in SQL Server 2005?

Single table
Define a table with 1000+ columns to have allow all the document types to be stored in a single table. The first 20 columns could be for the common properties and then each subsequent set of columns stores the attributes for a single document type (e.g. columns 20-39 store the attributes for documents of type 1, columns 40-59 store the attributes for documents of type 2, and so on).

Comments on this architecture:

  • There is a huge amount of wasted space for each table row - as only a maximum of 40 columns (common fields plus per-document-type attributes) will have values in each 1000+ column record. Even if all the attributes are stored as nullable variable length columns (e.g. SQLVARIANT) then there's still a minimum of 1-bit of storage required per column (for the null bitmap entry). There's also the CPU overhead of having to crack the SQLVARIANT columns, and the storage overhead of having them in the first place.
  • The 8060 byte record size limit effectively limits the number and datatypes of columns that can be defined per record. You could easily have 1000 4-byte INT columns per record, for instance, but combinations of wider data-types becomes tricky.
  • There is a 1024 column limit per table. This puts a hard stop on the number of document types we can stores in our example.
  • Efficient indexes are impossible. Even if the index is defined to only contain the columns representing the attributes for a particular document type, they would still have a row for every document, regardless of the document type.

Vertically partition
Vertically partition the document repository such that each document type has its own table.

Comments on this architecture:

  • This allows an effectively unlimited number of document types to be supported, with a larger number of attributes for each document type, and with much more efficient indexes.
  • However, any operation that needs to operate over the entire document repository has to join all the tables (e.g. select all documents with a last modification date in the last 7 days)

Property bag
Use a table where the per-document-type properties are stored in a LOB value (somtimes called a property bag).

Comments on this architecture:

  • This also allows an effectively unlimited number of document types and attributes.
  • Accessing attributes is very slow and expensive, requiring reading into an offset inside the LOB column.
  • Indexing over attributes is going to be very expensive - requiring a computed column (to extract the attribute from the LOB value) for each attribute to be indexed

XML
Define an XML column which effectively acts as a property bag.

Comments on this architecture:

  • This also allows an effectively unlimited number of document types and attributes.
  • Accessing attributes is faster than using a LOB property bag but slower than regular columns, and requires XQuery operations.
  • Indexing is possible using XML indexes, but they're very space inefficient (a primary XML index shreds the entire XML column, and the XML column remains)

Basically - there's no good way to do it in SQL Server 2005 or before.
 
Enter SQL Server 2008 with sparse columns.
 
A sparse column is a nullable column that's declared with the SPARSE attribute. This means that when the value is null, it takes zero space - not even the single bit for the null bitmap is required - and this works even for fixed-length columns! The trade-off is that non-null sparse columns take an extra 4-bytes of space over regular columns. Here's an example for INT columns:

  • Non-null regular INT column: 4 bytes
  • Null regular INT column: 4 bytes
  • Non-null sparse INT column: 8 bytes
  • Null regular INT column: 0 bytes

Books Online has a table showing the potential space savings for the various data types using sparse columns. You can get to this table by looking for 'Sparse Columns' in the SQL Server 2008 Books Online index.
 
In my document repository example above, declaring each of the per-document type attributes as SPARSE would allow each record to only store the attributes needed for the document it represents, rather than every defined column - a huge space saving!
 
But what about the limit on the number of columns? Well, SQL Server 2008 is also bumping the number of columns per table to 30000 (see Kimberly's blog post from yesterday) - although not until the next CTP is available.
 
But how would indexing work? SQL Server 2008 has another new feature that helps here - filtered indexes. Conor's blogged about these recently (see here).
 
Over the next few weeks I'll post more on sparse columns - using them, comparisons with other schemas, and anything else I can come up with.

Over the weekend there was a question on one of the internal aliases at MS: how can I tell what percentage of a database has changed since the last full backup, so I can choose between a differential or full backup?

No such code exists as far as I know - until now! I happened to read the thread while sitting in the airport in Washington D.C. on the way back from Iceland so I started playing around and this morning I completed the code.

The code below creates a function and a stored procedure. The basic idea behind the code is as follows:

For each online data file in the database
   For each GAM interval in the file
      Crack the DIFF map page using DBCC PAGE
      Interpret the DIFF bitmap to aggregate the changed extents
      Add the sum to the total changed extents for the database
   End
End
Report results

There's a function that I create in msdb call SQLskillsConvertToExtents that cracks some of the DBCC PAGE output, and the main procedure is called sp_SQLskillsDIFForFULL and it created as a system object in master. I tried making it a table-valued function but you can't do things like INSERT-EXEC in a function, and that's required for processing the DBCC PAGE output. So - create your own wrapper function or whatever to use it. The interface/output is: