Sparse columns: misleading info in Books Online

The sparse columns feature in SQL Server 2008 is generating lots of interest from people looking to deploy extensible schemas. I've seen a few questions from people who are confused by some of the info in Books Online, particularly about adding and removing sparse columns from a table.

There's a section in BOL that states that sparse columns are added to, and removed from, existing tables by creating a new copy of each row *on the same page* and then deleting the old row, and that this can fail when the row size is around 4009 bytes (half the maximum row size when sparse columns exist). I've been trying to repro this behavior, as it seemed a nonsensical design to have used (the design was done after I'd left Microsoft, so I've never seen the underlying code), and I couldn't. I finally got around to discussing this with the dev team last week and had it confirmed that Books Online is indeed incorrect: there is no such issue with sparse columns. I've been told that BOL will be corrected.
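If you want to check for yourself, here's a minimal repro sketch (the table and column names are made up) that adds and then drops a sparse column on a table whose rows are wider than the supposed 4009-byte danger zone; it completes without errors:

-- Hypothetical repro: add and remove a sparse column on a table with wide rows
CREATE TABLE SparseAddTest (RowID INT IDENTITY, Filler CHAR (4100) NOT NULL DEFAULT 'a');
GO
INSERT INTO SparseAddTest DEFAULT VALUES;
GO 10

-- Neither of these fails, contrary to what BOL describes
ALTER TABLE SparseAddTest ADD NewCol INT SPARSE NULL;
ALTER TABLE SparseAddTest DROP COLUMN NewCol;
GO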

One other issue that's come up is whether sparse columns work with row-overflow (i.e. rows greater than the size of a page, where one or more variable-length columns are pushed into off-row storage). The answer is yes; row-overflow works just the same as when the columns aren't sparse.
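Here's a quick sketch (hypothetical table) to see it in action: two sparse variable-length columns whose combined size exceeds the in-row limit, so one gets pushed off-row, exactly as regular columns would be:

-- Hypothetical table: sparse columns and row-overflow work together
CREATE TABLE SparseOverflowTest (
   RowID INT IDENTITY,
   v1 VARCHAR (5000) SPARSE NULL,
   v2 VARCHAR (5000) SPARSE NULL);
GO
INSERT INTO SparseOverflowTest (v1, v2) VALUES (REPLICATE ('a', 5000), REPLICATE ('b', 5000));
GO

-- The ROW_OVERFLOW_DATA allocation unit should now have at least one page
SELECT [alloc_unit_type_desc], [page_count] FROM sys.dm_db_index_physical_stats (
   DB_ID (), OBJECT_ID ('SparseOverflowTest'), NULL, NULL, 'DETAILED');
GO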

Hope this helps some of you.

SQL Server 2008: Sparse columns and XML COLUMN_SET


We’re sitting here in St. Pete Beach in Florida visiting some of Kimberly’s family and having some sun-kissed R&R before heading back up to Seattle on Wednesday, and I thought I’d get the next post in my sparse columns mini-series out. Before I start though, Kimberly just posted the resources for the Accidental DBA class we taught at SQL Connections last week and in Iceland at the end of March – see here.


In my first post on sparse columns (see here) I introduced the concepts and explained why sparse columns are a useful feature. In this post I’m going to use the example I gave: a document repository with 50 document types and 20 unique attributes per document type. Yes, it’s a contrived example, but scale it up by a factor of 100+ (think Sharepoint Server) and methods like normalization no longer apply.


I’m using a CTP-6 VPC on a Lenovo T60p laptop with 4GB of memory and a dual-core 2.2GHz CPU, with the VPC running off a 7200RPM drive. Your mileage may vary for run-times of the example scripts. The VPC is the one we gave out at SQL Connections and in Iceland, and you can download the scripts and a VPC (maybe only CTP-5) from the Microsoft JumpStart site (see here for details).


The first test I’ll do is just creating the schema necessary to store the 1000+ columns of information in the test scenario. I’ll create one table with sparse columns and one without:



-- Create two tables, one with 1000 columns and one with 1000 columns but 997 sparse.
CREATE TABLE TableWithoutSparseColumns (
   DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
   c0004 INT NULL, c0005 INT NULL, c0006 INT NULL, c0007 INT NULL, c0008 INT NULL, c0009 INT NULL,
   …
   c0994 INT NULL, c0995 INT NULL, c0996 INT NULL, c0997 INT NULL, c0998 INT NULL, c0999 INT NULL,
   c1000 INT NULL);
GO


CREATE TABLE TableWithSparseColumns (
   DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
   c0004 INT SPARSE NULL, c0005 INT SPARSE NULL, c0006 INT SPARSE NULL, c0007 INT SPARSE NULL,
   …
   c0996 INT SPARSE NULL, c0997 INT SPARSE NULL, c0998 INT SPARSE NULL, c0999 INT SPARSE NULL,
   c1000 INT SPARSE NULL);
GO


I won’t list all the column names, for the sake of brevity. (If you’d rather not type 1000 column names to reproduce this, a hypothetical dynamic-SQL sketch like the following will generate the sparse table; the c0004-c1000 naming pattern comes from the abbreviated listing above, and you can drop the SPARSE keyword to generate the non-sparse version.)
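-- Generate the full sparse-table definition rather than typing it out
DECLARE @sql NVARCHAR (MAX);
DECLARE @i INT;
SET @sql = N'CREATE TABLE TableWithSparseColumns (
   DocID INT IDENTITY, DocName VARCHAR (100), DocType INT';
SET @i = 4;
WHILE @i <= 1000
BEGIN
   SET @sql = @sql + N', c' + RIGHT (N'000' + CAST (@i AS NVARCHAR (4)), 4) + N' INT SPARSE NULL';
   SET @i = @i + 1;
END;
SET @sql = @sql + N');';
EXEC (@sql);
GO

Next I’ll insert some values into each table (the same values in each table):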



-- Insert a few rows in each
INSERT INTO TableWithSparseColumns (DocName, DocType) VALUES ('aaaa', 1);
INSERT INTO TableWithSparseColumns (DocName, DocType, c0945) VALUES ('bbbb', 2, 46);
INSERT INTO TableWithSparseColumns (DocName, DocType, c0334) VALUES ('cccc', 3, 44);
INSERT INTO TableWithSparseColumns (DocName, DocType, c0233, c0234) VALUES ('dddd', 4, 12, 34);
INSERT INTO TableWithSparseColumns (DocName, DocType, c0233, c0234, c0235, c0236) VALUES ('eeee', 4, 12, 34, 46, 66);
GO


INSERT INTO TableWithoutSparseColumns (DocName, DocType) VALUES ('aaaa', 1);
INSERT INTO TableWithoutSparseColumns (DocName, DocType, c0945) VALUES ('bbbb', 2, 46);
INSERT INTO TableWithoutSparseColumns (DocName, DocType, c0334) VALUES ('cccc', 3, 44);
INSERT INTO TableWithoutSparseColumns (DocName, DocType, c0233, c0234) VALUES ('dddd', 4, 12, 34);
INSERT INTO TableWithoutSparseColumns (DocName, DocType, c0233, c0234, c0235, c0236) VALUES ('eeee', 4, 12, 34, 46, 66);
GO


Now let’s see how big each table is:



-- Now let's see how big the rows are
SELECT [avg_record_size_in_bytes], [page_count] FROM sys.dm_db_index_physical_stats (
   DB_ID ('SparseColumnsTest'), OBJECT_ID ('TableWithoutSparseColumns'), NULL, NULL, 'DETAILED');

SELECT [avg_record_size_in_bytes], [page_count] FROM sys.dm_db_index_physical_stats (
   DB_ID ('SparseColumnsTest'), OBJECT_ID ('TableWithSparseColumns'), NULL, NULL, 'DETAILED');
GO


avg_record_size_in_bytes page_count
------------------------ --------------------
4135                     5

(1 row(s) affected)

avg_record_size_in_bytes page_count
------------------------ --------------------
40.6                     1

(1 row(s) affected)


Ok, so that’s not a huge difference in page count (because we’ve only got 5 rows), but it’s a *massive* difference in average record size. Scaled up to hundreds of thousands or millions of records, the space savings will be astronomical!


Now let’s try selecting the data back, using results-to-grid mode and a simple SELECT * statement. It takes 20 seconds to return, solely because the client has to retrieve the metadata for the 1000+ columns. And even though the columns are defined as SPARSE, they all show up in the SELECT * resultset, which makes picking out the non-NULL values pretty difficult…


Time for another new feature: column sets. There’s a new column type available for use with sparse columns, an XML COLUMN_SET. This is a column that is only materialized when selected, and it returns all the non-NULL sparse columns in a row as an XML BLOB. It also changes the behavior of a SELECT * operation: all the sparse columns are removed from the resultset and replaced by the column set itself, containing the non-NULL sparse column values. I redefined our TableWithSparseColumns to have an XML COLUMN_SET column called SparseColumns (using the syntax 'SparseColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS') and re-inserted the same values.
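Here’s a sketch of the redefined table, with the column list abbreviated as before:

-- The table redefined with a column set
CREATE TABLE TableWithSparseColumns (
   DocID INT IDENTITY, DocName VARCHAR (100), DocType INT,
   c0004 INT SPARSE NULL,
   …
   c1000 INT SPARSE NULL,
   SparseColumns XML COLUMN_SET FOR ALL_SPARSE_COLUMNS);
GO

A SELECT * operation now gives the following results: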



DocID   DocName   DocType   SparseColumns
------- --------- --------- ---------------
1       aaaa      1         NULL
2       bbbb      2         <c0945>46</c0945>
3       cccc      3         <c0334>44</c0334>
4       dddd      4         <c0233>12</c0233><c0234>34</c0234>
5       eeee      4         <c0233>12</c0233><c0234>34</c0234><c0235>46</c0235><c0236>66</c0236>


Pretty cool! And it returns virtually instantaneously (obviously scaling up to hundreds of thousands or millions of rows would take longer, due to the time necessary to read the pages into the buffer pool). One downside is that the XML BLOB only returns the column name and value, not the datatype, but if your application can cope with that then not having to wade through hundreds (or thousands by RTM) of NULL column values is great.
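Note that individual sparse columns can still be referenced by name, and you can also update a row through the column set itself; a quick sketch, with the caveat that (as far as I’ve seen in the CTP) any sparse column you don’t mention in the XML gets set to NULL, so include every value you want to keep:

-- Sparse columns are still individually addressable
SELECT DocID, c0233, c0234 FROM TableWithSparseColumns WHERE DocType = 4;
GO

-- Updating through the column set; unreferenced sparse columns become NULL
UPDATE TableWithSparseColumns
SET SparseColumns = '<c0233>99</c0233><c0234>34</c0234>'
WHERE DocID = 4;
GO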


Next time I’ll discuss the internals of how sparse columns are stored.

SQL Server 2008: Sparse Columns

 

It’s been quite a while since I wrote a blog post on a flight, but I happened to be playing with a CTP-6 VPC on the way down to Orlando and thought I’d do the first of a few posts on a cool feature of SQL Server 2008: sparse columns.

One problem in database schema design is how to store heterogeneous data with many properties efficiently. Here’s an example (contrived) scenario: consider a document repository that can store up to 50 different kinds of documents, with a few common properties (like document type, document name, and last modification time) and 20 totally different attributes for each document type. The document repository needs to store the common fields, plus the per-document-type attributes, for each document.

[Edit: I had several comments (thank you) that this example is too contrived and that normalization could give some benefit to it. Ok, the example is *very* simple, to illustrate the concepts involved. Now imagine the same example with thousands of document types and thousands of user-defined attributes per document; normalization is no longer applicable. This is the Sharepoint Server scenario that drives this feature.]

What are some of the options for this in SQL Server 2005?

Single table
Define a table with 1000+ columns to allow all the document types to be stored in a single table. The first 20 columns could be for the common properties, and each subsequent set of columns stores the attributes for a single document type (e.g. columns 20-39 store the attributes for documents of type 1, columns 40-59 store the attributes for documents of type 2, and so on).

Comments on this architecture:

  • There is a huge amount of wasted space in each table row, as only a maximum of 40 columns (common fields plus per-document-type attributes) will have values in each 1000+ column record. Even if all the attributes are stored as nullable variable-length columns (e.g. SQL_VARIANT), there’s still a minimum of one bit of storage required per column (for the null bitmap entry). There’s also the CPU overhead of having to crack the SQL_VARIANT columns, and the storage overhead of having them in the first place.
  • The 8060-byte record size limit effectively limits the number and datatypes of columns that can be defined per record. You could easily have 1000 4-byte INT columns per record, for instance, but combinations of wider datatypes become tricky.
  • There is a 1024-column limit per table. This puts a hard stop on the number of document types we can store in our example.
  • Efficient indexes are impossible. Even if an index is defined to only contain the columns representing the attributes for a particular document type, it would still have a row for every document, regardless of the document type.

Vertically partition
Vertically partition the document repository such that each document type has its own table.

Comments on this architecture:

  • This allows an effectively unlimited number of document types to be supported, with a larger number of attributes for each document type, and with much more efficient indexes.
  • However, any operation that needs to work over the entire document repository has to query all the tables (e.g. select all documents with a last modification date in the last 7 days), as the sketch below illustrates.
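Here’s what that looks like (hypothetical per-type tables and column names):

-- With one table per document type, a repository-wide query must touch them all
SELECT DocID, DocName, LastModTime FROM DocType1Docs
WHERE LastModTime > DATEADD (DAY, -7, GETDATE ())
UNION ALL
SELECT DocID, DocName, LastModTime FROM DocType2Docs
WHERE LastModTime > DATEADD (DAY, -7, GETDATE ());
-- ...and so on for all 50 document-type tables
GO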

Property bag
Use a table where the per-document-type properties are stored in a LOB value (sometimes called a property bag).

Comments on this architecture:

  • This also allows an effectively unlimited number of document types and attributes.
  • Accessing attributes is very slow and expensive, requiring the LOB value to be read and parsed to find the attribute at some offset.
  • Indexing over attributes is going to be very expensive, requiring a computed column (to extract the attribute from the LOB value) for each attribute to be indexed, as sketched below.
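To make that concrete, here’s a sketch (the schema and the attribute encoding are made up) of indexing a single attribute held in a string property bag:

-- Hypothetical property bag: 'c0233=12;c0234=34;' packed into a LOB column
CREATE TABLE DocsWithPropertyBag (
   DocID INT IDENTITY PRIMARY KEY, DocType INT,
   Properties VARCHAR (MAX));
GO

-- Surface one attribute as a persisted computed column so it can be indexed
ALTER TABLE DocsWithPropertyBag ADD c0233 AS
   CASE WHEN CHARINDEX ('c0233=', Properties) > 0
      THEN CAST (SUBSTRING (Properties,
         CHARINDEX ('c0233=', Properties) + 6,
         CHARINDEX (';', Properties, CHARINDEX ('c0233=', Properties))
            - CHARINDEX ('c0233=', Properties) - 6) AS INT)
   END PERSISTED;
GO
CREATE INDEX IX_c0233 ON DocsWithPropertyBag (c0233);
GO

And you’d need one of those per attribute you want indexed.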

XML
Define an XML column which effectively acts as a property bag.

Comments on this architecture:

  • This also allows an effectively unlimited number of document types and attributes.
  • Accessing attributes is faster than using a LOB property bag but slower than regular columns, and requires XQuery operations (see the sketch below).
  • Indexing is possible using XML indexes, but they’re very space-inefficient (a primary XML index shreds the entire XML column, and the XML column itself remains).
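For example (hypothetical schema), extracting a single attribute means an XQuery call per access:

-- Hypothetical XML property bag queried with XQuery
CREATE TABLE DocsWithXmlBag (
   DocID INT IDENTITY PRIMARY KEY, DocType INT,
   Properties XML);
GO
INSERT INTO DocsWithXmlBag (DocType, Properties)
VALUES (4, '<attrs><c0233>12</c0233><c0234>34</c0234></attrs>');
GO

-- Extract one attribute from each row that has it
SELECT DocID, Properties.value ('(/attrs/c0233)[1]', 'INT') AS c0233
FROM DocsWithXmlBag
WHERE Properties.exist ('/attrs/c0233') = 1;
GO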

Basically, there’s no good way to do it in SQL Server 2005 or before.

Enter SQL Server 2008 with sparse columns.

A sparse column is a nullable column that’s declared with the SPARSE attribute. This means that when the value is null, it takes zero space; not even the single bit for the null bitmap is required. And this works even for fixed-length columns! The trade-off is that non-null sparse columns take an extra 4 bytes of space over regular columns. Here’s an example for INT columns:

  • Non-null regular INT column: 4 bytes
  • Null regular INT column: 4 bytes
  • Non-null sparse INT column: 8 bytes
  • Null sparse INT column: 0 bytes
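So, back-of-the-envelope, a sparse INT only pays off when well over half the values are NULL (8 bytes per non-null value versus a flat 4 bytes per row). Here’s a minimal sketch (table names made up) to verify the zero-space claim for NULLs:

-- Two single-attribute tables: the NULL regular INT costs 4 bytes per row,
-- the NULL sparse INT costs nothing
CREATE TABLE RegularIntTest (RowID INT IDENTITY, c1 INT NULL);
CREATE TABLE SparseIntTest (RowID INT IDENTITY, c1 INT SPARSE NULL);
GO
INSERT INTO RegularIntTest DEFAULT VALUES;
INSERT INTO SparseIntTest DEFAULT VALUES;
GO

-- Compare the average record sizes
SELECT [avg_record_size_in_bytes] FROM sys.dm_db_index_physical_stats (
   DB_ID (), OBJECT_ID ('RegularIntTest'), NULL, NULL, 'DETAILED');
SELECT [avg_record_size_in_bytes] FROM sys.dm_db_index_physical_stats (
   DB_ID (), OBJECT_ID ('SparseIntTest'), NULL, NULL, 'DETAILED');
GO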

Books Online has a table showing the potential space savings for the various data types using sparse columns. You can get to this table by looking for ‘Sparse Columns’ in the SQL Server 2008 Books Online index.

In my document repository example above, declaring each of the per-document type attributes as SPARSE would allow each record to only store the attributes needed for the document it represents, rather than every defined column – a huge space saving!

But what about the limit on the number of columns? Well, SQL Server 2008 is also bumping the number of columns per table to 30000 (see Kimberly’s blog post from yesterday), although not until the next CTP is available.

But how would indexing work? SQL Server 2008 has another new feature that helps here: filtered indexes. Conor’s blogged about these recently (see here).
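Combining the two features, something like this hypothetical index on one of the per-document-type attributes would only contain entries for the rows where that attribute is non-NULL, so it stays small:

-- A sketch: a filtered index over a single sparse attribute
CREATE NONCLUSTERED INDEX IX_c0233 ON TableWithSparseColumns (c0233)
   WHERE c0233 IS NOT NULL;
GO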

Over the next few weeks I’ll post more on sparse columns – using them, comparisons with other schemas, and anything else I can come up with.