SQL Server 2008: Sparse Columns


It’s been quite a while since I wrote a blog post on a flight, but I happened to be playing with a CTP-6 VPC on the way down to Orlando and thought I’d do the first of a few posts on a cool feature of SQL Server 2008 – sparse columns.

One problem in database schema design is how to store heterogeneous data types with many properties efficiently. Here’s a (contrived) example scenario – consider a document repository that can store up to 50 different kinds of documents, with a few common properties (like document type, document name, and last modification time) and 20 totally different attributes for each document type. The document repository needs to store the common fields, plus the per-document-type attributes for each document.

[Edit: I had several comments – thank you – that this example is too contrived and that normalization could give some benefit to it. OK – the example is *very* simple, to illustrate the concepts involved. Now imagine the same example with thousands of document types and thousands of user-defined attributes per document – normalization is no longer applicable. This is the SharePoint Server scenario that drives this feature.]

What are some of the options for this in SQL Server 2005?

Single table
Define a table with 1000+ columns to allow all the document types to be stored in a single table. The first 20 columns could be for the common properties and then each subsequent set of columns stores the attributes for a single document type (e.g. columns 21-40 store the attributes for documents of type 1, columns 41-60 store the attributes for documents of type 2, and so on). There’s a sketch of this design after the list below.

Comments on this architecture:

  • There is a huge amount of wasted space in each table row, as at most 40 columns (common fields plus per-document-type attributes) will have values in each 1000+ column record. Even if all the attributes are stored as nullable variable-length columns (e.g. SQL_VARIANT) then there’s still a minimum of one bit of storage required per column (for the null bitmap entry). There’s also the CPU overhead of having to crack the SQL_VARIANT columns, and the storage overhead of having them in the first place.
  • The 8060-byte record size limit effectively limits the number and data types of columns that can be defined per record. You could easily have 1000 4-byte INT columns per record, for instance, but combinations of wider data types become tricky.
  • There is a 1024-column limit per table. This puts a hard stop on the number of document types we can store in our example.
  • Efficient indexes are impossible. Even if an index is defined to only contain the columns representing the attributes for a particular document type, it would still have a row for every document, regardless of the document type.
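
To make the shape of this design concrete, here’s a minimal sketch, cut down to two document types with two attributes each (all table and column names are hypothetical):

CREATE TABLE [DocsSingleTable] (
    [DocID]        INT IDENTITY PRIMARY KEY,
    [DocType]      INT NOT NULL,
    [DocName]      NVARCHAR (100) NOT NULL,
    [LastModified] DATETIME NOT NULL,
    -- Attributes for document type 1
    [Type1Attr1]   INT NULL,
    [Type1Attr2]   NVARCHAR (50) NULL,
    -- Attributes for document type 2
    [Type2Attr1]   DATETIME NULL,
    [Type2Attr2]   INT NULL
    -- ... and so on, 20 columns per document type
    );
GO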

Vertically partition
Vertically partition the document repository such that each document type has its own table.

Comments on this architecture:

  • This allows an effectively unlimited number of document types to be supported, with a larger number of attributes for each document type, and with much more efficient indexes.
  • However, any operation that needs to operate over the entire document repository has to query all the tables (e.g. selecting all documents with a last modification date in the last 7 days – see the sketch below).
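
A minimal sketch, again with hypothetical names, showing one per-type table and the kind of repository-wide query that has to touch every table:

CREATE TABLE [DocsType1] (
    [DocID]        INT PRIMARY KEY,
    [DocName]      NVARCHAR (100) NOT NULL,
    [LastModified] DATETIME NOT NULL,
    [Type1Attr1]   INT NULL,
    [Type1Attr2]   NVARCHAR (50) NULL);
-- ... one such table per document type
GO

-- Any repository-wide query has to name every per-type table
-- (assuming a DocsType2 table defined the same way)
SELECT [DocName], [LastModified] FROM [DocsType1]
WHERE [LastModified] > DATEADD (DAY, -7, GETDATE ())
UNION ALL
SELECT [DocName], [LastModified] FROM [DocsType2]
WHERE [LastModified] > DATEADD (DAY, -7, GETDATE ());
-- ... and so on for all 50 document types
GO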

Property bag
Use a table where the per-document-type properties are stored in a LOB value (sometimes called a property bag).

Comments on this architecture:

  • This also allows an effectively unlimited number of document types and attributes.
  • Accessing attributes is very slow and expensive, requiring a read at some offset inside the LOB value.
  • Indexing over attributes is going to be very expensive – requiring a computed column (to extract the attribute from the LOB value) for each attribute to be indexed. See the sketch after this list.
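
As a hedged illustration (the table, the byte offsets, and the fixed-offset layout are all made up), here’s roughly what attribute access and indexing look like with a binary property bag:

CREATE TABLE [DocsPropertyBag] (
    [DocID]        INT PRIMARY KEY,
    [DocType]      INT NOT NULL,
    [Properties]   VARBINARY (MAX) NULL);
GO

-- Reading an attribute means cracking the LOB at a known offset
-- (here: a 4-byte INT starting at byte 5)
SELECT CONVERT (INT, SUBSTRING ([Properties], 5, 4)) AS [Attr]
FROM [DocsPropertyBag]
WHERE [DocType] = 1;
GO

-- Indexing an attribute requires a computed column per attribute
ALTER TABLE [DocsPropertyBag] ADD [Type1Attr1] AS
    CONVERT (INT, SUBSTRING ([Properties], 5, 4));
CREATE NONCLUSTERED INDEX [IX_Type1Attr1]
    ON [DocsPropertyBag] ([Type1Attr1]);
GO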

XML
Define an XML column which effectively acts as a property bag.

Comments on this architecture:

  • This also allows an effectively unlimited number of document types and attributes.
  • Accessing attributes is faster than using a LOB property bag but slower than regular columns, and requires XQuery operations (sketched below).
  • Indexing is possible using XML indexes, but they’re very space-inefficient (a primary XML index shreds the entire XML column, and the original XML column remains).
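
A hedged sketch (the XML shape and names are made up) of what XQuery-based attribute access looks like:

CREATE TABLE [DocsXml] (
    [DocID]        INT PRIMARY KEY,
    [DocType]      INT NOT NULL,
    [Properties]   XML NULL);
GO

-- Every attribute access is an XQuery operation via the value() method
SELECT [Properties].value('(/doc/attr1)[1]', 'INT') AS [Attr1]
FROM [DocsXml]
WHERE [DocType] = 1;
GO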

Basically – there’s no good way to do it in SQL Server 2005 or before.

Enter SQL Server 2008 with sparse columns.

A sparse column is a nullable column that’s declared with the SPARSE attribute. This means that when the value is null, it takes zero space – not even the single bit for the null bitmap is required – and this works even for fixed-length columns! The trade-off is that non-null sparse columns take an extra 4 bytes of space over regular columns. Here’s an example for INT columns:

  • Non-null regular INT column: 4 bytes
  • Null regular INT column: 4 bytes
  • Non-null sparse INT column: 8 bytes
  • Null sparse INT column: 0 bytes

Books Online has a table showing the potential space savings for the various data types using sparse columns. You can get to this table by looking for ‘Sparse Columns’ in the SQL Server 2008 Books Online index.

In my document repository example above, declaring each of the per-document-type attributes as SPARSE would allow each record to only store the attributes needed for the document it represents, rather than every defined column – a huge space saving!
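
As a cut-down sketch (hypothetical names again – the real table would have one SPARSE column per attribute for all 50 document types):

CREATE TABLE [DocsSparse] (
    [DocID]        INT IDENTITY PRIMARY KEY,
    [DocType]      INT NOT NULL,
    [DocName]      NVARCHAR (100) NOT NULL,
    [LastModified] DATETIME NOT NULL,
    -- Per-document-type attributes: take zero space when NULL
    [Type1Attr1]   INT SPARSE NULL,
    [Type1Attr2]   NVARCHAR (50) SPARSE NULL,
    [Type2Attr1]   DATETIME SPARSE NULL,
    [Type2Attr2]   INT SPARSE NULL);
GO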

But what about the limit on the number of columns? Well, SQL Server 2008 is also bumping the number of columns per table to 30,000 (see Kimberly’s blog post from yesterday) – although not until the next CTP is available.

But how would indexing work? SQL Server 2008 has another new feature that helps here – filtered indexes. Conor’s blogged about these recently (see here).
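
Against the sketch table above, a filtered index on one document type’s attribute only contains entries for rows where that attribute is non-null – so it stays small even though the table holds every document type:

-- Only index the rows that actually have a type-1 attribute value
CREATE NONCLUSTERED INDEX [IX_DocsSparse_Type1Attr1]
ON [DocsSparse] ([Type1Attr1])
WHERE [Type1Attr1] IS NOT NULL;
GO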

Over the next few weeks I’ll post more on sparse columns – using them, comparisons with other schemas, and anything else I can come up with.


Search Engine Q&A #18: What’s the current uptime of SQL Server?

Here's a quickie just before we head off to SQL Connections in Orlando.

On one of the internal MS forums was the question: how can I tell, through T-SQL, the last time SQL Server restarted (i.e. the current 'uptime')? The answer relies on the fact that all the background tasks that start when SQL Server starts must record a 'login time'.

You can get this from:

SELECT [login_time] FROM sysprocesses WHERE spid = 1;
GO

Or more simply:

SELECT MIN ([login_time]) FROM sysprocesses;
GO

Pretty neat trick!

As with the last few conferences, I'll try to blog every day during SQL Connections under the Conference Questions Pot-Pourri category.

Hope to see a bunch of you there!

PS Some people have suggested that checking the creation date of tempdb will also do the trick. That's not a *guaranteed* method, as PSS could have used trace flag 3609 to recover tempdb instead of recreating it (if they're troubleshooting some tempdb issue). In that case the creation date of tempdb will *not* be the time the server started. Checking the time in sysprocesses is the only infallible method.
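
For reference, the tempdb check being suggested is something like this (remember, it's not guaranteed):

SELECT [create_date] FROM sys.databases WHERE [name] = N'tempdb';
GO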

New script: How much of the database has changed since the last full backup?

Over the weekend there was a question on one of the internal aliases at MS: how can I tell what percentage of a database has changed since the last full backup, so I can choose between a differential or full backup?

No such code exists as far as I know – until now! I happened to read the thread while sitting in the airport in Washington D.C. on the way back from Iceland, so I started playing around, and this morning I completed the code.

The code below creates a function and a stored procedure. The basic idea behind the code is as follows:

For each online data file in the database
   For each GAM interval in the file
      Crack the DIFF map page using DBCC PAGE
      Interpret the DIFF bitmap to aggregate the changed extents
      Add the sum to the total changed extents for the database
   End
End
Report results

There’s a function that I create in msdb called SQLskillsConvertToExtents that cracks some of the DBCC PAGE output, and the main procedure is called sp_SQLskillsDIFForFULL and is created as a system object in master. I tried making it a table-valued function, but you can’t do things like INSERT ... EXEC in a function, and that’s required for processing the DBCC PAGE output. So – create your own wrapper function or whatever to use it. The interface/output is:

EXEC sp_SQLskillsDIFForFULL 'msdb';
GO
Total Extents Changed Extents Percentage Changed
------------- --------------- ----------------------
102           56              54.9

It’s been tested with multi-file databases up to around 125TB. It will not work on SQL Server 2000.

Note that after doing a full backup you will never see Changed Extents equal to zero. It will always be 4 + (number of online data files – 1), and around 20 or so for msdb. This is because the extent containing the file header in each file is always marked as changed, as are three extents in the primary file containing the roots of some critical system tables.

Anyway – here it is. You can download it in a zip file from SQLskillsDIFForFULL.zip (2.65KB). Enjoy!

/*============================================================================
  File:     SQLskillsDIFForFULL.sql
  Summary:  This script creates a system-wide SP sp_SQLskillsDIFForFULL that
            works out what percentage of a database has changed since the
            previous full database backup.

  Date:     April 2008 (revised August 2017)

  SQL Server Versions: All versions post SQL Server 2000
------------------------------------------------------------------------------
  Copyright (C) 2008-2017 Paul S. Randal, SQLskills.com
  All rights reserved.

  For more scripts and sample code, check out 
    http://www.SQLskills.com

  You may alter this code for your own *non-commercial* purposes. You may
  republish altered code as long as you give due credit.
  
  THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF 
  ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED 
  TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
  PARTICULAR PURPOSE.
============================================================================*/

-- Create the function in MSDB
--
USE msdb;
GO

IF EXISTS (SELECT * FROM sys.objects WHERE NAME = N'SQLskillsConvertToExtents')
    DROP FUNCTION [SQLskillsConvertToExtents];
GO

-- This function cracks the output from a DBCC PAGE dump
-- of an allocation bitmap. It takes a string in the form
-- "(1:8) - (1:16)" or "(1:8) -" and returns the number
-- of extents represented by the string. The first example
-- above represents 2 extents, the second 1 extent.
--
CREATE FUNCTION [SQLskillsConvertToExtents] (
    @extents VARCHAR (100))
RETURNS INTEGER
AS
BEGIN
    DECLARE @extentTotal    INT;
    DECLARE @colon          INT;
    DECLARE @firstExtent    INT;
    DECLARE @secondExtent   INT;

    SET @extentTotal = 0;
    SET @colon = CHARINDEX (':', @extents);

    -- Check for the single extent case
    --
    IF (CHARINDEX (':', @extents, @colon + 1) = 0)
        SET @extentTotal = 1;
    ELSE
    BEGIN
        -- We're in the multi-extent case
        --
        SET @firstExtent = CONVERT (INT,
            SUBSTRING (@extents, @colon + 1, CHARINDEX (')', @extents, @colon) - @colon - 1));
        SET @colon = CHARINDEX (':', @extents, @colon + 1);
        SET @secondExtent = CONVERT (INT,
            SUBSTRING (@extents, @colon + 1, CHARINDEX (')', @extents, @colon) - @colon - 1));
        SET @extentTotal = (@secondExtent - @firstExtent) / 8 + 1;
    END

    RETURN @extentTotal;
END;
GO

USE [master];
GO

IF OBJECT_ID (N'sp_SQLskillsDIFForFULL') IS NOT NULL
    DROP PROCEDURE [sp_SQLskillsDIFForFULL];
GO

-- This SP cracks all differential bitmap pages for all online
-- data files in a database. It creates a sum of changed extents
-- and reports it as follows (example small msdb):
-- 
-- EXEC sp_SQLskillsDIFForFULL 'msdb';
-- GO
--
-- Total Extents Changed Extents Percentage Changed
-- ------------- --------------- ----------------------
-- 102           56              54.9
--
-- Note that after a full backup you will always see some extents
-- marked as changed. The number will be 4 + (number of data files - 1).
-- These extents contain the file headers of each file plus the
-- roots of some of the critical system tables in file 1.
-- The number for msdb may be around 20.
--
CREATE PROCEDURE [sp_SQLskillsDIFForFULL] (
    @dbName SYSNAME)
AS
BEGIN
    SET NOCOUNT ON;

    -- Create the temp table
    --
    IF EXISTS (SELECT * FROM [msdb].[sys].[objects] WHERE NAME = N'SQLskillsDBCCPage')
        DROP TABLE [msdb].[dbo].[SQLskillsDBCCPage];

    CREATE TABLE msdb.dbo.SQLskillsDBCCPage (
        [ParentObject]  VARCHAR (100),
        [Object]        VARCHAR (100),
        [Field]         VARCHAR (100),
        [VALUE]         VARCHAR (100));	

    DECLARE @fileID          INT;
    DECLARE @fileSizePages   INT;
    DECLARE @extentID        INT;
    DECLARE @pageID          INT;
    DECLARE @DIFFtotal       BIGINT;
    DECLARE @sizeTotal       BIGINT;
    DECLARE @total           BIGINT;
    DECLARE @dbccPageString  VARCHAR (200);

    SELECT @DIFFtotal = 0;
    SELECT @sizeTotal = 0;

    -- Setup a cursor for all online data files in the database
    --
    DECLARE [files] CURSOR FOR
        SELECT [file_id], [size] FROM master.sys.master_files
        WHERE [type_desc] = N'ROWS'
        AND [state_desc] = N'ONLINE'
        AND [database_id] = DB_ID (@dbName);

    OPEN files;

    FETCH NEXT FROM [files] INTO @fileID, @fileSizePages;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SELECT @extentID = 0;

        -- The size returned from master.sys.master_files is in
        -- pages - we need to convert to extents
        --
        SELECT @sizeTotal = @sizeTotal + @fileSizePages / 8;

        WHILE (@extentID < @fileSizePages)
        BEGIN
            -- The DIFF map is page 6 in each GAM interval
            SELECT @pageID = @extentID + 6;

            -- Build the dynamic SQL
            --
            SELECT @dbccPageString = 'DBCC PAGE (['
                + @dbName + '], '
                + CAST (@fileID AS VARCHAR) + ', '
                + CAST (@pageID AS VARCHAR) + ', 3) WITH TABLERESULTS, NO_INFOMSGS';

            -- Empty out the temp table and insert into it again
            --
            TRUNCATE TABLE [msdb].[dbo].[SQLskillsDBCCPage];
            INSERT INTO [msdb].[dbo].[SQLskillsDBCCPage] EXEC (@dbccPageString);

            -- Aggregate all the changed extents using the function
            --
            SELECT @total = SUM ([msdb].[dbo].[SQLskillsConvertToExtents] ([Field]))
            FROM [msdb].[dbo].[SQLskillsDBCCPage]
            WHERE [VALUE] = '    CHANGED'
            AND [ParentObject] LIKE 'DIFF_MAP%';

            -- SUM returns NULL if no extents in this interval are
            -- marked changed, so guard against it
            SET @DIFFtotal = @DIFFtotal + ISNULL (@total, 0);

            -- Move to the next GAM interval (every 511232 pages, i.e. ~4GB)
            SET @extentID = @extentID + 511232;
        END

        FETCH NEXT FROM [files] INTO @fileID, @fileSizePages;
    END;

    -- Clean up
    --
    DROP TABLE [msdb].[dbo].[SQLskillsDBCCPage];
    CLOSE [files];
    DEALLOCATE [files];

    -- Output the results
    --
    SELECT
        @sizeTotal AS [Total Extents],
        @DIFFtotal AS [Changed Extents],
        ROUND (
            (CONVERT (FLOAT, @DIFFtotal) /
            CONVERT (FLOAT, @sizeTotal)) * 100, 2) AS [Percentage Changed];
END;
GO

-- Mark the SP as a system object
--
EXEC sys.sp_MS_marksystemobject [sp_SQLskillsDIFForFULL];
GO

-- Test to make sure everything was setup correctly
--
EXEC [sp_SQLskillsDIFForFULL] N'msdb';
GO