Back in September I blogged about an old 2005 bug that prevented DBCC CHECKFILEGROUP checking the partitions of an object on the specified filegroup unless *all* partitions of the object are on the specified filegroup (not a smart way to set things up!). The bug was fixed ages ago in 2005 but has only just been fixed in 2008. The post which explains the bug in more detail is DBCC CHECKFILEGROUP bug on SQL Server 2008. This is an important feature to be able to split the consistency checks of a partitioned VLDB over a series of days - see CHECKDB From Every Angle: Consistency Checking Options for a VLDB for more details.

SQL Server 2008 SP1 Cumulative Update 6 (which you can get here) has the bug fix for 2008 finally. The KB article which describes the bug is 975991.

You're all running regular consistency checks, right?

This is an interesting performance bug concerning a broken query optimizer rule in 2008. Thanks to Dan Shargel (Twitter) for sending me info on this and letting me use some of the stuff he sent.

The scenario involves using MIN or MAX in a query against a partitioned view. In 2005 the query plan includes a TOP (1) operator which uses the right index, but in 2008 the optimizer rule was broken and the plan turned into a stream aggregate, much more expensive in this case.

Here's the 2005 query plan:

 

and here's the 2008 query plan (before the bug fix):

 

You can get the fix in CU4 for 2008 SP1 (or later) and read a bit more about it in KB 973255.

Note that you have to turn on trace flag 4199 to enable the fix - that requirement will be removed in SQL11.

The 35-page whitepaper on high availability I wrote for the SQL team over the summer has been published on MSDN. It's a 2-300 level whitepaper that describes the various high-availability technologies in SQL Server 2008 and how they can be used to mitigate disasters. It's chock-full of links to other whitepapers, technical articles and Books Online sections and also presents my methodology for planning a high-availability strategy.

You can get it at http://msdn.microsoft.com/en-us/library/ee523927.aspx.

Enjoy!

Here's the table of contents:

  • Introduction
  • Causes of Downtime and Data Loss
    • Planned Downtime
    • Unplanned Downtime and Data Loss
  • Planning a High-Availability Strategy
    • Requirements
    • Limitations
    • Technology Evaluation
  • SQL Server 2008 High-Availability Technologies
    • Logging and Recovery
    • Backup, Restore, and Related Technologies
      • Partial Database Availability and Online Piecemeal Restore
      • Instant File Initialization
      • Mirrored Backups
      • Backup Checksums
      • Backup Compression
    • Other Single-Instance Technologies
      • Online Operations
      • Database Snapshots
      • Hot-Add Memory and CPU
      • Resource Governor
    • Multi-Instance Technologies
      • Log Shipping
      • Transactional Replication
      • Database Mirroring
      • Failover Clustering
      • Combining Multi-Instance Technologies
      • Virtualization
  • Mitigating the Causes of Downtime and Data Loss
  • High-Availability Features Supported by SQL Server 2008 Editions
  • Conclusion

Here's an interesting bug that surfaced recently, first reported by Bryan Smith on the MSDN disaster recovery/HA forum three weeks ago.

One of the mechanisms I advise for running consistency checks on VLDBs with multiple filegroups is to run successive DBCC CHECKFILEGROUP operations on the filegroups holding the partitions (see CHECKDB From Every Angle: Consistency Checking Options for a VLDB). Prior to SQL Server 2005 SP3, if a table or index is split into multiple partitions, then DBCC CHECKFILEGROUP would skip checking the entire table or index if it was partitioned over multiple filegroups. From SQL Server 2005 SP3 onwards, DBCC CHECKFILEGROUP will validate only the partitions of tables and indexes that reside on the filegroup being checked - rather than skipping the whole table or index - a big improvement.

Now it seems that SQL Server 2008 has a bug where it essentially has regressed back to the old behavior where DBCC CHECKFILEGROUP will skip a table or index if it's not wholely contained on the filegroup being checked.

Here's a script you can use to test this. It creates a partitioned table over multiple filegroups and then runs DBCC CHECKFILEGROUP on the first partition. I'll discuss the results after the script.

CREATE DATABASE DBMaint2008;
GO
USE DBMaint2008;
GO

ALTER DATABASE DBMaint2008 ADD FILEGROUP DataPartition1;
ALTER DATABASE DBMaint2008 ADD FILEGROUP DataPartition2;
ALTER DATABASE DBMaint2008 ADD FILEGROUP DataPartition3;
GO

ALTER DATABASE DBMaint2008 ADD FILE
   (NAME = N'DataPartition1', FILENAME = N'C:\SQLskills\DataPartition1.ndf',
   SIZE = 10, FILEGROWTH = 10)
TO FILEGROUP DataPartition1;
GO
ALTER DATABASE DBMaint2008 ADD FILE
   (NAME = N'DataPartition2', FILENAME = N'C:\SQLskills\DataPartition2.ndf',
   SIZE = 10, FILEGROWTH = 10)
TO FILEGROUP DataPartition2;
GO
ALTER DATABASE DBMaint2008 ADD FILE
   (NAME = N'DataPartition3', FILENAME = N'C:\SQLskills\DataPartition3.ndf',
   SIZE = 10, FILEGROWTH = 10)
TO FILEGROUP DataPartition3;
GO

CREATE PARTITION FUNCTION Partitions_PFN (INT)
AS RANGE RIGHT FOR VALUES (1000, 2000);
GO

CREATE PARTITION SCHEME [Partitions_PS]
AS PARTITION [Partitions_PFN] TO (DataPartition1, DataPartition2, DataPartition3);
GO

CREATE TABLE TestTable (c1 INT IDENTITY, c2 DATETIME DEFAULT GETDATE ());
CREATE UNIQUE CLUSTERED INDEX TestPK ON TestTable (c1) ON Partitions_PS (c1);
GO

SET NOCOUNT ON;
GO
INSERT INTO TestTable DEFAULT VALUES;
GO 3000

DBCC CHECKFILEGROUP (DataPartition1);
GO

On SQL Server 2005 SP3, the output from the final batch contains:

DBCC results for 'TestTable'.
Cannot process rowset ID 72057594038452224 of object "TestTable" (ID 2073058421), index "TestPK" (ID 1), because it resides on filegroup "DataPartition2" (ID 3), which was not checked. 
Cannot process rowset ID 72057594038517760 of object "TestTable" (ID 2073058421), index "TestPK" (ID 1), because it resides on filegroup "DataPartition3" (ID 4), which was not checked. 
There are 999 rows in 3 pages for object "TestTable".

This shows that it processed the 1000 rows in the first partition, but not the other two - as we'd expect. 

On SQL Server 2008, the output for TestTable is limited to:

Cannot process rowset ID 72057594038910976 of object "TestTable" (ID 2105058535), index "TestPK" (ID 1), because it resides on filegroup "DataPartition2" (ID 3), which was not checked.

And that's it - nothing about it processing any rows in partition 1. This shows that DBCC didn't process the first partition as we'd expect - this becomes even more apparent with very large amounts of data, where DBCC CHECKFILEGROUP will just complete almost instantly.

As Bryan says in his post, Microsoft has acknowledged this is a bug and it should hopefully be fixed for 2008 CU5. In the meantime, this is something you should be aware of as your tables may not be being checked properly.

A couple of weeks ago I kicked off the latest survey, on what the physical layout of your databases are and why you have them that way (see here for the survey). I let the survey run for a while to get a good sampling, and I wasn't disappointed, with over 1000 responses! Here are the results as of 4/27/2009.

Just like any other 'best practice' kind of topic, the question of how to design the physical layout of a database provokes a lot of (sometimes heated) discussion. There are lots of options and there are even more factors to consider - so the best answer is my perennial favorite "it depends"! In this post, I don't want to tell you how I think you should layout your database - instead I want to discuss some of the options and let you make up your own mind, with the added benefit of data on what your peers are doing with their databases. The main point of this survey was to see what people are doing, rather than as a driver for an editorial blog post.

As you can clearly see from the results above, and predictably, the distribution of layout types shifts as the database size increases - but I was very surprised by the number of single file databases over 10GB. Rather than go through each option in the survey, I'm going to talk a bit about some of the things to consider when planning a layout.

Underlying I/O subsystem

This could be the most important factor to consider. If you only have a single physical drive, for instance, there's arguably not much point creating multiple data files, as that will force the disk heads to bounce back and forth to the different file locations on the disk. On the other hand, if you have a SAN with several thousand drives grouped together into multiple LUNs, your possibilities are a lot wider (and maybe much harder to come up with the optimal layout). Several people asked if I'd go into depth around having multiple controllers, and different drive layouts in a SAN - and my answer is no. I'm not an expert at storage design, which, like indexing, is both an art and a science. There's a good whitepaper that discusses some of this: Physical Database Storage Design, which I helped review back in my MS days.

Performance, recoverability, manageability

Having multiple files with separate storage for each allows reads and writes to be parallelized for increased performance, lowering the amount of disk head contention. When a checkpoint occurs, and pages are written to disk, spreading the I/O load over multiples files can speed up the checkpoint and reduce the IOPS spikes that you may see. It can also lead to reduced contention for the various allocation bitmaps - in the same way as I've described for tempdb. In user databases with a very high rate of allocations, contention can arise on the GAM pages - but it's not common. Some people also advocate having separate filegroups for tables and indexes, and although this can sometimes be more trouble than it's worth, and often turns into a religious debate, I have heard of people getting a perf boost from this.

One of the most convincing reasons (I find) for having multiple filegroups is the ability to do much more targeted recovery. With a single file database, if it gets corrupted or lost, you have to restore the whole database, no matter how large the file is - and this can seriously affect your ability to recover within the RTO (Recovery Time Objective) agreement. By splitting the database into multiple filegroups, you can make use of partial database availability and online piecemeal restores (in Enterprise Edition) to allow the database to be online as soon as the primary filegroup is online, and then restore the remaining filegroups in priority order - bringing the application online as soon as the relevant filegroups are online. You can even use this layout to spread your backup workload - moving to filegroup-based backups instead of database backups, although this isn't very common.

As far as manageability is concerned, there are a few reasons to have multiple filegroups. Firstly, you can isolate a table that requires a lot of I/O (e.g. in terms of index maintenance) on separate storage from other tables, so that maintenance operations (and the I/O overhead of doing them) doesn't interfere with the I/O of the other tables. Also, you can provision different kinds of storage for different tables - in terms of disk speed and RAID level (redundancy), for instance. If you want to be able to move data around, you can do it much more easily if the database is split up, than if it's a single file.

Summary 

Ok - so I lied. I *am* going to offer advice - against one of the options: single filegroup, single file. For smaller databases, this is fine - but as the database size gets larger, say, over tens of GB, then having a single file can become a serious liability. With a single file database (or even a single filegroup database), you lose most of the benefits mentioned above.

Bottom line - as your databases get larger, you're going to need to think more carefully about their layout, otherwise you could run into big problems as your workload increases or when disaster strikes. As the survey results show, this is what your peers are doing.

Next post - this week's survey!

This week's survey is a little more complicated. I'm interested in the physical layout of your databases. I've got four surveys, for a variety of database sizes. Please vote multiple times in each survey, as you see fit - and by all means forward this link to your friends/clients/etc or re-blog/tweet it. I'm going to report on this survey in two weeks, to give a bit more time for people to respond (and because we're travelling next week). I think we're going to see some interesting statistics come out of this - the more people that respond the better. I'll report on it 4/24/09.

One thing to note - this is just for user databases, not for tempdb. In the surveys, "multiple filegroups" implies multiple files too, and if you don't have them spread exactly one per drive/etc, just choose that option - I only have 10 options to choose from in the free surveys. 

(If you're in the over 1TB range and have multiple files/filesgroups spread over multiple drives/arrays/LUNs, vote using the last option on the >1TB survey and I'll lump them together.) 

Phew - thanks!

Back in 2005 Kimberly produced two very popular webcast series - an 11-part webcast series for TechNet called SQL Server 2005 for the IT Professional and a 10-part webcast series for MSDN called A Primer for Proper SQL Server Development. The webcast links and blog posts were broken for quite a while but now they're all fixed up and working again. I've created some web pages that link to all the webcasts and blog posts, along with abstracts. I've also included some more recent ones too and will be adding to the list over the next few weeks.

There's over 30 hours of good stuff to watch - check them out at http://www.sqlskills.com/webcasts.asp

OK - last content post today. I forgot that the February TechNet Magazine also has the latest edition of my regular SQL Q&A column. This month's column covers:

  • Should backup compression be enabled at the instance level?
  • Client redirection during database mirroring failovers
  • Partition-level lock escalation in SQL Server 2008
  • Is it ever safe to rebuild a transaction log?

Check out the column at http://technet.microsoft.com/en-us/magazine/2009.02.sqlqa.aspx

Moving databases around is pretty common, as is moving databases between servers running different Editions of SQL Server, especially during a disaster recovery situation. You may not know this, but in SQL Server 2005, if you had partitioning anywhere in a database, you could only attach/restore that database using an Enterprise or Developer instance. I say 'you may not know this' because partitioning isn't really widely used (from what I can gather from talking to customers). The upshot of this is that SQL Server 2005 will refuse to attach/restore a database with partitioning in - even if you're in a disaster recovery situation and the only server you have available has Standard Edition.

In SQL Server 2008, a lot more people are going to bump into this issue. The list of features that are Enterprise only, and will prevent attach/restore using a lower edition has expanded to include 3 new features that WILL be much more commonly used than partitioning. The four features that are in this category are:

  • Data compression
  • Partitioning
  • Transparent data encryption
  • Change data capture

All of these require elevated permissions to enable EXCEPT data compression, which only requires ALTER permission on a table. This means someone with table-owner privileges could enable compression without the DBA knowing, and suddenly the database can't be attached/restored to, say, Standard Edition.

If you're a DBA and have just taken over a database, there's now an easy way to tell whether the database contains these features. A new DMV has been added - sys.dm_db_persisted_sku_features - that will report which of these four features are present in a database. Let's check it out.

Using a 2008 Enterprise instance:

CREATE DATABASE EnterpriseOnly;
GO
USE EnterpriseOnly;
GO

CREATE TABLE compressed (c1 INT) WITH (DATA_COMPRESSION = ROW);
GO

SELECT * FROM sys.dm_db_persisted_sku_features;
GO

feature_name    feature_id
--------------  -----------
Compression     100

Now let's try backing up and restoring the database on a different edition:

BACKUP DATABASE EnterpriseOnly TO DISK = 'C:\SQLskills\EnterpriseOnly.bck';
GO

And on a 2008 Express instance:

RESTORE DATABASE EnterpriseOnly FROM DISK = 'C:\SQLskills\EnterpriseOnly.bck'
   
WITH MOVE 'EnterpriseOnly' TO 'C:\SQLskills\EnterpriseOnly.mdf',
   
MOVE 'EnterpriseOnly_log' TO 'C:\SQLskills\EnterpriseOnly_log.ldf'
GO

Processed 160 pages for database 'EnterpriseOnly', file 'EnterpriseOnly' on file 1.
Processed 5 pages for database 'EnterpriseOnly', file 'EnterpriseOnly_log' on file 1.
Msg 3167, Level 16, State 1, Line 1
RESTORE could not start database 'EnterpriseOnly'.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.
Msg 909, Level 21, State 1, Line 1
Database 'EnterpriseOnly' cannot be started in this edition of SQL Server because part or all of object 'compressed' is enabled with data compression or vardecimal storage format. Data compression and vardecimal storage format are only supported on SQL Server Enterprise Edition.
Msg 933, Level 21, State 1, Line 1
Database 'EnterpriseOnly' cannot be started because some of the database functionality is not available in the current edition of SQL Server.

Now, it's cool that it tells you exactly why the database couldn't be restored, but did you notice the first two lines of output? The database is fully restored BEFORE the operation fails! This makes perfect sense, as the database needs to be fully recovered before the server can tell whether any of the four features are still enabled or not. However, in a disaster recovery situation, waiting many hours for a database to restore only to be told that it can't be restored on this instance would be even more disastrous.

To summarize, you should always know what's happening in databases you manage. If portability of your databases to a lower Edition is important, make sure that none of these features can be enabled without you knowing about it.

Now we're back from Iceland and I have a week to catch up with some content development before the MVP Summit next week and then SQL Connections the following week.

One of the things I struggled with earlier in the year while writing a SQL Server 2008 training course for Microsoft was how to get FILESTREAM to work with partitioning. There wasn't (and still isn't) any information in Books Online that I could find so I had to play around to figure it out.

I should say that the CTP-6/February CTP version of Books Online *does* have a bunch of code examples around using FILESTREAM, so I'm not going to write a blog post about that. Look in the Getting Started with FILESTREAM Storage section (or paste this link into the Books Online URL: window).

Back to partitioning - first I created a test database:

CREATE DATABASE FileStreamTestDB
ON PRIMARY
   
(NAME = FileStreamTestDB_data,
      
FILENAME = N'C:\Metro Demos\FileStreamTestDB\FSTestDB_data.mdf'),
FILEGROUP FileStreamFileGroup CONTAINS FILESTREAM
   
(NAME = FileStreamTestDBDocuments,
      
FILENAME = N'C:\Metro Demos\FileStreamTestDB\Documents')
LOG ON
   
(NAME = 'FileStreamTestDB_log',
      
FILENAME = N'C:\Metro Demos\FileStreamTestDB\FSTestDB_log.ldf');
GO

Then I tried the obvious, knowing that I can't partition on the ROWGUIDCOL:

CREATE PARTITION FUNCTION MyPartFunction (INT) AS RANGE RIGHT FOR VALUES (1000, 2000);

CREATE PARTITION SCHEME MyPartScheme AS PARTITION MyPartFunction ALL TO ([PRIMARY]);

CREATE TABLE FileStreamTest (
   
TestId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
   
Customer INT,
   
[
Name] VARCHAR (25),
   
[
Document] VARBINARY(MAX) FILESTREAM)
ON MyPartScheme (Customer);
GO

Partition scheme 'MyPartScheme' has been created successfully. 'PRIMARY' is marked as the next used filegroup in partition scheme 'MyPartScheme'.
Msg 1921, Level 16, State 4, Line 8
Invalid filegroup 'default' specified.

Eventually I worked out that you have to define a separate partitioning scheme just for FILESTREAM data. This is because the regular data is stored on non-FILESTREAM filegroups, so trying to use the regular partitioning scheme for FILESTREAM would mean telling the Engine to store the FILESTREAM data in non-FILESTREAM filegroups. Clearly a non-starter. Ok - try again with a separate partitioning scheme (the prior MyPartFunction partition function and MyPartScheme partition scheme already exist now remember):

CREATE PARTITION SCHEME MyFSPartScheme AS PARTITION MyPartFunction ALL TO ([FileStreamFileGroup]);

CREATE TABLE FileStreamTest (
   
TestId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
   
Customer INT,
   
[Name] VARCHAR (25),
   
[Document] VARBINARY(MAX) FILESTREAM)
ON MyPartScheme (Customer)
FILESTREAM_ON MyFSPartScheme;
GO

Partition scheme 'MyFSPartScheme' has been created successfully. 'FileStreamFileGroup' is marked as the next used filegroup in partition scheme 'MyFSPartScheme'.
Msg 1908, Level 16, State 1, Line 1
Column 'Customer' is partitioning column of the index 'UQ__FileStreamTest__03317E3D'. Partition columns for a unique index must be a subset of the index key.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint. See previous errors.

Hmm - I can't partition on Customer because there's already a unique index over TestId - UNLESS I specifically set the unique index on TestId to be non-partitioned by setting a filegroup for it:

CREATE TABLE FileStreamTest (
   
TestId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE ON [PRIMARY],
   Customer INT,
   
[Name] VARCHAR (25),
   
[Document] VARBINARY(MAX) FILESTREAM)
ON MyPartScheme (Customer)
FILESTREAM_ON MyFSPartScheme;
GO

That works! Now - the BIG issue with this setup is that switching partitions won't work while the unaligned index is enabled. So how to disable it? First we need to find out what it's called:

SELECT * FROM sys.indexes WHERE [object_id] = OBJECT_ID ('FileStreamTest');
GO

And then disable it:

ALTER INDEX UQ__FileStre__8CC33161060DEAE8 ON FileStreamTest DISABLE;

Now you can do partition switching. Here's the catch - to re-enable the index you need to REBUILD it - which is a size of data operation! The upshot of all this is that partitioning can be made to work with FILESTREAM data but partition switching is no longer a metadata-only operation.

Hopefully this will be addressed for V2.

A couple of weeks ago I blogged about the three tracks of the SQL Server 2008 JumpStart course that SQLskills.com taught internally for Microsoft and some MVPs - see here for details. Well, the content is now available to download! Note that this was based on CTP-5 (November 2007 CTP) and there have been *lots* of behavioral changes since then (with more planned for CTP-6 Refresh and RTM), but if you want a high-level overview of a bunch of the features (albeit in bullet-point summaries on slides) then this is a good place to start.

For me, what's *REALLY* cool is that the site also has a downloadable VPC plus lab manuals for all of the AlwaysOn High-Availability hands-on labs that SQLskills.com wrote. We originally wrote these labs for SQL Server 2005 and I updated them all for CTP-5. The VPC has a long lab on each of the following:

  • Database Snapshots
  • Data Recovery and Preventative Techniques
  • Instant Initialization
  • Peer-to-Peer Replication (including the new Topology Wizard I blogged about here)
  • Table and Index Partitioning
  • Snapshot Isolation
  • Online Operations
  • Database Mirroring (including a demo I wrote of Automatic Page Repair, described here)
  • Service Oriented Database Architecture

There is some great depth in each of these - Kimberly blogged more info about the exercises in each lab here. There's also another VPC image with some higher-level labs on a variety of 2008 features and written by a number of different people- including some labs on Policy-Based Management and Performance Data Collection that Kimberly wrote.

So - where can you get these from? Go to http://sqlserver2008jumpstart.microsofttraining.com/ and hit the Download link on the right-hand side. Register and then you can get to the materials. The AlwaysOn VPC image is Collection 2 at the bottom of the page, and you'll see all the slide decks as you scroll down the page.

Enjoy!

Phew - last week Kimberly and I spent 3 days teaching the ins-and-outs of SQL Server 2008 for DBAs/IT-Pros to about 130 Microsoft SQL Server experts and MVPs (like Kalen Delaney, Adam Machanic and Ron Talmage). This was the (95% complete) Beta delivery of a course we've been developing for the last six months for Microsoft that they'll use to train their SQL experts around the world on the new release. It's been very interesting watching the features develop through the CTPs (especially since I left the fold last August) - and making demos work on pre-release builds of the CTPs.

Teaching the course was a *blast* - the thing I love about teaching a really geeky crowd is the plethora of great questions and opportunities for going deep with explanations. Our team actually wrote and delivered the concurrently presented Developer and BI tracks as well. As you can see from the list below (and this is just the features a DBA needs to use/know about), SQL Server 2008 isn't a dot release of Yukon at all, as some people have suggested. Over the three days we covered:

  • Database Mirroring (D)
  • Backup Compression
  • Peer-to-Peer Replication (D)
  • Transparent Data Encryption (D)
  • Extensible (Off-Box) Key Management
  • All Actions Audited (D)
  • Policy-Based Management
  • Resource Governor (D)
  • Extended Events (D)
  • Spatial Indexes
  • Integrated Full-Text Search
  • Sparse Columns (D)
  • Filtered Indexes
  • Change Tracking
  • Change Data Capture (D)
  • FILESTREAM (D)
  • Performance Data Collection
  • Query Optimizer Enhancements
  • Data Compression (D)
  • Service Broker
  • Partition-Level Lock Escalation (D)

The features marked with a (D) are ones I demo'd during the course (Kimberly demo'd a bunch of the others - especially the tools features). Some of the demos were challenging to make work in time as we only got a pre-CTP6 build mid-January just before we headed off to China.

So why am I posting this? Well, a bunch of these features are in CTP-6, which should be just around the corner, and I have some easy-to-understand demos of them that I'll be posting here over the next month or so. Also, if this course sounds interesting, Kimberly and I will be teaching it in various configurations over the next year - starting with SQL Connections in April, a soon-to-be-announced class in Iceland in March, and the ITPro portion of TechEd in June.

Watch this space starting next week (today's the last day of six straight weeks of teaching for us so this weekend's a break :-))

Back in October 2007 I blogged about partition-level lock escalation in SQL Server 2008 (see here) and I promised to do a follow-up once CTP-5 came out with syntax etc. So here it is.

A brief recap - lock escalation in SQL Server 2005 and before only allowed table-level lock escalation. If you have a partitioned table with queries going against different partitions, then table-level escalation is a pain because the whole table is suddenly locked and concurrent queries against distinct partitions can't run. SQL Server 2008 gives the ability to escalate to a parttition lock, which won't affect the queries on the other partitions.

The lock escalation policy can only be set with ALTER TABLE after a table has been created, and the policy can only be set at the table level. The syntax is

ALTER TABLE TableName SET (LOCK_ESCALATION = TABLE | AUTO | DISABLE);

The options mean:

  • TABLE - escalation will always be to the table level. This is the default.
  • AUTO - escalation will be to the partition level if the table is partitioned; otherwise it will be to the table level
  • DISABLE - escalation will be disabled. This does not guarantee that it will NEVER occur - there are some cases where it is necessary (Books Online gives the example of scanning a heap in the SERIALIZABLE isolation level)

The only way I could find to check what the escalation policy for a table is set to is to use the sys.tables catalog view:

SELECT lock_escalation_desc FROM sys.mytables WHERE name = 'TableName';

Let's try it out. Here's a script that creates a database with an example table with 3 partitions. The partition ranges are negative infinity to 7999, 8000 to 15999, 16000 to positive infinity.

CREATE DATABASE LockEscalationTest;
GO

USE LockEscalationTest;
GO

-- Create three partitions: -7999, 8000-15999, 16000+
CREATE PARTITION FUNCTION MyPartitionFunction (INT) AS RANGE RIGHT FOR VALUES (8000, 16000);
GO

CREATE PARTITION SCHEME MyPartitionScheme AS PARTITION MyPartitionFunction
ALL TO ([PRIMARY]);
GO

-- Create a partitioned table
CREATE TABLE MyPartitionedTable (c1 INT);
GO

CREATE CLUSTERED INDEX MPT_Clust ON MyPartitionedTable (c1)
ON MyPartitionScheme (c1);
GO

-- Fill the table
SET NOCOUNT ON;
GO

DECLARE @a INT = 1;
WHILE (@a < 17000)
BEGIN
INSERT INTO MyPartitionedTable VALUES (@a);
SELECT @a = @a + 1;
END;
GO

Now I'm going to explicitly set the escalation to TABLE and start a transaction that should cause lock escalation.

ALTER TABLE MyPartitionedTable SET (LOCK_ESCALATION = TABLE);
GO

BEGIN TRAN
UPDATE
MyPartitionedTable SET c1 = c1 WHERE c1 < 7500;
GO

We should be able to see the locks being held:

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE';
GO

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    X              LOCK           GRANT

Just as we expected - an X table lock. Trying any query against the table fails now. Now I'll rollback that transaction, set the escalation to partition-level and try again.

ROLLBACK TRAN;
GO

ALTER TABLE MyPartitionedTable SET (LOCK_ESCALATION = AUTO);
GO

BEGIN TRAN
UPDATE
MyPartitionedTable SET c1 = c1 WHERE c1 < 7500;
GO

SELECT [partition_id], [object_id], [index_id], [partition_number]
FROM sys.partitions WHERE object_id = OBJECT_ID ('MyPartitionedTable');
GO

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE';
GO

partition_id         object_id   index_id    partition_number
-------------------- ----------- ----------- ----------------
72057594039042048    2105058535  1           1
72057594039107584    2105058535  1           2
72057594039173120    2105058535  1           3

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
HOBT            72057594039042048             X              LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT

Excellent - the object lock is now IX rather than X, and the X lock is at the partition (HOBT) level for partition 1 (see the bold highlighting to match the partition ID with the lock resource). (For an explanation of HOBTs, see my post Inside The Storage Engine: IAM pages, IAM chains, and allocation units.) So now we should be able to do something with another partition - let's see if we can cause another partition level X lock in another connection:

USE LockEscalationTest;
GO

BEGIN TRAN
UPDATE
MyPartitionedTable set c1 = c1 WHERE c1 > 8100 AND c1 < 15900
;
GO

SELECT [partition_id], [object_id], [index_id], [partition_number]
FROM sys.partitions WHERE object_id = OBJECT_ID ('MyPartitionedTable'
);
GO

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE'
;
GO

partition_id         object_id   index_id    partition_number
-------------------- ----------- ----------- ----------------
72057594039042048    2105058535  1           1
72057594039107584    2105058535  1           2
72057594039173120    2105058535  1           3

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
HOBT            72057594039107584             X              LOCK           GRANT
HOBT            72057594039042048             X              LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT

Now we have two partition X locks, for partitions 1 and 2 (as expected - use the color coding above to match up the IDs), plus two table-level IX locks (one for each  connection, as expected). Very cool!

Now I'm going to force a deadlock - by having each connection try to read a row from the other locked partition:

Connection 1:

SELECT * FROM MyPartitionedTable WHERE c1 = 8500;
GO

Conneciton 2:

SELECT * FROM MyPartitionedTable WHERE c1 = 100;
GO

Connection 2 succeeds but on connection 1 we get (as expected):

(local)\SQLDEV01(SQLHAVPC\Administrator): Msg 1205, Level 13, State 18, Line 1
Transaction (Process ID 51) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

This illustrates a potential problem with this new mechanism - applications that used to rely on the blocking nature of X table locks may now exhibit deadlocks if partition-level escalation is turned on in production without any testing. In fact, this mode was specifically chosen NOT to be the default setting for new tables because some trial workloads exhibited deadlocks during testing. Don't just turn it on in production without testing - as with any other option or feature.

A quickie today to get back into the swing of things.

In Kimberly's whitepaper on partitioning she discusses the 'sliding window' scenario (where you switch in and out partitions of data into an existing table - see this previous post for a few more details). She recommends that the constraints are extended rather than dropped and recreated - which I totally agree with. I had a question about why this is a best practice, and is it more efficient than dropping and recreating the constraints?

Let's create a little example to illustrate all these points. A simple table called Sales with a couple of indexes and 100000 rows of data.

CREATE TABLE Sales (salesID INT IDENTITY, SalesDate DATETIME);
GO

CREATE CLUSTERED INDEX Sales_CL ON Sales (SalesID);
CREATE NONCLUSTERED INDEX Sales_NCL ON Sales (SalesDate
);
GO

SET NOCOUNT ON;
GO

DECLARE @a INT;
SELECT @a = 1
;
WHILE (@a < 100000
)
BEGIN

INSERT INTO Sales VALUES (GETDATE ());
SELECT @a = @a + 1;

END;
GO

Now I want to create two constraints - for the lower and upper bounds of the sales date. I could do this using a single constraint with both conditions or two constraints with a single condition each. For simplicity I'll use one constraint, but first I want to see how expensive the operation is, so I'm going to turn on STATISTICS IO - this is a very cool feature that gives the IO costs of a query after it's completed.

SET STATISTICS IO ON;
GO

ALTER TABLE Sales WITH CHECK ADD CONSTRAINT [CK_Sales_SalesDate] CHECK (YEAR (SalesDate) >= 2005 AND YEAR (SalesDate) < 2008);
GO

Table 'Sales'. Scan count 1, logical reads 399, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The 399 logical reads are for the table scan that's done to ensure that the constraint is valid for the data currently in the table.

So - the first question is why not drop/create the constraint to update it? Well, what if invalid data is entered into the table between dropping and recreating the constraint?

ALTER TABLE Sales DROP CONSTRAINT [CK_Sales_SalesDate];
GO

ALTER TABLE Sales WITH CHECK ADD CONSTRAINT [CK_Sales_SalesDate] CHECK (YEAR (SalesDate) >= 2005 AND YEAR (SalesDate) < 2009);
GO

Msg 547, Level 16, State 0, Line 1

The ALTER TABLE statement conflicted with the CHECK constraint "CK_Sales_SalesDate". The conflict occurred in database "ConstraintTest", table "dbo.Sales", column 'SalesDate'.

The constraint can't be recreated and you have to find the invalid data and get rid of it - which may not be as easy as it sounds depending on your schema and business logic. So, the best practice is always to update a constraint. Well, you can't update an existing constraint but you can create a new constraint with different conditions and then drop the old constraint:

ALTER TABLE Sales WITH CHECK ADD CONSTRAINT [CK_Sales_SalesDate2] CHECK (YEAR (SalesDate) >= 2005 AND YEAR (SalesDate) < 2009);
ALTER TABLE Sales DROP CONSTRAINT [CK_Sales_SalesDate]
;
GO

Table 'Sales'. Scan count 1, logical reads 399, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The second question is which way is more efficient? The answer is neither. Dropping and recreating the constraint will obviously do a table scan again, but so does adding the new constraint - even though there's a trusted constraint in place already which guarantees that adding the new constraint can't possibly fail!!! Hopefully in the future the smarts will be built into the SQL Engine to recognize this and not do the unnecessary table scan (this process is known as interval subsumption - according to my geeky wife :-))

This is a quick answer to a question I was sent today by someone who'd read Kimberly's partitioning whitepaper - Partitioned Tables and Indexes in SQL Server 2005 - and is implementing a "sliding-window" scenario. (This is a mechanism to allow fast insertion and deletion of significant portions of data into/from a partitioned production table. Insertion is done by taking a table and making it a new partition of the production table - called switching-in. Deletion is done by removing a partition from the production table and making it into a stand-alone table - called switching-out.)

The question is - what indexes are required on the staging table to prevent the ALTER TABLE ... SWITCH PARTITION statement from failing with a message like that below:

Msg 4947, Level 16, State 1, Line 1
ALTER TABLE SWITCH statement failed. There is no identical index in source table 'PartitionTest.dbo.StagingTable' for the index 'NC_Birthday' in target table 'PartitionTest.dbo.ProductionTable'.

The answer is that the staging table has to have the exact same indexes - clustered and non-clustered - as the production table. I asked Kimberly if it has to have the same constraints too - the answer is yes, plus the staging table has to have a trusted constraint on it such that SQL Server can tell (without checking all the data in the staging table) that all the data satisfies the partitioning function for the partition that you're switching-in (i.e. the partition that the staging table will become in the production table). If it doesn't, the switching-in will fail with the following error:

Msg 4982, Level 16, State 1, Line 1
ALTER TABLE SWITCH statement failed. Check constraints of source table 'PartitionTest.dbo.StagingTable' allow values that are not allowed by range defined by partition 4 on target table 'PartitionTest.dbo.ProductionTable'.

One thing that confuses people is that SQL Server does not create the target table for you when doing a switch-out of a partition. The target table has to exist and have the exact same schema as the production table. Also, it has to be completely empty - otherwise you'll get an error like:

Msg 4905, Level 16, State 1, Line 1
ALTER TABLE SWITCH statement failed. The target table 'PartitionTest.dbo.StagingTable' must be empty.

The must-be-empty requirement also holds for switching-in operations - the partition that will be created has to be empty otherwise a similar 4904 error results.

Hope this helps!

This is a combo from some previously posted material, with some more DBCC PAGE output thrown in.

IAM pages

An IAM (Index Allocation Map) page tracks approximately 4GB worth of space in a single file, aligned on a 4GB boundary. These 4GB chunks are called 'GAM intervals'. An IAM page tracks which extents within that specific GAM interval belongs to a single entity (I'm chosing my words carefully here and not using any word that has SQL Server connotations like 'object').

An IAM page can only track the space for a single GAM interval in a single file so If the database has multiple files, or some files are more then 4GB, and the entity has space allocated from multiple files or multiple GAM intervals within a file, then you can see how multiple IAM pages are needed for each entity to track all the space that its using. If an entity requires multiple IAM pages to track all its extents, then they IAM page have to be linked together. That's where an IAM chain comes in. More on these below.

Each IAM page has two records, an IAM page header and a bitmap. Let's look at one with DBCC PAGE. I'm using the database from the page split post. Doing a DBCC IND on the table we created gives us:

SEQA3.jpg

By looking at the PageType column, we can see that there's an IAM page (a page with type 10 - see the post on Anatomy of a page for more details) with page ID (1:152):

DBCC TRACEON (3604);

GO

DBCC PAGE ('pagesplittest', 1, 152, 3);

GO

m_pageId = (1:152)                   m_headerVersion = 1                  m_type = 10
m_typeFlagBits = 0x0                 m_level = 0                          m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 68     m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594042384384
Metadata: PartitionId = 72057594038386688                                 Metadata: IndexId = 1
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 90                         m_slotCnt = 2                        m_freeCnt = 6
m_freeData = 8182                    m_reservedCnt = 0                    m_lsn = (18:116:13)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = -1947725876

Allocation Status

GAM (1:2) = ALLOCATED                SGAM (1:3) = ALLOCATED
PFS (1:1) = 0x70 IAM_PG MIXED_EXT ALLOCATED   0_PCT_FULL                  DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

IAM: Header @0x620CC064 Slot 0, Offset 96

sequenceNumber = 0                   status = 0x0                         objectId = 0
indexId = 0                          page_count = 0                       start_pg = (1:0)


IAM: Single Page Allocations @0x620CC08E

Slot 0 = (1:143)                     Slot 1 = (1:153)                     Slot 2 = (1:154)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)


IAM: Extent Alloc Status Slot 1 @0x620CC0C2

(1:0)        - (1:272)      = NOT ALLOCATED

Some things to note about the page header itself:

  • The page has type 10, as we'd expect
  • The previous and next page pointers are NULL, because there aren't any other IAM pages in this IAM chain
  • The slot count is 2 - one for the IAM header record and one for the bitmap itself
  • The page is almost entirely full

The IAM page header has the following fields:

  • sequenceNumber
    • This is the position of the IAM page in the IAM chain. This increases by one for each page added to the IAM chain.
  • status
    • This is unused.
  • objectId
  • indexId
    • On SQL Server 2000 and before, these are the object  and index IDS that the IAM page is part of. On SQL Server 2005 and later they are unused.
  • page_count
    • This is unused  - it used to be the number of page IDs that are being tracked in the single page allocation array.
  • start_pg
    • This is the GAM interval that the page maps. It stores the first page ID in the mapped interval.
  • Single Page Allocations array
    • These are the pages that have been allocated from mixed extents. This array is only used in the first IAM page in the chain (as the whole IAM chain only need to track at most 8 single-page allocations).

The bitmap occupies the rest of the IAM page and has a bit for each extent in the GAM interval. The bit is set if the extent is allocated to the entity, and clear if it is not. Obviously two IAM pages that map the same GAM interval for different entities cannot both have the same bit set - this is checked by DBCC CHECKDB. In the output from DBCC PAGE above, you can see that there are no extents allocated to the table. You'll notice that the output only goes up to the extent starting at page 272 in the file - this is because the data file is only that big. I inserted a bunch more rows into the table and did another DBCC PAGE of the IAM page. This time the DBCC PAGE output contains:

IAM: Single Page Allocations @0x620CC08E

Slot 0 = (1:143)                     Slot 1 = (1:153)                     Slot 2 = (1:154)
Slot 3 = (1:155)                     Slot 4 = (1:156)                     Slot 5 = (1:157)
Slot 6 = (1:158)                     Slot 7 = (1:159)


IAM: Extent Alloc Status Slot 1 @0x620CC0C2

(1:0)        - (1:152)      = NOT ALLOCATED
(1:160)      - (1:296)      =     ALLOCATED
(1:304)      - (1:400)      = NOT ALLOCATED

You can see that the entire single-page allocation array is full and then allocations switched to dedicated extents. The first available extent must have been the one starting at page 160 and all extents up to an including the one starting at page 296 are now allocated. Note also that the file must have grown because the output now goes up to page 400 in the file.

A couple more things to note about IAM pages:

  • There are themselves single-page allocations from mixed extents and are not tracked anywhere
  • They can be allocated from any file to track extents in any other file

IAM chains

If we continued to grow the file and fill up the table then eventually we'd need another IAM page to map the next GAM interval. This is where an IAM chain comes in. It's a linked-list of IAM pages that track the space allocated to a single entity. The linked-list is not sorted at all - IAM pages are appended to it in the order that they're needed. The IAM pages within the list are numbered, again, in the order that they were appended to the list.

Definition of 'entity' - what uses an IAM chain? This is vastly different in SQL Server 2000 and 2005.

In SQL Server 2000, a single IAM chain is used for each:

  • Heap or clustered index
    • A table can only have one or the other, not both. These have index IDs of 0 and 1 respectively.
  • Non-clustered index
    • These have index IDs from 2 to 250 (i.e. you can only have 249 of them)
  • Table's complete LOB storage
    • For LOB columns (text, ntext, image) in the heap or clustered index. This is sometimes called the 'text index' and has a fixed index ID of 255.

This gives a maximum of 251 IAM chains per object in SQL Server 2000 and before. I usually generalize and say that in SQL Server 2000, there's one IAM chain per index (which fits nicely if you remember that IAM stands for Index Allocation Map).

Allocation units (SQL Server 2005 and later)

Now in SQL Server 2005 and later, things have changed a lot. IAM chains and IAM pages are exactly the same, but what they correspond to is different. A table can now have up to 750000 IAM chains! There are now three things that IAM chains map space allocations for:

  1. heaps and b-trees (a b-tree is the internal structure used to store an index)
  2.  LOB data
  3. row-overflow data

and we now call these units of space allocation tracking allocation units. The internal names for these three types of allocation unit are (respectively):

  1. hobt allocation unit (Heap Or B-Tree, pronounced 'hobbit', yes, as in Lord Of The Rings)
  2. LOB allocation unit
  3. SLOB allocation unit (Small-LOB or Short-LOB)

and the external names are, respectively:

  1. IN_ROW_DATA allocation unit
  2. LOB_DATA allocation unit
  3. ROW_OVERFLOW_DATA allocation unit

They couldn't really continue to be called IAM chains, because they're no longer tracking space allocation for an index. However, they're chain of IAM pages is still called an IAM chain, and the unit of tracking is now called an allocation unit. Apart from that, there's no difference.

Let's have a quick look at three new features in SQL Server 2005 that made these changes necessary and boosted the number of potential IAM chains per table.

Included Columns
This is the ability for non-clustered indexes to include non-key columns at the leaf-level. This is useful for three reasons:

  1. Iit allows a non-clustered index to truly cover a query where the query results include more than 16 columns or the combination of column lengths in the query results is greater than 900 bytes (remember that a non-clustered index key is limited to 16 columns and 900 bytes).
  2. It allows columns to be include in the non-clustered index that have data types that cannot be part of an index key (e.g. varchar(max) or XML).
  3. It allows a non-clustered index to cover a query without having to have all the columns included in the index key. As the index key is included in rows at all levels of the b-tree, this allows the index to be smaller.

An example of space saving: imagine a 100 million row index, with a key length of 900 bytes, but only the first two integer keys are really needed as the index key, the other 4 fixed-length columns could be stored in the index as included columns. With the 900 byte key, 8 rows can fit per database page (i.e. the fanout is 8). This means there will be 12500000 pages at the leaf level, 1562500 pages at the next level up in the b-tree and so on, giving a total of 12500000 + 1562500 + 195313 + 24415 + 3052 + 382 + 48 + 6 + 1 = 14285717 pages (including 1785717 to store the upper levels of the b-tree).

If we go with the included columns method then the key size shrinks to 8 bytes, and with the row overhead we can get the row length in the upper levels of the b-tree down to 15 bytes (giving a fanout of approx. 537). Note that the fanout at the leaf-level is still going to be 8,  because the amount of data stored in each row at the leaf-level is the same. So, this means there will be 12500000 pages at the leaf level, 23278 pages at the next level up and so on, giving a total of 12500000 + 23278 + 44 + 1 = 12523323 pages (including 23323 to store the upper levels of the b-tree). Compared to the full-size 900-byte key, this is a 12% saving of 1762394 pages, or 13.6GB! Obviously this is a contrived case but you can see how the savings can occur.

The main reason for adding this feature it to enable true covering queries. A covering query is one where the query optimizer knows it can get all the query results from the non-clustered index and so the query can be satisfied without having to incur the extra IOs of looking up data in the base table - a significant performance saving.

Now that non-clustered indexes can have included columns, and those columns can be LOB data types (but only the new ones in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), and XML). This means that having a single LOB allocation unit (as in the case of the single text index in SQL Server 2000) isn't possible any more because each index may have its own set of LOBs. Now, you may ask why there isn't just a single set of LOBs with multiple references from various indexes plus the base table. We considered that during SQL Server 2005 development but it would have made things a lot more complicated.

So, with this new feature, each index needs two allocation units - one for the data or index rows (the hobt allocation unit) and one for any LOB data.

Large Rows

One of the things that has plagued schema designers for a long time is the 8060 byte limit on table row sizes so this restriction was removed in SQL Server 2005. The way this is done is to allow variable-length columns (e.g. varchar, sqlvariant) to get pushed off-row when the row size gets too big to fit on a single page.

But where do these column values get pushed to? They're effectively turned into mini LOB columns. The column value in the row is replaced with a 16-byte pointer to the off-row column value, which is stored as if its a LOB value in a seperate allocation unit - the row-overflow (or SLOB) allocation unit. These values are stored in text pages in exactly the same way as regular LOB values are, just using a separate allocation unit. The SLOB allocation unit is only created when the first column value is pushed off-row.

This feature works for non-clustered indexes too - if you consider the ability to have included columns in non-clustered indexes then you could easily have non-clustered index rows that won't fit on a page. It would have been short-sighted of to get rid of the 900-byte limit and replace it with an 8060-byte limit by not extending the row-overflow feature to non-clustered indexes too.

Now with the addition of this new feature, each index can have up to three allocation units - hobt, LOB, and SLOB. Even with this, that only makes a maximum of 750 IAM chains per table (remember an IAM chain now maps the storage allocations for an allocation unit, so 250 indexes * 3 allocation units = 750 IAM chains). But I mentioned 750 thousand IAM chains per table earlier - where do all the rest come from?

Partitioning

This is what gives us the 1000x multiplier. As you may already know, partitioning is the new feature that allows tables and indexes to be split into a series of ranges, with each range stored separately (most commonly in seperate filegroups). Partitioning is a topic for a separate post.

If each range or partition of the table or index is stored seperately, then each is going to need its own hobt allocation unit. Of course, the LOB values associated with each partition need to be stored with it, and so each partition also needs a LOB allocation unit. Also, the row-overflow feature is per-row, and so rows in each partition will overflow into SLOB allocation units just as for un-partitioned tables and indexes. Thus each partition of a table or index can have up to three allocation units (and hence three IAM chains).

Still, where does that 1000 come in? Each table or index can have up to 1000 partitions. This gives us 250 indexes x 1000 partitions x 3 allocation units = 750000 IAM chains. Realistically this probably won't happen, but it is possible.

In SQL Server 2005, queries over partitioned tables use a single-thread per partition. This can cause performance problems under certain circumstances:

  1. On systems with many CPUs, if the table schema has less partitions than there are CPUs, then not all the CPUs will be used to process the query. Some examples:
    1. On a 32-way box, a query over a 12-partition table (e.g. a sales table partitioned by month) will only use 12 threads (one on each of 12 CPUs). This means 20 CPUs are potentially idle.
    2. On a 4-way box, a query over a 12-partition table only accesses a single partition so will only use one thread (on one CPU). This means 3 CPUs are potentially idle.
  2. On tables that have skewed data such that one partition is much larger than another, the length of time the query takes to complete will be bounded by the single thread processing the largest partition.

As part of the set of improvements in SQL Server 2008 for data warehousing there will be an option to change the threading behavior for queries over partitioned tables. The new, alternative model is that all available threads process part of each partition and then move into the next partition. This allows all available CPUs to take part in processing the query, which should lead to a drop in the query completion time.

The only time this model won't work is if the data is not in the buffer pool and is not spread out evenly across the available drives. For example, if an entire partition is stored on a single drive, then multiple threads will be scanning different portions of the drive, causing the disk head to thrash and IO throughput to drop sharply compared with a single thread driving the IO. For this reason, the option to use the new model will be off by default, to avoid surprising people with sudden bad performance after upgrading.

This should be available in the next CTP and then I'll post again with some example datasets and queries to see what the potential benefits and drawbacks are.

Theme design by Nukeation based on Jelle Druyts