This is a quick post to clarify an article I saw on SQLServerCentral this morning that seemed to state that transaction rollbacks push data into database snapshots. This is absolutely not true.

A database page is copied into a database snapshot before it is changed in the source database. Although the mechanism is commonly called copy-on-write, it's more technically accurate to call it copy-before-write (but this makes it a bit harder to understand for many people). Once a page has been copied into the database snapshot, it is never removed from the database snapshot, and won't ever be copied into it again, as the database snapshot already has the correct point-in-time copy of the updated page. (For more info on database snapshots in general, see the Books Online entry Database Snapshots.)

A transaction makes one or more changes to the database, updating one or more pages. These pages will be copied into the database snapshot if they're not already there, so the pre-change image (that existed at the time the database snapshot was created) is preserved.

If the transaction rolls back, the rollback occurs by generating the reverse operations that the transaction performed and applying them to the database (e.g. an insert will be rolled back by the generation and application of a delete; an update will be rolled back by replacing the updated parts of the record with the pre-update values). I'll explain more about this in a future blog post.

These rollback operations will occur on the same pages that the initial transaction operations occured on. This means that no other pages will be changed by the rollback operations and so no further pages will be copied into the database snapshot by a rollback. Pages can't be removed from the database snapshot when the transaction rolls back because they have still changed in the source database (although the net effect of the transaction+rollback is no logical changes to the data, the page headers will have changed to have an updated Log Sequence Number on), and so the copy in the database snapshot is still required to preserve the point-in-time view of the database (at the physical level) as of the time the database snapshot was created.

I'll prove this to you with a simple script that you can play around with to convince yourself also.

USE master;
GO

DROP DATABASE SnapRollbackTest_Snapshot;
GO
DROP DATABASE SnapRollbackTest;
GO

CREATE DATABASE SnapRollbackTest;
GO
USE SnapRollbackTest;
GO

CREATE TABLE MyTable (c1 INT);
CREATE CLUSTERED INDEX MyTable_CL ON MyTable (c1);
GO

SET NOCOUNT ON;
GO

DECLARE @a INT;
SELECT @a = 1;
WHILE (@a < 100001)
BEGIN
    INSERT INTO MyTable (c1) VALUES (@a);
    SELECT @a = @a + 1;
END;
GO

CREATE DATABASE SnapRollbackTest_Snapshot ON
    (NAME = N'SnapRollbackTest', FILENAME = N'C:\SQLskills\test\SnapRollbackTest.mdfss')
AS SNAPSHOT OF SnapRollbackTest;
GO

-- Initial size
SELECT size_on_disk_bytes AS [Initial Size (bytes)] FROM sys.dm_io_virtual_file_stats (DB_ID ('SnapRollbackTest_Snapshot'), 1);
GO

-- Start transaction
BEGIN TRAN
GO
UPDATE MyTable SET c1 = 42;
GO
CHECKPOINT; -- to make sure absolutely everything is flushed to disk
GO

SELECT size_on_disk_bytes AS [After Transaction (bytes)] FROM sys.dm_io_virtual_file_stats (DB_ID ('SnapRollbackTest_Snapshot'), 1);
GO

-- Rollback
ROLLBACK TRAN;
GO

SELECT size_on_disk_bytes AS [After Rollback (bytes)] FROM sys.dm_io_virtual_file_stats (DB_ID ('SnapRollbackTest_Snapshot'), 1);
GO

Initial Size (bytes)
--------------------
196608

After Transaction (bytes)
-------------------------
1835008

After Rollback (bytes)
----------------------
1835008

You can clearly see that the size of the database snapshot did NOT increase at all because of the transaction rollback. Using the script above you can try this using a heap, clustered index, various combinations of row size and number of rows - the result will be the same - the database snapshot will not increase in size because of a transaction rollback. I tried a bunch of different combinations, all with the same result.

In the back of my head there's a niggly feeling that there's a funky, rare, pathalogical case where some weird combination of operations results in a page split when rolled-back, but I can't engineer it.

Bottom line - transaction rollbacks do not cause the database snapshot to increase in size, as the rollback operates on the database pages that have already been copied into the snapshot because they changed due to the operations of the transaction itself.

Hope this helps!

PS As one of the commenters pointed out, the initial snapshot size can be affected by the crash-recovery that is run when the snapshot is created. I go into details on that process in this post as it can be confusing when CHECKDB runs.

(Been a while longer than usual since blog posts - I've been really busy flying around the country doing stuff onsite with clients. Normal service will be resumed at the start of July :-)

One of my biggest hot-buttons is around shrinking data files. Although I used to own the shrink code while I was at Microsoft, I never had a chance to rewrite it so that data file shrink is a more palatable operation. I really don't like shrink.

Now, don't confuse shrinking the transaction log with shrinking data files. Shrinking the log is necessary if your log has grown out of control, or as part of a process to remove excessive VLF fragmentation (see Kimberly's excellent posts on this here and here). However, shrinking the log should be a rare operation and should not be part of any regular maintenance you perform.

Shrinking of data files should be performed even more rarely, if at all. Here's why - data file shrink causes *massive* index fragmentation. Let me demonstrate with a simple script you can run. The script below will create a data file, create a 10MB 'filler' table at the start of the data file, create a 10MB 'production' clustered index, drop the 'filler' table and then run a shrink to reclaim the space.

USE MASTER;
GO

IF DATABASEPROPERTYEX ('DBMaint2008', 'Version') > 0
 DROP DATABASE DBMaint2008;

CREATE DATABASE DBMaint2008;
GO
USE DBMaint2008;
GO

SET NOCOUNT ON;
GO

-- Create the 10MB filler table at the 'front' of the data file
CREATE TABLE FillerTable (c1 INT IDENTITY,  c2 CHAR (8000) DEFAULT 'filler');
GO

-- Fill up the filler table
INSERT INTO FillerTable DEFAULT VALUES;
GO 1280

-- Create the production table, which will be 'after' the filler table in the data file
CREATE TABLE ProdTable (c1 INT IDENTITY,  c2 CHAR (8000) DEFAULT 'production');
CREATE CLUSTERED INDEX prod_cl ON ProdTable (c1);
GO

INSERT INTO ProdTable DEFAULT VALUES;
GO 1280

-- check the fragmentation of the production table
SELECT [avg_fragmentation_in_percent] FROM sys.dm_db_index_physical_stats (
    DB_ID ('DBMaint2008'), OBJECT_ID ('ProdTable'), 1, NULL, 'LIMITED');
GO

-- drop the filler table, creating 10MB of free space at the 'front' of the data file
DROP TABLE FillerTable;
GO

-- shrink the database
DBCC SHRINKDATABASE (DBMaint2008);
GO

-- check the index fragmentation again
SELECT [avg_fragmentation_in_percent] FROM sys.dm_db_index_physical_stats (
    DB_ID ('DBMaint2008'), OBJECT_ID ('ProdTable'), 1, NULL, 'LIMITED');
GO

avg_fragmentation_in_percent
----------------------------
0.390625

DbId   FileId      CurrentSize MinimumSize UsedPages   EstimatedPages
------ ----------- ----------- ----------- ----------- --------------
6      1           1456        152         1448        1440
6      2           63          63          56          56

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

avg_fragmentation_in_percent
----------------------------
99.296875

Look at the output from the script! The logical fragmentation of the clustered index before the shrink is a near-perfect 0.4%. After the shrink, it's almost 100%. The shrink operation *completely* fragmented the index, removing any chance of efficient range scans on it by ensuring the all range-scan readahead I/Os will be single-page I/Os.

Why does this happen? A data file shrink operation works on a single file at a time, and uses the GAM bitmaps (see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps) to find the highest page allocated in the file. It then moves it as far towards the front of the file as it can, and so on, and so on. In the case above, it completely reversed the order of the clustered index, taking it from perfectly defragmented to perfectly fragmented.

The same code is used for DBCC SHRINKFILE, DBCC SHRINKDATABASE, and auto-shrink - they're equally as bad. As well as introducing index fragmentation, data file shrink also generates a lot of I/O, uses a lot of CPU, and generates *loads* of transaction log - as everything it does is fully logged.

Data file shrink should never be  part of regular maintenance, and you should NEVER, NEVER have auto-shrink enabled. I tried to have it removed from the product for SQL 2005 and SQL 2008 when I was in a position to do so - the only reason it's still there is for backwards compatibility. Don't fall into the trap of having a maintenance plan that rebuilds all indexes and then tries to reclaim the space required to rebuild the indexes by running a shrink - that's a zero-sum game where all you do is generate a log of transaction log for no actual gain in performance.

So what if you *do* need to run a shrink? For instance, if you've deleted a large proportion of a very large database and the database isn't likely to grow, or you need to empty a file before removing it?

The method I like to recommend is as follows:

  • Create a new filegroup
  • Move all affected tables and indexes into the new filegroup using the CREATE INDEX ... WITH (DROP_EXISTING) ON <filegroup> syntax, to move the tables and remove fragmentation from them at the same time
  • Drop the old filegroup that you were going to shrink anyway (or shrink it way down if its the primary filegroup)

Basically you need to provision some more space before you can shrink the old files, but it's a much cleaner mechanism.

If you absolutely have no choice and have to run a data file shrink operation, be aware that you're going to cause index fragmentation and you should take steps to remove it afterwards if it's going to cause performance problems. The only way to remove index fragmentation without causing data file growth again is to use DBCC INDEXDEFRAG or ALTER INDEX ... REORGANIZE. These commands only require a single 8KB page of extra space, instead of needing to build a whole new index in the case of an index rebuild operation.

Bottom line - try to avoid running data file shrink at all costs!

It's lunchtime for my class - time for a blog post! This is an interesting one that crops up every so often (just an hour ago on SQL Server Central) and is not known very widely at all.

There's a misconception that you cannot run DMVs in databases that have compatibility levels of 80 or less. It's not true.

Here's an example:

USE master;
GO
EXEC sp_dbcmptlevel AdventureWorks, 80;

USE AdventureWorks;
GO

SELECT [object_id], [index_id]
FROM sys.dm_db_index_physical_stats (
    DB_ID ('AdventureWorks'), NULL, NULL, NULL, NULL)
WHERE [avg_fragmentation_in_percent] > 30;
GO

You get the incredibly useful and explanatory message:

Msg 102, Level 15, State 1, Line 2
Incorrect syntax near '('.

What the heck does that mean? It really means that the query processor refused to recognize the DMV name because the database is running in 80 or lower compatibility level.

Here's the trick - execute the DMV in the context of database that's in 90 compatibility level or higher, and it will let you do it - even if you're actually targeting a database in a lower compatibility level.

USE master;
GO

SELECT [object_id], [index_id]
FROM sys.dm_db_index_physical_stats (
    DB_ID ('AdventureWorks'), NULL, NULL, NULL, NULL)
WHERE [avg_fragmentation_in_percent] > 30;
GO

object_id   index_id
----------- -----------
18099105    2
30623152    2
30623152    3
30623152    4
66099276    1
66099276    2

...

Note that I executed the DMV from the context of master, but targeted AdventureWorks, which is in the 80 compatibility level. 

Have fun!

Categories:
Misconceptions | Tools

There's still a widely held misconception that when properly in the FULL or BULK_LOGGED recovery models that full or differential backups can truncate the log. No. It *NEVER* happens. This is one of the reasons why I'm doing a whole spotlight session on this at PASS this year - the transaction log and its behavior is IMHO one of the most misunderstood parts of SQL Server.

Notice that I said 'when properly in the FULL or BULK_LOGGED recovery models'. If you switch recovery models to FULL or BULK_LOGGED, until you take the first full backup, you are still essentially in the SIMPLE recovery model, and so the log will truncate on checkpoint. Once you take that first full backup, you are then in I-will-manage-the-size-of-the-log-through-log-backups mode. After that, the ONLY thing that will allow the log to clear/truncate is a log backup, as long as nothing else requires those transaction log records.

If you're not familiar with the term 'log clearing' or 'log truncating', they mean exactly the same thing - part of the transaction log is marked as no longer needed and can be overwritten. Nothing is zeroed out, the log file size is not altered. Some background reading on this:

Earlier this week I was involved in a discussion about log backup size management and how to prevent a log backup following a maintenance operation to not contain details of the maintenance operation.

There's a very simple answer: you can't.

If you do an operation in the FULL or BULK_LOGGED recovery models, the next log backup will contain all information required to replay that operation. In the FULL recovery model, everything is fully logged, so the log backup will contain all the log records generated by the operation. In the BULK_LOGGED recovery model, you may perform a minimally-logged operation, which generates hardly any transaction log, but the next log *backup* will be about the same size as if the operation was fully logged - because the log backup will pick up all the data extents modified by the minimally-logged operation.

One point in the discussion was that if you're running in the FULL or BULK_LOGGED recovery models, and you do a full backup after the maintenance operation, and before the log backup, the full backup will contain all the changes made by the maintenance operation, yes, and will clear the log.

No. Never.

A log backup is *ALL* the log generated since the last log backup. If this were not the case, how would log shipping work? You could take a full backup on the log shipping primary and suddenly you've broken the log backup chain and log shipping breaks. No, this is not how things work. A full backup contains only enough transaction log necessary to be able to restore that database to a transactionally consistent time - the time at which the data reading portion of the full backup completed. I blogged about this extensively previously:

But you don't have to believe me - it's very simple to convince yourself. The following script will show you that a full backup has no effect on the transaction log. It does the following:

  • Create a database and put it into the FULL recovery model, with a full backup.
  • Create and populate and index.
  • Take log backup 1 (just to clear things out)
  • Rebuild the index.
  • Take log backup 2.
  • Rebuild the index.
  • Take a full backup.
  • Take log backup 3.

And we will see that log backup #3 is the same size as log backup #2. The full backup will make no difference whatsoever.

Here's the script:

USE master;
GO
DROP DATABASE LogBackupTest;
GO
CREATE DATABASE LogBackupTest;
GO
USE LogBackupTest;
GO

ALTER DATABASE LogBackupTest SET RECOVERY FULL;
GO
BACKUP DATABASE LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Full1.bak' WITH INIT;
GO

CREATE TABLE t1 (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX t1c1 ON t1 (c1);
GO
SET NOCOUNT ON;
GO
INSERT INTO t1 DEFAULT VALUES;
GO 1000

BACKUP LOG LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Log1.bak' WITH INIT;
GO

-- Rebuild the index to generate some log and get a baseline
ALTER INDEX t1c1 ON t1 REBUILD;
GO
BACKUP LOG LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Log2.bak' WITH INIT;
GO

-- Now do it again, but take a full backup before the log backup
ALTER INDEX t1c1 ON t1 REBUILD;
GO
BACKUP DATABASE LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Full2.bak' WITH INIT;
GO
BACKUP LOG LogBackupTest TO
DISK = 'C:\SQLskills\LogBackupTest_Log3.bak' WITH INIT;
GO

And here's the result:

 

Log backup #3 is the same size as log backup #2. It contains all the log generated since log backup #2 was taken. The full backup had no affect whatsoever, because that would break the log backup chain.

If you don't believe me, run the script yourself and you'll see. A full backup does not and cannot affect the transaction log.

This week's been a busy one on the forums and Twitter, with lots of interesting problems people are hitting. One of the things I've noticed is that there are lot's of misconceptions about running repair, so to round out Friday I'm going to run through a list of them for you. Here are the misconceptions, some of which I've had to argue several times with people and eventually resort to 'Look, I wrote the repair code, I'm sorry but you're wrong', which I hate doing:

  • Repair will not cause data loss. It depends. If you have to use REPAIR_ALLOW_DATA_LOSS, then you're going to lose data. That's why the option is named that - seriously.
  • Repair should be run as the default. No. Figure out what's wrong first before deciding what to do about it. If you've got a damaged 1TB clustered index, it's going to get rebuilt by repair. If you don't have an extra 1TB of disk space, it will fail, and then you're back to square one after hours of fruitless effort. You might be able to get away with doing something that doesn't involve taking the database (essentially) offline.
  • You can run repair without running DBCC CHECKDB. No. Repair is an option to one of the consistency-checking commands (DBCC CHECKALLOC, DBCC CHECKTABLE, or DBCC CHECKDB - note that DBCC CHECKFILEGROUP and DBCC CHECKCATALOG don't support repair).
  • As soon as you've run repair, everything's fine. No. You should always run DBCC CHECKDB a second time after running repair, to make sure the first repair fixed everything. Sometimes a corruption prevents some deeper checks being done, and when it's fixed, the next DBCC CHECKDB can run the deeper check and find more corruptions. I call this 'corruption masking'. The other reason is that repair probably just deleted some of your data. What effect is that going to have on the application? What if that's medical patient data? Or insurance records? Or back account details?
  • Repair can always fix everything. No. There are some things that DBCC CHECKDB cannot fix. See CHECKDB From Every Angle: Can CHECKDB repair everything? for the list.
  • Repair is safe to use on system databases. No. It cannot be used on master or tempdb because they cannot be put into single-user mode. You can do it on model, but it's unlikely to have an effect as there aren't any user tables in model (unless you create them) and system tables generally don't get repaired. You can run it on msdb, but it might have strange side-effects. See Is running repair on msdb safe?.
  • You can run repairs online. No. Repairs are always offline, in that the database must be in single-user mode.
  • REPAIR_REBUILD will fix everything. No. REPAIR_REBUILD only fixes problem in nonclustered indexes. In 2005 onwards, REPAIR_FAST does nothing at all.
  • Repairs on a repl Publisher propagate to the Subscribers. No. Anything done by repair is NOT marked for replication. You must reinitialize your Subscribers if you repair a Publisher.
  • Repairs always fix-up constraints. No. It has no idea that constraints exist. After repairing a database with constraints, you should run DBCC CHECKCONSTRAINT to make sure they're still valid.
  • Repairs try to save data. No. It doesn't go out of it's way to delete data, but it doesn't go out of it's way to save data in most cases. 'Delete what's broken and fix up all the links' is my sound-bite explanation of what repair does. Fix things as fast as possible and as provably correct as possible.
  • EMERGENCY mode repair will always work. No. I've seen cases where something broken in the file-system caused it to fail. Don't rely on repair.
  • You can undo repairs. It depends. If you started an explicit transaction, then you can roll everything back if you don't like what it did. People rarely do this though. EMERGENCY mode repair can never be rolled back.

Repair's a dangerous beast and should only be used as a last resort, or to bring a VVVLDB online again potentially much faster than a full restore, when a small amount of data loss can be tolerated. This is just a quick brain-dump of things people get wrong about repair.

Have a great weekend - tomrorow I'll report on last week's survey and kick off a new one. Cheers!

This is a quick follow-on from my Misconceptions around null bitmap size post.

The null bitmap is *always* present in a data record (i.e. records in a heap or the leaf-level of a clustered index), but is optional in index records if all the columns in the index records are not nullable. The misconception is around what happens when a new column is added to the table. The common misconception is that if you have 8 columns in the table (and hence 8 bits in the null bitmap), if you add a ninth column then SQL Server has to go update every record so the null bitmaps all contain 9 bits. (Same misconception applies to adding the 17th, 25th, 33rd, etc column).

This is usually not true. Let's consider the cases:

  • New column is nullable, with a NULL default. The table's metadata records the fact that the new column exists but may not be in the record. This is why the null bitmap also has a count of the number of columns in that particular record. SQL Server can work out whether a column is present in the record or not. So - this is NOT a size-of-data operation - the existing table records are not updated when the new column is added. The records will be updated only when they are updated for some other operation.
  • New column is nullable, with a non-NULL default. This IS a size-of-data operation. The non-NULL default forces all existing records to be updated when the column is added, and so the null bitmap will be updated too.
  • New column is not-nullable (obviously with a non-NULL default). This IS a size-of-data operation, for the same reasons as above.

Hope this helps.

This short post is prompted by a question that came in through Twitter - I *knew* it was worth joining and spending time on it (http://twitter.com/PaulRandal).

The (paraphrased) question is "can FILESTREAM data be stored remotely?". This has been confusing people, and neither FILESTREAM BOL nor my FILESTREAM whitepaper (see here) explicitly answer the question.

The FILESTREAM data container for a database must be placed on an NTFS volume on locally-connected storage. Just like database data and log files, the directory cannot be on a UNC share. The confusion comes from the fact that FILESTREAM data *can* be *accessed* remotely through a UNC share - but as far as the host instance is concerned, the FILESTREAM storage must be local.

A second question that came up a while ago is whether FILESTREAM data containers can share the same directory or be nested. The answers are kind-of, and no, respectively. Let's see.

I'll create the first database with a FILESTREAM data container:

CREATE DATABASE FileStreamTestDB1 ON PRIMARY
    (NAME = FileStreamTestDB1_data, FILENAME = N'C:\SQLskills\FSTestDB1_data.mdf'),
FILEGROUP FileStreamFileGroup CONTAINS FILESTREAM
    (NAME = FileStreamTestDB1Documents, FILENAME = N'C:\SQLskills\FSDC\Documents')
LOG ON
    (NAME = FileStreamTestDB1_log, FILENAME = N'C:\SQLskills\FSTestDB1_log.ldf');
GO
 

And now let's try another database with the same parent directory:

CREATE DATABASE FileStreamTestDB2 ON PRIMARY
    (NAME = FileStreamTestDB2_data, FILENAME = N'C:\SQLskills\FSTestDB2_data.mdf'),
FILEGROUP FileStreamFileGroup CONTAINS FILESTREAM
    (NAME = FileStreamTestDB2Documents, FILENAME = N'C:\SQLskills\FSDC\Documents2')
LOG ON
    (NAME = FileStreamTestDB2_log, FILENAME = N'C:\SQLskills\FSTestDB2_log.ldf');
GO
 

This works fine. You can't have another database use the *same* directory as the first database (i.e. N'C:\SQLskills\FSDC\Documents'), but two FILESTREAM data containers can have the same parent directory.

And now let's try a nested one:

CREATE DATABASE FileStreamTestDB3 ON PRIMARY
    (NAME = FileStreamTestDB3_data, FILENAME = N'C:\SQLskills\FSTestDB3_data.mdf'),
FILEGROUP FileStreamFileGroup CONTAINS FILESTREAM
    (NAME = FileStreamTestDB3Documents, FILENAME = N'C:\SQLskills\FSDC\Documents\Documents3')
LOG ON
    (NAME = FileStreamTestDB3_log, FILENAME = N'C:\SQLskills\FSTestDB3_log.ldf');
GO

Msg 5136, Level 16, State 2, Line 1
The path specified by 'C:\SQLskills\FSDC\Documents\Documents3' cannot be used for a FILESTREAM container since it is contained in another FILESTREAM container.
Msg 1802, Level 16, State 2, Line 1
CREATE DATABASE failed. Some file names listed could not be created. Check related errors.

Doesn't work, as expected, as this is documented in BOL here.

Thanks

This has come up a few times now, most recently in an email question this morning - subsequent runs of DBCC CHECKDB show varying numbers of corruptions, and sometimes no corruptions - what's going on? Even more strange - a maintenance job runs a DBCC CHECKDB, which shows errors, but then in the morning - no consistency errors. What?

I answered this back in the April 2009 SQL Q&A column in TechNet Magazine, but I want to get it here on the blog too in a bit more detail. The answer has to do with the way the database is consistency checked, and how corruptions are detected.

In 2005 onwards, you're going to be using page checksums to help detect corruption. If you created the database on 2005 onwards, page checksums are enabled by default and every allocated page will have one. If you upgraded a database from 2000 or before, then you need to manually enable page checksums with ALTER DATABASE. The nothing happens. Until a page is read in, changed, and then written back out. So your upgraded database will have a mixture of nothing/page checksums, or torn-page detection/page checksums. Note: torn-page protected pages remain torn-page protected, even with page checksums enabled, until the next time they're altered. Then they get a page checksum. See Inside The Storage Engine: Does turning on page checksums discard any torn-page protection? for an explanation and examples.

Once you've got page checksums enabled, who can you tell if there are corruptions in the database? Well, there are a number of ways corruptions will show up:

  1. You run an operation that hits a page that has been corrupted, and the page checksum test fails
  2. You run a BACKUP ... WITH CHECKSUM and it finds a page with a bad checksum
  3. You run a DBCC CHECKDB and it finds a page with a bad checksum

That's all very well, but what if a page *doesn't* have a page checksum on it (because it hasn't been changed since page checksums were enabled)? None of #1 to #3 will fail because of a bad page checksum, as there isn't a page checksum to check. #1 might fail, depending on how corrupt the page is, and it will likely fail with an obscure message that doesn't immediately scream 'corruption'. #2 won't fail, as the only time BACKUP examines what it's backing up is when WITH CHECKSUM is enabled and a page has a page checksum on it. #3 might find the corruption, depending on how the page is corrupt. If the corruption is in the middle of a large varchar field, for instance, probably not. Your best bet is to have page checksums enabled and regularly run DBCC CHECKDB.

That's how corruptions are detected. So what about the disappearing corruptions? This gets into how consistency checks work. Consistency checks only run on the pages in the database that are allocated. If a page isn't allocated to anything, then the 8192 bytes of it are meaningless and can't be interpreted. Don't get confused between reserved and allocated - I explain that in the first misconceptions post here. As long as a page is allocated, it will be consistency checked by DBCC CHECKDB, including testing the page checksum, if it exists. A corruption can seem to 'disappear' if a corrupt page is allocated at the time a DBCC CHECKDB runs, but is then deallocated by the time the next DBCC CHECKDB runs. The first time it will be reported as corrupt, but the second time it's not allocated, so it isn't consistency checked and won't be reported as corrupt. The corruption looks like it's mysteriously vanished. But it hasn't - it's just that the corrupt page is no longer allocated. There's nothing stopping SQL Server deallocating a corrupt page - in fact, that's what many of the DBCC CHECKDB repairs do - deallocate what's broken, and fix up all the links.

The maintenance job phenomenon can occur because of the order of operations in the job. If the DBCC CHECKDB is first, and then there's an index rebuild, and the index rebuild happens to rebuild an index that DBCC CHECKDB had found a corruption in, then the *new* index will have a completely different set of database pages, and won't contain the corrupt page. Bingo - disappearing corruption. A subsequent DBCC CHECKDB might not find any corruption, because the previously corrupt pages are no longer allocated.

Bottom line - any time you get corruption error messages, 99.999% of the time it's your I/O subsystem that's got problems, even if the corruptions 'disappear'.

PS Don't forget to follow along on Twitter - http://twitter.com/PaulRandal

I'm starting a new series called 'Misconceptions' - a series of short posts that debunk some of the many myths and misconceptions that exist about the way SQL Server behaves. I actually already did the first post a couple of weeks back (Misconceptions around TF 1118) but just went back to re-tag it.

In this post I want to debunk the common myth that the null bitmap only contains bits for nullable columns. It doesn't - it has one bit per column in the table definition, as long as at least one column in the table is nullable. The 'unused' bits are always set to 1, which means 'null'.

And now the proof. I'm going to create two tables, with 10 columns each (meaning that the null bitmap has to have two bytes to store all the bits, plus two bytes for the count of columns in the record). The first table will have all nullable columns. The second will have the first 9 columns not nullable, and the tenth column nullable.

CREATE TABLE t1 (
    c1 INT, c2 INT, c3 INT, c4 INT, c5 INT,
    c6 INT, c7 INT, c8 INT, c9 INT, c10 INT);
GO

CREATE TABLE t2 (
    c1 INT NOT NULL, c2 INT NOT NULL, c3 INT NOT NULL,
    c4 INT NOT NULL, c5 INT NOT NULL, c6 INT NOT NULL,
    c7 INT NOT NULL, c8 INT NOT NULL, c9 INT NOT NULL,
    c10 INT);
GO

INSERT INTO t1 VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
GO

INSERT INTO t2 VALUES ( 1, 2, 3, 4, 5, 6, 7, 8, 9, NULL);
GO

And now let's look at the pages themselves. I'll get the pages involved using my sp_AllocationMetadata script (see Inside The Storage Engine: sp_AllocationMetadata - putting undocumented system catalog views to work):

EXEC sp_AllocationMetadata 't1';
EXEC sp_allocationMetadata 't2';
GO

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
------------ --------- ------------------ ---------------- ----------- ---------- ---------------
t1           0         72057594042318848  IN_ROW_DATA      (1:152)     (0:0)      (1:153)

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
------------ --------- ------------------ ---------------- ----------- ---------- ---------------
t2           0          2057594042384384  IN_ROW_DATA      (1:154)     (0:0)      (1:155)

And now dump them with DBCC PAGE to look at the row structure:

DBCC TRACEON (3604);
GO
DBCC PAGE ('NullTest', 1, 152, 3);
DBCC PAGE ('NullTest', 1, 154, 3);
GO

-- Record dump from t1, all ten columns nullable and not NULL

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x684EC060

00000000:   10002c00 01000000 02000000 03000000 †..,.............
00000010:   04000000 05000000 06000000 07000000 †................
00000020:   08000000 09000000 0a000000 0a0000fc †................

-- Record dump from t2, first nine columns not nullable, tenth column nullable and NULL

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP
Memory Dump @0x684EC060

00000000:   10002c00 01000000 02000000 03000000 †..,.............
00000010:   04000000 05000000 06000000 07000000 †................
00000020:   08000000 09000000 21212121 0a0000fe †........!!!!....

If you try this yourself, you may get a different bit pattern for the NULL c10 (I got 0x21212121). It's a bit of a crap-shoot depending on what memory SQL Server reuses to create the record in memory before writing to the page - the 4 bytes of the NULL column aren't overwritten from the previous usage of the memory (and don't need to be).

In the first record, you can see that the null bitmap (underlined and in bold) says there's 10 columns (the 0x0a00 gets byte-reversed into 0x000a) and the null bitmap is 0xfc00, which is 1111110000000000 - 6 unused bits all set to 1 and 10 not-NULL column values.

In the first record, you can see that the null bitmap (underlined and in bold) says there's 10 columns again and the null bitmap this time is 0xfe00, which is 1111111000000000 - 6 unused bits all set to 1, 1 column NULL and 9 columns not-NULL. Column 1 is the far right bit, so column 10 is the tenth from right (the first '1').

This clearly shows that all columns are represented in the null bitmap even if only one of the columns is nullable. It's the same if on the first column in the table is nullable, I'll leave that for you to prove to yourself.

Next up: the next misconception I come across!

There's been a recent flurry of confusion and misconceptions about trace flag 1118 in SQL Server 2008. This trace flag switches allocations in tempdb from single-page at a time for the first 8 pages, to immediately allocate an extent (8 pages). It's used to help alleviate allocation bitmap contention in tempdb under a heavy load of small temp table creation and deletion.

There are multiple points of confusion, which I'll address in turn. Then I'll prove that the trace flag still works in SQL Server 2008.

1) Why is the trace flag usually required in 2000? In SQL 2000, whenever a temp table is created in tempdb and a row inserted, an IAM page must be allocated and a single data page must be allocated. These two pages are both 'single-page' allocations, from a mixed extent (see Inside The Storage Engine: Anatomy of an extent for more info). This means that an SGAM allocation bitmap page must be accessed, and a PFS page must be accessed (see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps for more info).

With lots of very small temp tables being created, this means the very first SGAM page and the very first PFS page in the data file are accessed/changed by all the threads, leading to latch contention problems on these two pages. When the temp tables are deleted again, the various pages are deallocated, which again needs to access and change the PFS page, and potentially the SGAM page.

There are two ways to alleviate this problem. Firstly, create multiple data files in tempdb - which splits the latch contention over multiple allocation bitmaps (from having allocations come from multiple files) and thus reduces the contention. The general rule of thumb was one tempdb data file for each processor core. Secondly, turn on TF1118, which makes the first 8 data pages in the temp table come from a dedicated extent. This means one extent is allocated from the GAM page, rather than 8 single pages (and potentially 8 accesses to the SGAM page). The pages within the extent are reserved and allocated singly from this extent, as needed. This also cuts down on contention and is documented in KB 328551.

2) What does reserved vs. allocated mean? When an extent is allocated to a table, the 8 pages in the extent are not immediately allocated as well. Allocating an extent means those 8 pages are reserved exclusively for subsequent allocation to that table. The pages are allocated individually as needed, but no other table can allocate them. This is why such extents are called 'dedicated' extents (see my blog post link above for more details). You can see the counters of reserved pages vs. allocated pages in the output from sp_spaceused.

3) Why is the trace flag not required so much in 2005 and 2008? In SQL Server 2005, my team changed the allocation system for tempdb to reduce the possibility of contention. There is now a cache of temp tables. When a new temp table is created on a cold system (just after startup) it uses the same mechanism as for SQL 2000. When it is dropped though, instead of all the pages being deallocated completely, one IAM page and one data page are left allocated, and the temp table is put into a special cache. Subsequent temp table creations will look in the cache to see if they can just grab a pre-created temp table 'off the shelf'. If so, this avoids accessing the allocation bitmaps completely. The temp table cache isn't huge (I think it's 32 tables), but this can still lead to a *big* drop in latch contention in tempdb.

4) Does the trace flag still exist in 2005 and 2008? Yes it does - KB 328551 clearly states:

Note Trace flag -T1118 is also available and supported in Microsoft SQL Server 2005 and SQL Server 2008. However, if you are running SQL Server 2005 or SQL Server 2008, you do not have to apply any hotfix.

Just to make extra-sure (as I'm always paranoid about saying absolutes), I checked with my good friend Ryan Stonecipher, who's the dev lead for the team that owns allocation (and a bunch of other stuff, including DBCC). He confirmed the code is exactly the same in 2008 as it was in 2005. And I prove it to you below too.

5) And why is it sill there in 2005 and 2008? It does the same thing in 2005/2008 as it did in 2000. If the temp table creation/deletion workload is high enough, you can still see latch contention, as the temp table cache won't be enough to completely alleviate the need for creating actual new temp tables, rather than just being able to grab one 'off the shelf'. In that case, using the trace flag to change to extent-based allocation (in *exactly* the same way as for 2000) can help, as can creating more tempdb data files.

As far as data files go though, the number has changed. Instead of a 1-1 mapping between processor cores and tempdb data files (*IF* there's latch contention), now you don't need so many - so the recommendation from the SQL team is the number of data files should be 1/4 to 1/2 the number of processor cores (again, only *IF* you have latch contention). The SQL CAT team has also found that in 2005 and 2008, there's usually no gain from having more than 8 tempdb data files, even for systems with larger numbers of processor cores. Warning: generalization - your mileage may vary - don't post a comment saying this is wrong because your system benefits from 12 data files. It's a generalization, to which there are always exceptions.

6) Why does DBCC IND still show two pages, even with the trace flag on? I've heard of some people being confused by the output of DBCC IND in SQL 2008 when the trace flag is turned on. Creating a single row temp table will only show two pages allocated in the DBCC output - one IAM page and one data page. Yes, that's completely correct - as only two pages are allocated, but the data page comes from a dedicated extent, not a mixed extent. (IAM pages are *always* single-page allocations from mixed-extents).

And now the proof, on SQL 2008.

SELECT @@VERSION;
GO

Microsoft SQL Server 2008 (RTM) - 10.0.1600.22 (Intel X86)

First off, I'll create a temp table without the trace flag enabled, and see what pages the table has allocated, by looking at the first IAM. I'll use a temp table with an 8000+ byte row size, and insert two rows - so we have two data pages for clarity.

DBCC TRACEOFF (1118, -1);
GO

USE tempdb;
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

Now I'll figure out what is the first IAM page, using my sp_AllocationMetadata script (see here for the script and details), and dump it with DBCC PAGE to see the single-page allocations it's tracking, and which dedicated extents are allocated to the table:

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1152921505223016448  IN_ROW_DATA      (1:158)     (0:0)      (1:199)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 199, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A35C08E

Slot 0 = (1:158)                     Slot 1 = (1:200)                     Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A35C0C2

(1:0)        - (1:1016)     = NOT ALLOCATED   

As you can clearly see from the partial output of the dump of the IAM page, there are two single-page allocations and no extents allocated to the temp table. This is what should happen when the trace flag is not enabled.

Now I'll do the same thing with the trace flag 1118 enabled.

USE tempdb;
GO

DROP TABLE #temp;
GO

DBCC TRACEON (1118, -1);
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1224979099301904384  IN_ROW_DATA      (1:208)     (0:0)      (1:158)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 158, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A8FC08E

Slot 0 = (0:0)                       Slot 1 = (0:0)                       Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A8FC0C2

(1:0)        - (1:200)      = NOT ALLOCATED                              
(1:208)      -              =     ALLOCATED                              
(1:216)      - (1:1016)     = NOT ALLOCATED
                              

Now as you can clearly see, there are no single-page allocations, and there's a single extent allocated to the table. Proof that trace flag 1118 still does exactly what it should in SQL Server 2008.

Now for a DBCC IND on the table: 

DBCC IND ('tempdb', '#temp', -1);
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID    
------- ----------- ------ ----------- ----------- -----------
1       158         NULL   NULL        293576084   0
1       208         1      158         293576084   0
1       209         1      158         293576084   0

(I've removed some of the trailing columns for clarity.) We see that it still only lists the two data pages (1:208, 1:209) and the IAM page (1:158) - although an entire extent was allocated to the temp table, only two pages from the extent were actually allocated and used - the rest are reserved for use by that table, but remain unallocated. 

Hopefully this post has cleared up a lot of the confusion around this trace flag and what it does.

Theme design by Nukeation based on Jelle Druyts