Three weeks ago I kicked off a survey about using database snapshots - see here - because I don't really know how much they're used or not out in the wild. Here are the results:

 

The "Other" values are:

  • 13 x "No - not at the moment."
  • 10 x "In our test environment for reverting to the original database state after a test. Much faster than restoring every time."
  • 6 x "For development, to be able to go back to a previous known state after trying out new scripts and stuff."
  • 5 x "Haven't really looked at using them."
  • 5 x "No - just read about them recently and have yet to determine usefulness."
  • 3 x "Yes - every time I do a DBCC CHECKDB (okay, it's hidden, but still a snapshot)."
  • 2 x "During application installation in order to do an easy rollback if smoke testing fails."
  • 2 x "For rollback of code/data releases."
  • 2 x "Never used, but did recommend once as a pre-deployment option so that there was a faster rollback in case a developer's script broke once in Prod. Recommendation was not implemented."
  • 1 x "As failsafe for monthly production deployments."
  • 1 x "For EOM reporting copies of the production database."
  • 1 x "No. Had performance issues so we turned them off."
  • 1 x "Not supported in SQL2000."
  • 1 x "When adding articles in replication."
  • 1 x "Yes - For quick rollback during production changes."
  • 1 x "Yes for ETL, and via a linked server connection since snapshot isolation isn't an option."
  • 1 x "Yes, for instantaneous restore of development environment."
  • 1 x "Yes, rarely - for investigating issues on copy of production DB, way faster than restore."

Very interesting! More than 60% of respondents don't or can't use database snapshots. I've always thought that database snapshots are an under-utilized feature, I'm not entirely sure why though (apart from the obvious Enterprise Edition restriction). It could be some of the limitations of them, such as:

  • You can't refresh one without all connections to it being dropped (and then dropping and recreating it)
  • You can't back up a snapshot (would require some interesting plumbing in the Storage Engine, as currently backups don't use the buffer pool, and it's the buffer pool that manages what's in a database snapshot)
  • You can't modify a snapshot (i.e. like having a base VM with various diff disks)
  • They can have a performance impact on the production/mirror workload depending on how heavily they're used

I'm really glad that no-one confessed to using database snapshots as an alternative to log backups! See this TechNet Magazine Q&A column I wrote last year that explains why not.

Rather than repeat what's already been said about why database snapshots can be useful, I'm going to provide you with some links to further reading:

  • Top-level of Books Online on database snapshots
  • Books Online on typical uses of a database snapshot
  • Whitepaper on performance implications of using database snapshots
  • Blog post on what happens to a database snapshot when a transaction rolls back
  • Blog post on some of the problems you can see from DBCC CHECKDB using database snapshots under the covers
  • Blog post on a bunch of potential problems when a database snapshot fills up or becomes corrupt
  • Blog post about memory usage from a database snapshot

Enjoy!

Categories:
Database Snapshots | Surveys

As I was writing a quick SQL Server Magazine blog post on database snapshots while waiting for my coffee to brew, I figured it would make an interesting survey to see whether people are making use of database snapshots at all. I don't expect to see much activity on this until tomorrow as it's July 4th here in the USA.

Note: As Books Online says: don't get database snapshots confused with snapshot isolation, snapshot replication, or snapshot backups.

I'll report on the results in a few weeks or so.

Thanks!

Categories:
Database Snapshots | Surveys

The May edition of TechNet Magazine is available on the web now and has the latest installment of my regular SQL Q&A column.

This month's topics are:

  • How to change the file locations for a user database
  • The difference between page latches and page I/O latches
  • Why database snapshots are not a good substitute for transaction log backups
  • Why I/O subsystem-caused corruption does not propagate to a database mirror

Check it out at http://technet.microsoft.com/en-us/magazine/ff679223.aspx.

(Look in the Misconceptions blog category for the rest of the month's posts and check out the 60-page PDF with all the myths and misconceptions blog posts collected together: CommonSQLServerMyths.pdf (732.96 kb))

The month is finally over so time for the grand finale!

Although it's been fun debunking all these myths, it's been a tad stressful making sure I come up with an interesting and useful myth to debunk every day. I'd like to give kudos to fellow-MVP Glenn Berry (blog|twitter) who's been running an excellent DMV-a-Day series through April too!

To round out the month, I present to you 30 myths around backups - one for each day of the month of April. Last night I sat down to write this post and was a few myths short so reached out to the fabulous SQL community on Twitter (follow me!) for help - too many people to list (you know who you are) - I thank you!

A few folks have asked if I'll pull the month's posts into a PDF e-book - let me know if you'd like that.

I *really* hope you've enjoyed the series over the last month and have had a bunch of myths and misconceptions debunked once and for all - I know quite a few of you are going to use these explanations as ammunition against 3rd-party vendors, developers, and other DBAs who insist on incorrect practices.

Ok - here we go with the last one...

Myth #30: various myths around backups...

All are FALSE!!

For a good primer on understanding backups and how they work see my TechNet Magazine article Understanding SQL Server Backups. 

30-01) backup operations cause blocking

No. Backup operations do not take locks on user objects. Backups do cause a really heavy read load on the I/O subsystem so it might *look* like the workload is being blocked, but it isn't really. It's just being slowed down. There's a special case where a backup that has to pick up bulk-logged extents will take a file lock which could block a checkpoint operation - but DML is never blocked.

30-02) switching from the FULL recovery model to the BULK_LOGGED recovery model and back again breaks the log backup chain

No. It just doesn't. Switching from either FULL or BULK_LOGGED to SIMPLE *does* break the log backup chain however.

30-03) breaking the log backup chain requires a full backup to restart it

No. You can restart the log backup chain with either a full or differential backup - anything that bridges the LSN gap from the point at which the log backup chain was broken. See my blog post A SQL Server DBA myth a day: (20/30) restarting a log backup chain requires a full database backup for more details.

30-04) concurrent log backups are not possible while a full or differential backup is in progress 

No, this changed in SQL Server 2005. See my blog post Search Engine Q&A #16: Concurrent log and full backups.

30-05) a full or differential backup clears the log

No. A log backup includes all the log since the last log backup - nothing can change that - no matter whether that log was also backed up by a full or differential backup. I had a famous argument on Twitter last year and wrote this blog post as proof: Misconceptions around the log and log backups: how to convince yourself. In the FULL or BULK_LOGGED recovery models, the *only* thing that clears the log is a log backup.

30-06) using the BULK_LOGGED recovery model for minimally-logged operations reduces the size of the next transaction log backup

No. A minimally-logged operation is so-named because only the page allocations are logged. A log backup needs all the information necessary to resconstitute the transaction, so a log backup following a minimally-logged operation must backup the log plus all extents changed by the minimally-logged operation. This will result in the log backup being roughly the same size as if the operation was fully logged.

30-07) full and differential backups only contain the log generated while the backup was running

No. A full or differential backup contains enough log to be able to recover the database to a transactionally-consistent view of the database at the time the data-reading portion of the backup finished (or as far back as the oldest log record that transactional replication has not yet processed - to ensure that replication works properly after a restore). Check out these two blog posts for details:

30-08) backups always test existing page checksums

No. It only does it when you use the WITH CHECKSUM option - which you should.

30-09) backups read data through the buffer pool

No. The backup subsystem opens its own channels to the database files to avoid the performance hit of having to read everything into SQL Server's memory and back out to the backup device (and also effectively flushing the buffer pool in the process). If you ask the for page-checksum checking, it uses it's own small portion of memory.

30-10) backups perform consistency checks (a la DBCC CHECKDB)

No. Nothing else to say.

30-11) if the backup works, the restore will too

No. Please don't fall into this trap. You must regularly validate your backups to give yourself a high level of confidence that they will work if a disaster occurs. See Importance of validating backups for more details.

30-12) a mirrored backup will succeed if the mirror location becomes unavailable

No. If any one of the paths to a mirrored backup fails, the entire mirrored backup operation fails. I'd really like it to work the other way around - where the local backup succeeds and the remote backups fail, but it doesn't unfortunately.

30-13) a tail-of-the-log backup is always possible

No. A tail-of-the-log backup is one that backs up all the log generated since the last log backup, in an exceptional situation. If the data files are damaged, you can still do a tail-of-the-log backup EXCEPT if the un-backed-up log contains a minimally-logged operation. That would require reading data extents - which cannot be done if the data files are damaged. For this reason, the BULK_LOGGED recovery model should not be used on databases that have 24x7 user transactions.

30-14) you can use backups instead of DBCC CHECKDB

No. See A SQL Server DBA myth a day: (27/30) use BACKUP WITH CHECKSUM to replace DBCC CHECKDB.

30-15) you can backup a database snapshot

No. It's not implemented, but would be great if you could.

30-16) you can use database snapshots instead of log backups

No. A database snapshot is only usable while the database on which it is based is usable and online. If the source database is corrupted, the database snapshot most likely is too. If the source database goes suspect, so does the database snapshot.

Also, having multiple database snapshots (equating to multiple log backups) incurs an increasing performance drain - as every page that changes in the source database may need to be synchronously written to all existing snapshots before it can be written to the source database data files, and all existing database snapshots will grow as more pages are pushed into them.

30-17) log backups will be the size of the log

No. The log has to accomodate the space necessary to roll back active transactions, the amount of space returned by DBCC SQLPERF (LOGSPACE) on a busy system doesn't accurately refect the amount of log records in the log. This blog spot explains: Search Engine Q&A #25: Why isn't my log backup the same size as my log? And apart from that, a log backup is just all the log generated since the last log backup - not the whole log file usually - and if it happens to be, the first part of the explanation comes into play.

30-18) you cannot backup a corrupt database

No. In most cases you can use the WITH CONTINUE_AFTER_ERROR option to back up the corrupt database.  If that fails (maybe because of a damaged boot page or file-header page), there are no other options apart from OS-level file backups.

30-19) you cannot stop someone doing a BACKUP LOG .. WITH NO_LOG or TRUNCATE_ONLY operation

No. In SQL Server 2008 it's not possible any more (yay!) and in 2005 and before, use trace flag 3231 which turns the operation into a no-op.

30-20) log backups always clear the log

No.

If there's no concurrent data backup running, a log backup will always *try* to clear the log, and only succeed in clearing the inactive portion of the log - the log that's only considered 'required' by SQL Server because it hasn't yet been backed up. If anything else is holding the log 'required', it cannot be cleared, even though it has been backed up. Subsequent log backups will check again and again until the time comes when that portion of the log can be cleared. The TechNet Magazine article Understanding Logging and Recovery in SQL Server I wrote last year explains a lot more about how the log works.

Also, if there is a concurrent data backup running, the log clearing will be delayed until the data backup finishes. See the blog post in myth 30-04 for more details.

30-21) differential backups are incremental

No. Differential backups are all the data extents that have changed since the last full backup - so they are cumulative. Log backups are incremental - all log generated since the last log backup. Many people call differential backups 'incrementals', when they're not really.

30-22) once a backup completes, you can safely delete the previous one

No. No. No.

If you go to restore, and you find your full backup is corrupt, what do you do? Well, if you don't have an older full backup, you most likely start updating your resume. You need to keep a rolling-window of backups around in case a disaster occurs and you need to restore from an older set of backups.

30-23) you can back up a mirror database

No. A mirror database is not accessible except through a database snapshot. And you can't back up that either.

30-24) you can back up a single table

No. You can effectively back up single table if it happens to be wholely contained on a single filegroup, but there's no way to say BACKUP TABLE.

30-25) SQL Server has to be shut down to take a backup

No. No idea how this myth started... [Edit: apparently this myth started with Oracle - and we all know how good Oracle is compared to SQL Server... :-)]

30-26) my transaction is guaranteed to be contained in the backup if it committed before the backup operation completed

No. The commit log record for the transaction has to have been written out before the data-reading portion of the backup finished. See my blog post Search Engine Q&A #6: Using fn_dblog to tell if a transaction is contained in a backup for more details.

30-27) you should shrink the database before a backup to reduce the backup size

No. Shrink just moves pages around so won't make any difference. See my old blog post Conference Questions Pot-Pourri #10: Shrinking the database before taking a backup. And of course, shrink is evil. See A SQL Server DBA myth a day: (9/30) data file shrink does not affect performance. And what's even worse as someone reminded me, is if you do the shrink *after* the full backup, the next differential backup may be huge, for no actual data changes!

30-28) backups are always the best way to recover from a disaster

No. Backups are usually the best way to recover with zero data-loss (as long as you have log backups up to the point of the disaster), but not necessarily the best way to recover with minimal downtime. It may be way faster to fail over, or to run repair and accept some data loss if the business requirements allow it.

30-29) you don't need to back up master, msdb, model...

No. You should always back up the system databases. Master contains all the security info, what databases exist - msdb contains all the SSIS packages, Agent jobs, backup history - model contains the configuration for new databases. Don't fall into the trap of only backing up user databases otherwise you'll be in a world of hurt if you have to do a bare-metal install.

30-30) you should always plan a good backup strategy

No. Now you're thinking 'Huh?'...

You should plan a restore strategy. Use the business requirements and technical limitations to figure out what you need to be able to restore in what time, and then use that to figure out what backups you need to take to allow those restores to happen. See the blog posts:

The vast majority of the time people plan a backup strategy without testing or thinking about restores - and come a disaster, they can't restore within their SLAs. Don't let that be you.

(Look in the Misconceptions blog category for the rest of the month's posts and check out the 60-page PDF with all the myths and misconceptions blog posts collected together: CommonSQLServerMyths.pdf (732.96 kb))

It's the last day of our class here in Boston and I'm teaching all day today, so I need to bang out my Debunk-A-Myth-A-Day blog post before heading down from our hotel room. Incidentally, we've had fellow MVP Aaron Bertrand (Twitter|blog) in the class this week. He's a fine, very knowledgeable fellow and has added a lot to the discussions from his experiences and earlier in the week he wrote a blog post describing his views on the class, and how hard it can be to take a week off production work.

But I digress (as usual).

Myth #2: DBCC CHECKDB causes blocking because it takes locks by default.

FALSE

In 7.0 and before, DBCC CHECKDB (and the rest of the family of consistency checking commands) were a nasty mess of nested loop C code that took table locks (and the nested loops made the algorithm essentially order-n-squared, for the programmers amongst you). This was not good, and so...

In 2000, a guy called Steve Lindell (who's still on the SQL team) rewrote DBCC CHECKDB to get a consistent view of the database using transaction log analysis. Essentially DBCC CHECKDB would prevent log truncation and then at the end of reading through the inconsistent (because of concurrent user transactions) database, run crash recovery on the transaction log inside itself. Basically, there was a brand new reproduction of the recovery code, but inside DBCC CHECKDB. I helped write a bunch of the log analysis code - tortuous, but fun. No, more tortuous. And there were some little problems with it - like the possibility of false failures... "if it gave errors, run it again and see if you get the same errors". Occasionally it would take table SCH_S locks (schema-stability locks) that would only block table scans and table schema modifications. The logging code was overall not good, and so...

In 2005, a guy called Paul Randal rewrote DBCC CHECKDB again to use database snapshots to get the consistent view of the database (as a database snapshot automatically provides a transactionally-consistent, point-in-time view of the database). No more nasty transaction log analysis code, not more locks *at all* - as accesses to the source database of a database snapshot never take locks - the buffer pool manages all the possibilities of race conditions.

You can read more on the internals of this stuff (both 2000 and 2005+) in the following posts:

Now, in all versions, if you use the WITH TABLOCK option, DBCC CHECKDB will take locks to guarantee a transactionally consistent view, but I don't recommend doing that. The first thing it will try to do is grab an exclusive database lock, which in the vast majority of cases will fail (it only waits 20 seconds) because of concurrent database connections.

In 2000, the fact that it prevented log truncation could cause some issues - like the log having to grow - and in 2005, there can be issues around the use of database snapshots (see links above).

But by default, DBCC CHECKDB has been blocking-free since SQL Server 2000.

This is a quick post to clarify an article I saw on SQLServerCentral this morning that seemed to state that transaction rollbacks push data into database snapshots. This is absolutely not true.

A database page is copied into a database snapshot before it is changed in the source database. Although the mechanism is commonly called copy-on-write, it's more technically accurate to call it copy-before-write (but this makes it a bit harder to understand for many people). Once a page has been copied into the database snapshot, it is never removed from the database snapshot, and won't ever be copied into it again, as the database snapshot already has the correct point-in-time copy of the updated page. (For more info on database snapshots in general, see the Books Online entry Database Snapshots.)

A transaction makes one or more changes to the database, updating one or more pages. These pages will be copied into the database snapshot if they're not already there, so the pre-change image (that existed at the time the database snapshot was created) is preserved.

If the transaction rolls back, the rollback occurs by generating the reverse operations that the transaction performed and applying them to the database (e.g. an insert will be rolled back by the generation and application of a delete; an update will be rolled back by replacing the updated parts of the record with the pre-update values). I'll explain more about this in a future blog post.

These rollback operations will occur on the same pages that the initial transaction operations occured on. This means that no other pages will be changed by the rollback operations and so no further pages will be copied into the database snapshot by a rollback. Pages can't be removed from the database snapshot when the transaction rolls back because they have still changed in the source database (although the net effect of the transaction+rollback is no logical changes to the data, the page headers will have changed to have an updated Log Sequence Number on), and so the copy in the database snapshot is still required to preserve the point-in-time view of the database (at the physical level) as of the time the database snapshot was created.

I'll prove this to you with a simple script that you can play around with to convince yourself also.

USE master;
GO

DROP DATABASE SnapRollbackTest_Snapshot;
GO
DROP DATABASE SnapRollbackTest;
GO

CREATE DATABASE SnapRollbackTest;
GO
USE SnapRollbackTest;
GO

CREATE TABLE MyTable (c1 INT);
CREATE CLUSTERED INDEX MyTable_CL ON MyTable (c1);
GO

SET NOCOUNT ON;
GO

DECLARE @a INT;
SELECT @a = 1;
WHILE (@a < 100001)
BEGIN
    INSERT INTO MyTable (c1) VALUES (@a);
    SELECT @a = @a + 1;
END;
GO

CREATE DATABASE SnapRollbackTest_Snapshot ON
    (NAME = N'SnapRollbackTest', FILENAME = N'C:\SQLskills\test\SnapRollbackTest.mdfss')
AS SNAPSHOT OF SnapRollbackTest;
GO

-- Initial size
SELECT size_on_disk_bytes AS [Initial Size (bytes)] FROM sys.dm_io_virtual_file_stats (DB_ID ('SnapRollbackTest_Snapshot'), 1);
GO

-- Start transaction
BEGIN TRAN
GO
UPDATE MyTable SET c1 = 42;
GO
CHECKPOINT; -- to make sure absolutely everything is flushed to disk
GO

SELECT size_on_disk_bytes AS [After Transaction (bytes)] FROM sys.dm_io_virtual_file_stats (DB_ID ('SnapRollbackTest_Snapshot'), 1);
GO

-- Rollback
ROLLBACK TRAN;
GO

SELECT size_on_disk_bytes AS [After Rollback (bytes)] FROM sys.dm_io_virtual_file_stats (DB_ID ('SnapRollbackTest_Snapshot'), 1);
GO

Initial Size (bytes)
--------------------
196608

After Transaction (bytes)
-------------------------
1835008

After Rollback (bytes)
----------------------
1835008

You can clearly see that the size of the database snapshot did NOT increase at all because of the transaction rollback. Using the script above you can try this using a heap, clustered index, various combinations of row size and number of rows - the result will be the same - the database snapshot will not increase in size because of a transaction rollback. I tried a bunch of different combinations, all with the same result.

In the back of my head there's a niggly feeling that there's a funky, rare, pathalogical case where some weird combination of operations results in a page split when rolled-back, but I can't engineer it.

Bottom line - transaction rollbacks do not cause the database snapshot to increase in size, as the rollback operates on the database pages that have already been copied into the snapshot because they changed due to the operations of the transaction itself.

Hope this helps!

PS As one of the commenters pointed out, the initial snapshot size can be affected by the crash-recovery that is run when the snapshot is created. I go into details on that process in this post as it can be confusing when CHECKDB runs.

The 35-page whitepaper on high availability I wrote for the SQL team over the summer has been published on MSDN. It's a 2-300 level whitepaper that describes the various high-availability technologies in SQL Server 2008 and how they can be used to mitigate disasters. It's chock-full of links to other whitepapers, technical articles and Books Online sections and also presents my methodology for planning a high-availability strategy.

You can get it at http://msdn.microsoft.com/en-us/library/ee523927.aspx.

Enjoy!

Here's the table of contents:

  • Introduction
  • Causes of Downtime and Data Loss
    • Planned Downtime
    • Unplanned Downtime and Data Loss
  • Planning a High-Availability Strategy
    • Requirements
    • Limitations
    • Technology Evaluation
  • SQL Server 2008 High-Availability Technologies
    • Logging and Recovery
    • Backup, Restore, and Related Technologies
      • Partial Database Availability and Online Piecemeal Restore
      • Instant File Initialization
      • Mirrored Backups
      • Backup Checksums
      • Backup Compression
    • Other Single-Instance Technologies
      • Online Operations
      • Database Snapshots
      • Hot-Add Memory and CPU
      • Resource Governor
    • Multi-Instance Technologies
      • Log Shipping
      • Transactional Replication
      • Database Mirroring
      • Failover Clustering
      • Combining Multi-Instance Technologies
      • Virtualization
  • Mitigating the Causes of Downtime and Data Loss
  • High-Availability Features Supported by SQL Server 2008 Editions
  • Conclusion

There are a couple of issues that I've heard of in the last few weeks (one while onsite at a customer) and I think they might bite some people so I'd like to share them with you.

DBCC CHECKDB in 2005 onwards uses a hidden database snapshot to create the transactionally-consistent point-in-time view of the database that it requires to run the consistency checks. The hidden database snapshot is created as a set of NTFS alternate streams on the existing database data files. The alternative to having DBCC CHECKDB do this automatically is to manually create your own database snapshot and run DBCC CHECKDB against that - it's the same thing really.

More info on DBCC CHECKDB's use of snapshots, and potential problems can be found at:

The two issues that I've heard of both are around an inability of DBCC CHECKDB to create the hidden snapshot. In that case it is forced to use locks to stabilize the database, which usually fails because the exclusive database lock required for running the allocation checks portion cannot be acquired.

The first issue is around the permissions of the SQL Server service account. To be able to create the NTFS alternate streams, the service account must have the privileges to create files in the DATA directory of the SQL Server instance. This is a really difficult problem to track down as the actual NTFS failure message is not surfaced by the snapshot creation code.

The second issue is around the use of HP PolyServe. Upgrading to Matrix Server 3.6.1 disables support for alternate streams in the filesystem, effectively breaking DBCC CHECKDB. Here's the paragraph from the 3.6.1 upgrade guide (available here):

In previous releases, MxDB for SQL Server provided ADS support internally for use with various SQL Server features such as the DBCC CHECKDB command. This internal support has been removed in HP PolyServe Software for Microsoft SQL Server. Instead, after all servers are upgraded to 3.6.1, you will need to enable ADS support on all filesystems previously used with MxDB for SQL Server. During the upgrade to 3.6.1, SQL Server operations requiring ADS will fail, as the new ADS support feature is not yet in place on the nodes running 3.6.1. For continuity of SQL Server operations, it is important to upgrade all nodes to 3.6.1 and upgrade filesystems for ADS as quickly as possible.

Enabling support after the upgrade means running the PolyServe psfscheck command (which I believe just runs the NTFS fsutil command under the covers), which unfortunately means taking the volume momentarily offline.

Hope this helps!

Recently there's been a spate of people noticing strange behavior from active transactions when DBCC CHECKDB (or any of the other DBCC consistency checking commands run).

For example, I've create a database call DbccTest with a single table. In one connection I do:

BEGIN TRAN
INSERT INTO t1 VALUES (1, 1);
GO

And in another connection I do:

DBCC CHECKDB (DbccTest) WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO
 

Look what gets printed in the error log:

2009-05-04 16:03:21.55 spid54      1 transactions rolled back in database 'DbccTest' (14). This is an informational message only. No user action is required.
2009-05-04 16:03:21.91 spid54      DBCC CHECKDB (DbccTest) WITH all_errormsgs, no_infomsgs executed by ROADRUNNERPR\paul found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 0 seconds.

I've highlighted the weird part in bold - it looks like crash recovery ran on the DbccTest database. What's going on?

Well, crash recovery *DID* run on the DbccTest database - only it ran crash recovery into a hidden database snapshot. In the first part of the very long (well, 13 pages - not as long as the 70 page description in the 2008 internals book) post on how DBCC CHECKDB works (see CHECKDB From Every Angle: Complete description of all CHECKDB stages), I explain why DBCC CHECKDB needs a transactionally-consistent, point-in-time view of the database to run consistency checks on. In 2005 I changed the code to use a database snapshot - which by it's very nature provides a point-in-time, transactionally-consistent view of the database. When a database snapshot is created on a database, the transaction log of that database is examined, and crash-recovery is run on it, but *into the database snapshot* - the source database is totally unaffected. The message in the error log is from DBCC CHECKDB's hidden database snapshot starting up - but it's pretty misleading, I must admit.

One more twist to this behavior is that if there is a very long-running transaction in the database being consistency-checked, the database snapshot creation could take hours (because it has to wait for the long-running transaction to rollback, in the snapshot). If you think that the consistency check has hung, and try to kill it, you won't be able to and the SPID will appear as if it's in rollback. This is confusing because DBCC CHECKB doesn't do anything - so shouldn't have anything to rollback - but it's the database snapshot creation that's the problem. Once crash recovery has started on the database snapshot, it can't be interrupted - so you have to wait for it to finish (and the database snapshot to be created) before the DBCC command will actually stop, and remove the database snapshot. It's a weird side-effect, but a necessary one unfortunately. The recovery code could be changed to check for attention signals every so often I suppose, but I'm not holding-my-breath for that change.

Hope this helps.

As you may know, DBCC CHECKDB (and some of the other DBCC CHECK*) commands use an internal database snapshot to get a transactionally consistent view of the database (if you didn't, see my blog post CHECKDB From Every Angle: Complete description of all CHECKDB stages for an explanation). It's entirely possible that the database snapshot may grow and run out of space, causing the DBCC command to fail, when the database has a lot of updates running concurrently with the DBCC command (see the blog post Database snapshots - when things go wrong for some examples of this happening). There have also been some nasty bugs in SQL Server and in Windows that can cause the database snapshot to fail, and possibly cause SQL Server to hang when DBCC CHECKDB is running multi-threaded.

These bugs have been fixed and are available in a variety of ways. Suresh Kandoth, an old friend and Senior Escalation Engineer in PSS, has published a nice post that summarizes the various bugs, along with associated KB articles and release vehicle details. Check it out at Sparse File Errors: 1450 or 665 due to file fragmentation: Fixes and Workarounds.

Well, it was released a year ago but I only just found out about it last week. The SQL CAT team published a whitepaper last year called Database Snapshot Performance Considerations under I/O-Intensive Workloads. For a variety of workloads and with different numbers of concurrent database snapshots, it covers:

  • How much time does it take to create a database snapshot?
  • Transaction and I/O patterns during database snapshot creation
  • Transaction throughput and response time with multiple database snapshots
  • Impact of database snapshot on index creation
  • Restoring the source database from a database snapshot
  • Recommendations

Check it out at http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/DBSnapshotPerf.docx

Before I start, I want to make it clear that you can only hit this bug if you ALREADY have corruption, that it's quite rare, and that there is a workaround.

I've noticed a few more people in the forums having CHECKDB fail with this particular error in the last month

Msg 8967, Level 16, State 216, Line 1
An internal error occured in DBCC which prevented further processing. Please contact Product Support.

instead of completing properly and listing the corruptions in the database.

Whenever CHECKDB is using a database snapshot, it must check that the page it read through the snapshot does not have an LSN (Log Sequence Number) higher than that when the snapshot was created. If it did, this would mean that the page was modified AFTER the snapshot was created and hence CHECKDB would be working from an inconsistent view of the database. If this case is discovered, CHECKDB stops immediately. When I rewrote CHECKDB for SQL Server 2005, I changed a bunch of code assertions into seperate states of the 8967 error, so that CHECKDB would fail gracefully if some condition occured that indicates a bug or something that should never happen. State 216 is for the bad LSN condition I've just described.

I used to think it was caused by a race condition with the NTFS code that implements sparse files, which is used by the hidden database snapshot that CHECKDB uses by default. However, I've come to learn that this is a bug in CHECKDB (not one of mine I should say :-)) that causes this behavior under certain circumstances when corruption is present. The bug is that if a corrupt page fails auditing inside CHECKDB, the LSN check is still performed. If the corruption affects the LSN stamped in the page header, the 8967 error could be triggered. I've seen this a handful of times in the last few weeks - hence the need for a blog post. I've discussed this with the dev team and hopefully the fix will make it into the next SPs for 2005 and 2008 (too late to fix such a rare problem in such a critical component at this stage of 2008 development). They're going to put a KB article together too - but in the meantime, I wanted to get this on the Internet so Google/Live Search pick it up.

[Edit: June 24 2010 - this still hasn't been fixed AFAIK - see KB 960791 which just discusses the workaround.)

Now let's repro the problem. Starting with a simple database and table, I'll find the first page so I can corrupt it.

CREATE DATABASE TestCheckdbBug;
GO
USE TestCheckdbBug;
GO
CREATE TABLE test (c1 INT, c2 CHAR (5000));
INSERT INTO test VALUES (1, 'a');
GO EXEC sp_AllocationMetadata 'test';
GO

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
-----------  --------  -----------------  ---------------  ----------  ---------  --------------
test         0         72057594042318848  IN_ROW_DATA      (1:143)     (0:0)      (1:152)

Now I'm going to corrupt the page type on page (1:143) to be 255 (an invalid page type), which will guarantee the page fails the audit inside CHECKDB.

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Now I'm going to corrupt the LSN on that page such that it's guaranteed to be higher than the creation LSN of the database snapshot (basically by filling the first part of the page header LSN field with 0xFF).

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

Bingo! And in the error log, there's some diagnostic information so we can tell which page caused the problem:

2008-05-22 14:55:01.95 spid53   DBCC encountered a page with an LSN greater than the current end of log LSN (31:0:1) for its internal database snapshot. Could not read page (1:143), database 'TestCheckdbBug' (database ID 15), LSN = (-1:65535:18), type = 255, isInSparseFile = 0. Please re-run this DBCC command.
2008-05-22 14:55:01.95 spid53   DBCC CHECKDB (TestCheckdbBug) WITH all_errormsgs, no_infomsgs executed by ROADRUNNERPR\paul terminated abnormally due to error state 1. Elapsed time: 0 hours 0 minutes 0 seconds.

Note the page ID (in black bold above) tells us the bad page and the LSN (in blue bold above) reflects the corruption that I caused. If the page ID field of the header was corrupt, it wouldn't be possible to tell from these diagnostics which page is corrupt. However, all is not lost. This bug means that under these circumstances the default online behavior of CHECKDB can't run. The workaround is to use the WITH TABLOCK option of CHECKDB, which does offline checking and doesn't need the snapshot - but the trade-off is that an exclusive database lock is required for a short time and then shared table locks for all tables in the database (this is why online is the default). Running this option on my corrupt database gives:

DBCC CHECKDB ('TestCheckdbBug') WITH TABLOCK, ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Which are the exact same results we had before I corrupted the LSN field (this is expected, as there is no check of a page's LSN field EXCEPT when running from a database snapshot). Now we can proceed to restore/repair as appropriate.

So - a scary little bug that has caused some people headaches, but I want to stress again - this can only happen if the database is ALREADY corrupt, and that it's quite rare. Hope this helps some of you picking this up from search engines in the future.

A couple of weeks ago I blogged about the three tracks of the SQL Server 2008 JumpStart course that SQLskills.com taught internally for Microsoft and some MVPs - see here for details. Well, the content is now available to download! Note that this was based on CTP-5 (November 2007 CTP) and there have been *lots* of behavioral changes since then (with more planned for CTP-6 Refresh and RTM), but if you want a high-level overview of a bunch of the features (albeit in bullet-point summaries on slides) then this is a good place to start.

For me, what's *REALLY* cool is that the site also has a downloadable VPC plus lab manuals for all of the AlwaysOn High-Availability hands-on labs that SQLskills.com wrote. We originally wrote these labs for SQL Server 2005 and I updated them all for CTP-5. The VPC has a long lab on each of the following:

  • Database Snapshots
  • Data Recovery and Preventative Techniques
  • Instant Initialization
  • Peer-to-Peer Replication (including the new Topology Wizard I blogged about here)
  • Table and Index Partitioning
  • Snapshot Isolation
  • Online Operations
  • Database Mirroring (including a demo I wrote of Automatic Page Repair, described here)
  • Service Oriented Database Architecture

There is some great depth in each of these - Kimberly blogged more info about the exercises in each lab here. There's also another VPC image with some higher-level labs on a variety of 2008 features and written by a number of different people- including some labs on Policy-Based Management and Performance Data Collection that Kimberly wrote.

So - where can you get these from? Go to http://sqlserver2008jumpstart.microsofttraining.com/ and hit the Download link on the right-hand side. Register and then you can get to the materials. The AlwaysOn VPC image is Collection 2 at the bottom of the page, and you'll see all the slide decks as you scroll down the page.

Enjoy!

Here's an issue that I thought was a one-off but it just popped up again over the weekend so I want to publicize it.

DBA runs a DBCC CHECKDB and gets output like the following:

Msg 5172, Level 16, State 15, Line 1
The header for file 'E:\Data\namechanged.mdf:MSSQL_DBCC14' is not a valid database file header. The PageAudit property is incorrect.
Msg 5120, Level 16, State 9, Line 1
Unable to open the physical file "E:\Data\namechanged.mdf:MSSQL_DBCC14". Operating system error 0: "(null)".
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.

etc

Wow! Looks like something's seriously wrong with that database. Until you find that this happens on *all* the databases on the server. And on all the other servers in the data center too. What on earth's going on? I saw a similar issue once before, back in 2004 or 2005 while customers were testing pre-RTM IDW builds of SQL Server 2005. After running a test and doing some investigation for me, the person who just hit this issue over the weekend confirmed my hypothesis that he'd hit the problem I saw a few years ago.

The problem is a 3rd-party file encryption solution. It installs a file-system filter driver that filters all reads and writes to the database files so it can perform the encryption/decryption. Unfortunately it doesn't cope with NTFS alternate streams - which is how the internal database snapshots that DBCC CHECKDB uses are created - and so returns garbage for all reads from the database snapshot. The simple workaround is to create your own database snapshot (whose constituent files will exist in their own right, rather than being alternate streams of the source database's files) and then run DBCC CHECKDB against that.

Now - I don't want to cry 'wolf' here, but be careful of attributing massive corruption problems to the real IO subsystem if there's any kind of file-system filter driver installed.

This is a post I've been meaning to do for a while - detailing some of the problems you can run into when using database snapshots.

Reverting to a snapshot

Many people use snapshots as a way of protecting against mistakes during a complicated set of changes to a database - you can just revert to the snapshot if something goes wrong. However, reverting to a snapshot has a little-known (but documented) problem. Let's see. I'm going to create a database plus a snapshot of it and then start taking backups.

CREATE DATABASE SnapshotTest;
GO
ALTER DATABASE SnapshotTest SET RECOVERY FULL;
GO
BACKUP DATABASE SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest.bak' WITH INIT;
GO

CREATE TABLE SnapshotTest.dbo.MyTable (c1 INT);
GO

BACKUP LOG SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest_log.bak' WITH INIT;
GO

-- Imagine a bunch of things happen here

-- Create the snapshot database, by first checking which files exist
SELECT * FROM SnapshotTest.sys.database_files;
GO

CREATE DATABASE ST_Snap ON
(NAME = SnapshotTest, FILENAME = 'C:\SQLskills\SnapshotTest_snap.snp')
AS SNAPSHOT OF SnapshotTest;
GO

Now I'm going to do some stuff that's a mistake:

DROP TABLE SnapshotTest.dbo.MyTable;
GO

Not a problem as I can revert to my database snapshot:

RESTORE DATABASE SnapshotTest FROM DATABASE_SNAPSHOT = 'ST_Snap';
GO

Oops! I forgot to take a log backup to capture everything that happened since the last log backup...

BACKUP LOG SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest_log.bak';
GO
Msg 4214, Level 16, State 1, Line 1
BACKUP LOG cannot be performed because there is no current database backup.
Msg 3013, Level 16, State 1, Line 1
BACKUP LOG is terminating abnormally.

Reverting to a snapshot rebuilds the transaction log and breaks the log backup chain. The only thing I can do is take a full or differential backup and then start taking log backups again. So - I lost the ability to do point-in-time restores in the period from the last log backup to the time when I reverted to the snapshot.

So how can I tell when the database was reverted? Books Online documents that the restorehistory table in msdb should have an entry with restore_type = 'R'. Let's try:

SELECT * FROM msdb.dbo.restorehistory WHERE [destination_database_name] = 'SnapshotTest';
GO

Nope - nothing. That functionality seems to be broken. The only way I could find to trace reverting was in the error log:

2008-01-30 11:09:21.73 spid53      Reverting database 'SnapshotTest' to the point in time of database snapshot 'ST_Snap' with split point LSN 26000000013800001 (0x0000001a:0000008a:0001). This is an informational message only. No user action is required.
2008-01-30 11:09:21.74 spid53      Starting up database 'SnapshotTest'.
2008-01-30 11:09:21.74 spid53      The database 'SnapshotTest' is marked RESTORING and is in a state that does not allow recovery to be run.
2008-01-30 11:09:21.85 spid53      Starting up database 'SnapshotTest'.
2008-01-30 11:09:21.87 spid53      Starting up database 'ST_Snap'.

Note that the snapshot database is still there and continues to work.

Running out of space in a snapshot

The next problem that can happen is if you create a database snapshot on a volume that doesn't have much disk space. In that case you can actually run out of space and the snapshot will go suspect and be unusable until it's dropped and recreated. Let's see what happens:

-- Create snapshot of large database on a volume with not much space
SELECT * FROM SalesDB.sys.database_files;
GO

CREATE DATABASE SalesDB_Snap ON
(NAME = SalesDBData, FILENAME = 'D:\sqlskills\SalesDB_snap.snp')
AS SNAPSHOT OF SalesDB;
GO

ALTER INDEX ALL ON SalesDB.dbo.Sales REBUILD;
GO
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'D:\sqlskills\SalesDB_snap.snp' failed due to lack of disk space.

The query that causes the sparse file to run out of space doesn't fail, but the user running the query will see this error. Note that it doesn't say the snapshot has gone suspect. If you try to use it then it will tell you, plus there info written to the error log:

2008-01-30 11:50:29.14 spid53      Error: 17053, Severity: 16, State: 1.
2008-01-30 11:50:29.14 spid53      D:\sqlskills\SalesDB_snap.snp: Operating system error 112(There is not enough space on the disk.) encountered.
2008-01-30 11:50:29.15 spid53      Error: 3420, Severity: 21, State: 1.
2008-01-30 11:50:29.15 spid53      Database snapshot 'SalesDB_Snap' has failed an IO operation and is marked suspect.  It must be dropped and recreated.

Running out of space while running DBCC CHECKDB

One thing that isn't common knowledge is that DBCC CHECKDB uses database snapshots as it's mechanism for running online in SQL Server 2005. You can't control where the snapshot is created (it's actually created in the same location as the files comprising the database being checked - see my previous post on the DBCC CHECKDB stages here for more info) but you can create your own snapshot and check that. Anyway - it's possible to run out of space in the snapshot while DBCC CHECKDB is running. I setup a situation with a large database on a drive with only 1MB of space left, started a large index rebuild in the database and then tried to run DBCC CHECKDB. Let's see what happened:

The index rebuild statement:

ALTER INDEX ALL ON SalesDB2.dbo.Sales REBUILD;
GO
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'd:\sqlskills\salesdbdata.mdf:MSSQL_DBCC20' failed due to lack of disk space.

So some poor unsuspecting user gets this error and has no idea why. For DBAs its a bit obscure too. The syntax is describing an alternate stream on the existing data file and the snapshot in question can't be accessed at all.

The DBCC CHECKDB statement:

DBCC CHECKDB (salesdb2) WITH NO_INFOMSGS;
GO
Msg 1823, Level 16, State 1, Line 1
A database snapshot cannot be created because it failed to start.
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'd:\sqlskills\salesdbdata.mdf:MSSQL_DBCC20' failed due to lack of disk space.
Msg 3313, Level 21, State 2, Line 1
During redoing of a logged operation in database 'salesdb2', an error occurred at log record ID (1628:252:1). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command.  The results, if any, should be discarded.

Wow - major amounts of errors - and it all boils down to running out of space.

One of the sessions we did yesterday at IT Forum was an Instructor-Led Lab on database snapshots, an Enterprise Edition feature of SQL Server 2005. Database snapshots are not particularly well known in the DBA community and there are many misconceptions about them. The session generated a bunch of questions, some of which I'll answer here.

Q1) Can you change the data in table in a database snapshot?

A1) No.

Q2) Can you change the permissions in a database snapshot?

A2) Nope, sorry.

Q3) Can you backup a database snapshot?

A3) Afraid not.

A4) Can you detach a database snapshot?

A4) Err, nope.

You may have noticed a pattern here :-) Basically, the only things you can do with a database snapshot are select from it, and revert to it - i.e. rollback the entire database on which it is based to the point-in-time at which the database snapshot was created. Database snapshots are not updateable in any way, and at the time I left MS, there were no plans to change that in the future. Regardless of these limitations, database snapshots are very useful. I searched both our blogs for some examples to link to and couldn't find any so I'll put together a post on using them sometime over the next week or so.

The final question I *was* able to answer successfully for a very happy conference attendee:

Q5) Can you create a partial snapshot on a database, say a single filegroup, for reporting?

A5) No - HOWEVER, if you're on Enterprise Edition, you can make use of partial database availability. Hopefully you have your database structured so the primary filegroup has nothing but system tables in, and you have the table you're interested in isolated in a seperate filegroup - let's call it filegroup X. (He answered 'yes' to all three!). If that is that case, backup the primary filegroup and filegroup X. Then restore the primary filegroup followed by filegroup X in a separate location. As long as the primary filegroup of a database is online in SQL Server 2005 Enterprise Edition, then the database is online and any other online filegroups are available - partial database availability. This means you've effectively created a point-in-time snapshot of a single filegroup, albeit at the expense of having to have a full copy of the filegroup.

Theme design by Nukeation based on Jelle Druyts