New Pluralsight course: Understanding, Configuring, and Troubleshooting Database Mirroring

Glenn’s latest Pluralsight course has been published – SQL Server: Understanding, Configuring and Troubleshooting Database Mirroring. It’s 2.5 hours long, and from the course description:

SQL Server database mirroring is a very useful technology for high availability and disaster recovery, as well as for general purpose database movement, migration, and upgrades. In this course, SQL Server: Understanding, Configuring and Troubleshooting Database Mirroring, you’ll first learn the basics of database mirroring, its capabilities, and its limitations. Next, you’ll explore how to prepare an environment for database mirroring, and then how to configure database mirroring using either SQL Server Management Studio or T-SQL scripts. Then, you’ll learn how to monitor a database mirroring environment and troubleshoot common problems. Finally, you’ll discover how to use database mirroring for data migration and upgrading a database to a new version of SQL Server. When you’ve finished this course, you’ll have the skills and knowledge to start configuring, using and troubleshooting database mirroring.

The modules are:

  • Introduction
  • Database Mirroring Overview
  • Preparing for Database Mirroring
  • Configuring Database Mirroring
  • Monitoring and Troubleshooting Database Mirroring
  • Migrating Data Using Database Mirroring

Check it out here.

We now have more than 175 hours of SQLskills online training available (see all our 60 courses here), all for as little as $29/month through Pluralsight (including more than 6,000 other developer and IT training courses). That’s unbeatable value that you can’t afford to ignore.


When is fast recovery used?

It’s been a bit light on technical posts here over the last few months but now that summer’s over I’ll be ramping up again with lots of cool stuff planned.

First up is a question that came up on the MCM distribution list this morning. There was a discussion of fast recovery (which I explained in detail in the post Lock Logging and Fast Recovery back in 2009), but in a nutshell is the ability of Enterprise Edition to allow access to a database after the REDO (rolling forward committed transactions) phase of crash recovery has completed and before the UNDO (rolling back uncommitted transactions) phase of crash recovery has completed. The idea is that UNDO can take much longer than REDO, so early access to the database is a good thing, hence it being an Enterprise Edition feature (from SQL Server 2005 onward).

The question essentially became: when is fast recovery used?

The answer is that it’s used whenever a database is started up and needs to have recovery run on it. This means fast recovery will be used:

  • When SQL Server starts up after a crash or shutdown where a database was not cleanly shut down
  • After a cluster failover
  • After a database mirroring failover
  • After an availability group failover
  • When a database state is changed to ONLINE and crash recovery needs to be run

Note that I did not include:

  • When restoring a database from backups
  • When bringing a log shipping secondary database online (this is restoring from backups)

Fast recovery is NOT used during a restore operation. You’ll read in some places online that it is, but those places are incorrect.

So why isn’t it used during a restore sequence?

It’s to do with the underlying mechanism that allows fast recovery. Operations that make changes to a database are logged, and the log record includes a bitmap of what locks were held at the time (examples of this are in the blog post I referenced above). When crash recovery runs, the REDO phase also acquires all the locks necessary to do the UNDO phase, as the REDO phase knows which transactions in the log being recovered need to be rolled back. At the end of the REDO phase, access can be given to the database because recovery can guarantee that no user can block the UNDO phase, as the UNDO phase locks are already held.

So why doesn’t that mechanism work for restores? Well restore doesn’t do one REDO and one UNDO like crash recovery does. For each backup that is restored in the restore sequence, the REDO phase of recovery is performed. This avoids having a really long REDO phase at the end of the restore sequence (which could be, say, a week’s worth of transactions spread over tens or hundreds of backups), and having to have a huge transaction log to hold all those log records.

At the end of the restore sequence, all necessary REDO has already been performed, but the REDO operations have NOT been acquiring UNDO locks. The UNDO locks aren’t acquired because UNDO isn’t likely to be the next phase during a restore sequence. It’s likely to be another restore operation. In that case, it’s likely that some of the transactions that were uncommitted at the end of the last restore become committed during the next restore, so if UNDO locks had been acquired, they would have to be released again. This would involve either rescanning the log records involved or keeping track of which in-restore transactions had acquired which locks. Either of these would be complicated and time consuming, so the benefit hasn’t been deemed worthwhile for the engineering effort involved.

So no fast recovery during restores.

But hold on, I hear you say, database mirroring is just a constant REDO process so how come fast recovery works for that? Back in SQL Server 2005, when a database mirroring failover occurred, the database was momentarily set offline so that full crash recovery would be run when the database came back online, hence allowing fast recovery to work. From SQL Server 2008 onward, that doesn’t happen any more, so there is a mechanism that figures out what UNDO locks are necessary when a mirroring failover occurs, allowing fast recovery behavior. I guess technically that same mechanism could be ported over to the restore code base, but I think it would be difficult to do, and I don’t think there’s enough demand to make the engineering effort and possible destabilization of the restore code worthwhile.

Hope this helps explain things – let me know if you have any questions.

Database mirroring: avoiding ‘cannot obtain a LOCK resource’ problems

This is an interesting case that came up recently, and that I’ve seen a few times before, but which I haven’t seen explained anywhere.

Using database mirroring, you may see 1204 and 1454 errors on the mirror server with no other databases or activity on the mirror server apart from database mirroring:

2013-01-15 12:00:00.410 spid29s Error: 1204, Severity: 19, State: 4.
2013-01-15 12:00:00.410 spid29s The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
2013-01-15 12:00:00.430 spid29s Error: 1454, Severity: 16, State: 1.
2013-01-15 12:00:00.430 spid29s Database mirroring will be suspended. Server instance 'instancename' encountered error 1204, state 4, severity 19 when it was acting as a mirroring partner for database 'dbname'. The database mirroring partners might try to recover automatically from the error and resume the mirroring session. For more information, view the error log for additional error messages.

With no activity on the mirror server apart from mirroring, why are locks being taken?

The answer is that locks must be acquired when a transaction is rolled back on the mirror and is just how logging and recovery work. I discussed this some in my post Lock logging and fast recovery back in 2009 – locks are acquired during rollback to ensure that other transactions don’t prevent rollback from completing. The behavior is the same when a mirrored transaction rolls back – the locks need to be acquired on the mirror before the rollback can commence.

You can easily see this for yourself. Create a mirroring session and then a constant insert workload of small transactions. If you look at the locks on the mirror server using sys.dm_tran_locks, you’ll see no locks being acquired in the mirror database. Now create a transaction on the principal server that inserts 10,000 records. Roll it back and quickly examine the locks on the mirror server – you’ll see thousands of locks being acquired in the mirror database.

Why can we see 1204 errors with database mirroring?

Each lock structure takes up a small amount of memory and the more locks that are acquired and held at any one time, the more memory is required for the lock structures. On a memory-constrained system it’s possible to run out of lock memory and result in error 1204.

With database mirroring, if the principal server has a lot more memory than the mirror server, and there are a lot of large rollbacks occurring, or there a multiple mirroring sessions where rollbacks are occurring, it’s possible that the mirror server will run out of lock memory, resulting in errors 1204 and 1454 as above.

How can we stop 1204 errors occurring?

The easiest fix for this (and to hopefully prevent it happening) is to ensure that the mirror server has the same amount of memory as the principal server – so the principal should not be able to create more locks than the mirror server can accommodate. Even then, you may still see these errors if there is a lot of rollback activity across mirrored databases and it all happens to be replayed on the mirror server at the same time – in which case you need more memory on the mirror server than on the principal.

You need to also ensure that nothing else is using memory on the mirror server – e.g. using database snapshots, or any other activity apart from just mirroring.

I hope this explanation helps prevent some of you out there wasting time trying to troubleshoot this issue.