The Curious Case of… emergency-mode repair

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)

I had a blog comment question a few days ago that asked why emergency-mode repair requires the database to be in EMERGENCY mode as well as SINGLE_USER mode.

All repair operations that DBCC CHECKDB (and related commands) performs require the database to be in single-user mode so there’s a guarantee that nothing can be changing while the checks and repairs are done. But that doesn’t change the behavior of what repair does – that needs emergency mode too.

Emergency mode is internally known as ‘bypass recovery’ mode, and is used when a database hasn’t had crash recovery run on it, because of corruption issues or a missing or damaged log file. It allows access to the database in the unrecovered state, which may mean the database has inconsistencies in its structures and/or data because of in-flight transactions that have not been rolled back (as recovery couldn’t run).

When a database is in emergency mode, it’s not possible to create a database snapshot (as that involves running recovery) and so a DBCC CHECKDB without a repair option simply treats the database as if it’s read-only, and runs the consistency checks. When a repair option *is* specified (and only REPAIR_ALLOW_DATA_LOSS is permitted) in emergency mode, that tells DBCC CHECKDB that the drastic, last-resort emergency mode repair should be performed. This will:

  • Run as much crash recovery as possible, skipping errors, unlike real crash recovery which will fail when it encounters an error
  • Delete the transaction log
  • Create a new transaction log
  • Run a regular REPAIR_ALLOW_DATA_LOSS check and repair
  • Bring the database online if possible, albeit maybe with corruptions still in the database

So emergency mode isn’t single-user mode, and vice versa – they’re both required for an emergency-mode repair to run.

You can read more about emergency-mode repair in my blog post EMERGENCY-mode repair: the very, very last resort.

SQLskills SQL101: Why does repair invalidate replication subscriptions?

SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Whenever I’m teaching and recovering from corruption, I always stress that if the REPAIR_ALLOW_DATA_LOSS option of DBCC CHECKDB (which I’ll just call ‘repair’ from now on) has to be used, you should do the following:

  • Run another DBCC CHECKDB to make sure that repair fixed everything and no new problems were exposed (by the repair fixing something that was blocking DBCC CHECKDB‘s access to a portion of the database)
  • Run DBCC CHECKCONSTRAINTS on any tables or indexes that were repaired and are involved in constraints (stated in Books Online too)
  • Reinitialize any replication subscriptions that are based on tables that were repaired (stated in Books Online too)
  • Reevaluate your HA/DR strategy so you don’t have to use repair again in future

One question I’m often asked is about why replication can be broken by running repair.

There are two ways that replication can be affected: repairs on replication metadata tables, and repairs on anything else to do with a subscription.

Repairs on replication metadata tables

This is the simplest case to explain. If the repair operation affects any of the replication metadata tables (i.e. deleted some data from them), the entire replication publication will be in an inconsistent state and you should remove replication completely from the database an reinitialize it. This isn’t limited to a single subscription – all replication should be reconfigured.

Repairs on anything else

Transaction replication captures changes to the publication database by analyzing the transaction log, looking for transactions that change data in any of the publications, and converting those operations into logical operations that can be applied to the subscribers. Merge replication captures changes to the publication database using DML triggers and converting those operations into logical operations that can be applied to the subscribers.

Neither of these mechanisms can capture of operations performed by repair. Repair operations are always direct physical changes to the database structures to fix inconsistencies in the structures (e.g. a database page, table record, or a linkage between two pages), as opposed to physical changes because of queries performing inserts, updates, or deletes on tables.

These repair operations cannot translated into logical operations that can be applied to replication subscribers because there are no logical operations than can be expressed using T-SQL for the equivalent of the direct structural changes that repair is performing. Replication does not preserve the exact physical location of a particular record between the publication and subscription databases, so a direct change to record Y on page X in the publication database would not be able to be replayed on the subscription database (remember, replication ships logical changes, not physical changes). This means that if any table that is part of a replication publication is changed by a repair operation, the replication subscription is no longer valid and must be reinitialized.

As an example, imagine a repair operation is forced to remove a data page from a unique clustered index (essentially deleting some table records), and the subscription is NOT reinitialized. Those records would still exist on the replicated copy of the table. If a subsequent insert operation inserted records with cluster key values corresponding to the records deleted by the repair operation, the Distribution Agent would fail when trying to apply the inserts to the replicated copy – because the repair operation was not applied to the subscription database and a duplicate key violation error will occur when attempting to apply the insert to the replicated table.

A replication subscription must always be reinitialized if any table in the publication is affected by a repair operation, or the replication metadata tables are repaired.

Thanks

SQL Server Magazine: feature article on using database repair

The September SQL Server Magazine articles are now available on the web and include my latest feature article on Using Database Repair for Disaster Recovery.

It includes a detailed walk-through of a disaster scenario where all backups include the corruption – showing you how to run repair and then try to recover some of the damaged data from an older backup.

You can get to the article HERE (no login required).

Enjoy!

PS Note that the posting date on the article says 2010 – the online editors at SQLMag will fix it shortly.