SQLskills SQL101: Why does repair invalidate replication subscriptions?

SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Whenever I’m teaching and recovering from corruption, I always stress that if the REPAIR_ALLOW_DATA_LOSS option of DBCC CHECKDB (which I’ll just call ‘repair’ from now on) has to be used, you should do the following:

  • Run another DBCC CHECKDB to make sure that repair fixed everything and no new problems were exposed (by the repair fixing something that was blocking DBCC CHECKDB‘s access to a portion of the database)
  • Run DBCC CHECKCONSTRAINTS on any tables or indexes that were repaired and are involved in constraints (stated in Books Online too)
  • Reinitialize any replication subscriptions that are based on tables that were repaired (stated in Books Online too)
  • Reevaluate your HA/DR strategy so you don’t have to use repair again in future

One question I’m often asked is about why replication can be broken by running repair.

There are two ways that replication can be affected: repairs on replication metadata tables, and repairs on anything else to do with a subscription.

Repairs on replication metadata tables

This is the simplest case to explain. If the repair operation affects any of the replication metadata tables (i.e. deleted some data from them), the entire replication publication will be in an inconsistent state and you should remove replication completely from the database an reinitialize it. This isn’t limited to a single subscription – all replication should be reconfigured.

Repairs on anything else

Transaction replication captures changes to the publication database by analyzing the transaction log, looking for transactions that change data in any of the publications, and converting those operations into logical operations that can be applied to the subscribers. Merge replication captures changes to the publication database using DML triggers and converting those operations into logical operations that can be applied to the subscribers.

Neither of these mechanisms can capture of operations performed by repair. Repair operations are always direct physical changes to the database structures to fix inconsistencies in the structures (e.g. a database page, table record, or a linkage between two pages), as opposed to physical changes because of queries performing inserts, updates, or deletes on tables.

These repair operations cannot translated into logical operations that can be applied to replication subscribers because there are no logical operations than can be expressed using T-SQL for the equivalent of the direct structural changes that repair is performing. Replication does not preserve the exact physical location of a particular record between the publication and subscription databases, so a direct change to record Y on page X in the publication database would not be able to be replayed on the subscription database (remember, replication ships logical changes, not physical changes). This means that if any table that is part of a replication publication is changed by a repair operation, the replication subscription is no longer valid and must be reinitialized.

As an example, imagine a repair operation is forced to remove a data page from a unique clustered index (essentially deleting some table records), and the subscription is NOT reinitialized. Those records would still exist on the replicated copy of the table. If a subsequent insert operation inserted records with cluster key values corresponding to the records deleted by the repair operation, the Distribution Agent would fail when trying to apply the inserts to the replicated copy – because the repair operation was not applied to the subscription database and a duplicate key violation error will occur when attempting to apply the insert to the replicated table.

A replication subscription must always be reinitialized if any table in the publication is affected by a repair operation, or the replication metadata tables are repaired.

Thanks

REPLICATION preventing log reuse but no replication configured

Last week, for the second time in as many weeks, I was sent a question in email from someone who had a transaction log that was growing out of control. He’d already queried log_reuse_wait_desc for the database (see this post for some more background) and the result was REPLICATION.

The problem was, there was no replication configured for that database. Just for completeness, he tried turning off the publish and merge publish settings for the database with sp_replicationdboption, ran sp_removedbreplication on the database, and then when those didn’t work, he also tried configuring and then removing replication. Nothing worked and the transaction log kept growing.

The problem turned out to be Change Data Capture. CDC uses the replication log scanning mechanism to harvest changes from the database, either piggy-backing on replication’s Log Reader Agent job or creating it’s own capture job if replication isn’t configured. If CDC is configured but the capture job isn’t running, the log_reuse_wait_desc will show as REPLICATION, as the log manager doesn’t have any way to know *why* the replication log scanner is configured, just that it is, and it hasn’t run.

So, if you ever see REPLICATION as the log_reuse_wait_desc and don’t have replication configured, check the is_cdc_enabled flag in sys.databases too. And then either figure out why the CDC capture job isn’t running correctly (see Change Data Capture Agent Jobs in this BOL entry), or remove CDC if it’s not supposed to be there (see this BOL entry).

Hope that helps a few people!

In defense of transactional replication as an HA technology

Yesterday on Twitter a few people expressed disgust at a conference slide listing transactional replication as a high-availability (HA) technology. I took exception to that and argued. I think the discussion merits a blog post so here it is. This point of view is controversial and I'm expecting some dissenting comments – bring them on!

I teach high-availability inside Microsoft and at the MCM level and I absolutely always teach that transactional replication, and it's more advanced brother – peer-to-peer replication, have their place as a high-availability technology. The two whitepapers on high availability I've written for Microsoft over the last two years also discuss transactional replication as a high-availability technology (links later).

To start with, let's define a high-availability technology as one which makes data more highly available in the event of a disaster. That definition does not imply any limits on the amount of data loss or downtime experienced – and to do so would be incorrect.

As some background, many people approach HA planning the wrong way – either by trying to shoe-horn a new strategy onto an incumbent and unsuitable technology or just picking the technology they've heard of – which is most commonly failover clustering. Planning an HA strategy means carefully laying out the business requirements, considering technical and non-technical limitations, reaching a compromise between the technologists and the business managers, and THEN starting to evaluate technologies. Once you start evaluating technologies you'll find that no single HA technology that SQL Server provides is a one-size-fits-all solution – they all have pros and cons. The resulting strategy will likely utilize multiple technologies to meet the strategy's needs.

You can read in more depth about my HA planning methodology in the whitepaper I wrote for Microsoft last year: High Availability with SQL Server 2008. This also has a good overview of transactional and peer-to-peer replication.

There are seven main arguments I hear against transactional replication as an HA technology – I'll take them on one at a time.

Argument 1: It's a scale-out technology, not an HA technology

So where's the actual argument here? Sure, transactional replication and it's more advanced brother, peer-to-peer replication, are primarily query scale-out technologies. But both provide extra copies of some data, so both can be used to provide higher availability of that data in the event of a disaster.

Argument 2: There's too much latency involved – how can that be HA?

High availability does not mean zero data loss and zero downtime. High availability means the data is more available in the event of a disaster than if you had no redundant copy. Sure, there is some latency involved with transaction replication but there's also latency involved with log shipping, and with asynchronous database mirroring.

The key is making sure that the availability provided by the technology fits within your desired downtime and data loss SLAs (service level agreements) – commonly known as RTO (recovery time objective) and RPO (recovery point objective), respectively. If you're fine with potentially losing 1/2 hour of data, and the transactional replication latency between the Publisher and Subscriber is only 10 minutes, then transactional replication can meet your data loss requirements.

Argument 3: It's really hard to know what you're missing if the Publisher crashes

Correct. But it's just the same as the other asynchronous technologies – log shipping and asynchronous database mirroring – so this is not a valid argument against transactional replication.

Argument 4: Replication is too complicated to set up

This argument is asinine IMHO. A technology may be complicated to use or troubleshoot, but that does not make it unsuitable – it just means you have to have someone who knows what they're doing. I've seen people mess up configuring log shipping – does that mean it's not an HA technology? No. Failover clustering can be fiendishly difficult to set up at the Windows level – does that mean it's not an HA technology? No. This argument is also invalid.

Argument 5: It only replicates some of the data

Correct. This means it's unsuitable if you'd like to protect an entire database or several databases – but if its restrictions work for you, no problem.

Argument 6: You can't make certain schema changes on a published table

Correct. But that doesn't mean it's not suitable for providing HA – as long as you can live with its restrictions.

Argument 7: There's no automatic failure detection or automatic failover

Correct. But neither does log shipping. This just means you have to provide the mechanisms for this. Transactional replication still means your data is available if the Publisher crashes.

[Edit: Argument 8: There's no easy way to fail back again

Correct – for transactional replication – that can be a PITA. With peer-to-peer replication that problem disappears because the changes that happened while one peer node was down are automatically pushed to it when it comes back online.]

Arguments FOR using transactional replication

So these arguments are really saying that transactional replication has some restrictions. Sure it does. As I said above, no single HA technology is a one-size-fits-all. Now, what about all the good things about transactional replication?

  • It only replicates the data you're interested in. Unlike log shipping or database mirroring, transactional replication only ships changes to the data you're interested in, rather than the entire database. Yes, there are applications where this is good. And it can be especially useful if you have to perform regular index maintenance on large indexes and you don't have the network bandwidth to ship transaction log for these.
  • It works in the SIMPLE recovery model, unlike log shipping or database mirroring, where you MUST be in the FULL recovery model (BULK_LOGGED is also permissible with log shipping). You don't have to perfom log management with log backups just because you're using an HA technology. Yes, there are strategies who don't want to be able to restore the entire database to a point in time.
  • You can have as many Subscribers as you want. Well, to a point. You need to be careful of network bandwidth from the Distributor, the Distribution Agent location (especially if they're all push subscriptions), and some other things. Log shipping also allows as many secondaries as you want but database mirroring can only have one mirror.
  • You can make use of republishing (having a Subscriber that is also a Publisher to down-stream Subscribers), which allows some interesting topologies and balancing of network traffic. No other SQL Server HA technology allows this.
  • The replicated data is available for reading and updating on both sides (and changes can then be published again using updateable subscriptions, or more easily with peer-to-peer replication). No other SQL Server HA technology allows this and this is one of the primary reasons why transactional replication is sometimes the necessary choice.
  • There's no wait time when a failover is necessary – the data is just there. With log shipping, recovery has to complete – which could take a while for any uncommitted transactions to roll back. With database mirroring, the REDO queue on the mirror must be recovered before the mirror database can come online as the principal. 

Summary

It all comes down to making rational choices of technologies based on the HA strategy requirements and limitations. Transactional replication will work for some of you, and not for others – it has its limitations just like the other SQL Server HA technologies. But it should never be dismissed out-of-hand as not being a high availability technology.

Take a look at the whitepaper I wrote for Microsoft this year – Proven SQL Server Architectures for High Availability and Disaster Recovery – where one of the five real-world customer examples (with customer names, case studies, etc) is about using transactional and peer-to-peer replication. That example links to a very comprehensive document the SQL CAT team wrote about the decision process that led to using transactional replication and all the setup and tuning that was done.

[Edit: one more whitepaper I wrote in 2008 that you might be interested in: SQL Server Replication: Providing High Availability using Database Mirroring.]

Thanks for reading!

PS If you post a comment and it doesn't appear it's because I'm moderating comments against spam – not to filter out dissention. Be patient and it will appear!