The Curious Case of… whether corruption can propagate to secondary databases

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or was asked in a random question from the community.)

We’ve had a few new clients come to us recently after experiencing corruption, and they’ve been worried about whether physical corruption can propagate to secondary databases (like an availability group secondary or log shipping secondary) through the mechanism used to maintain the secondary database in sync with the primary database. I explained how it’s very rare for that to happen, but sometimes it does, and not in a way you’d think. Read on…

Physical corruptions are mostly caused by something in the I/O subsystem – all the hardware and software underneath SQL Server, including the OS and filter drivers. I literally mean *anything* under SQL Server’s buffer pool in the I/O stack. Corruptions can happen in the data files or the log file.

Secondary copies are maintained by shipping log blocks, containing log records, and replaying the log records. For availability groups and database mirroring, there’s a constant copy-from-primary-and-replay-on-secondary process going on, and for log shipping, it’s a back-up-log-copy-restore process – but it’s essentially the same thing: replaying log records.

Apart from the initial full backup that is taken of the primary database and used to create the secondary database, data file pages are never subsequently copied to a secondary database when using availability groups or mirroring. They may be copied during log shipping, if an index operation is performed in the bulk-logged recovery model. Note that if the initial full backup backs up a corrupt database, the secondary copies initialized from that full backup will of course be corrupt in the same way. But that’s not what we’re talking about here.

Let’s assume that the database in question was created with page checksums enabled, as it’s been the default for 15 years now.
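A quick way to confirm that setting; this is just a sketch, and YourDatabase is a hypothetical name:

```sql
-- Check the page verify setting for every database on the instance
SELECT name, page_verify_option_desc
FROM sys.databases;

-- Enable page checksums where needed (YourDatabase is a hypothetical name)
ALTER DATABASE YourDatabase SET PAGE_VERIFY CHECKSUM;
```

Bear in mind that switching to CHECKSUM only protects a page from the next time it’s written to disk; existing pages don’t get a checksum until they’re next modified and flushed.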

For corruptions caused by the I/O subsystem, there are several cases to consider:

  1. Data file page corrupted. When the page is read back into memory, the page checksum will be discovered to be invalid and an 824 error will result. The only possible way a data file page can be copied to a secondary is in the log shipping case I explained above, and as long as the BACKUP LOG statement uses the CHECKSUM option, the corruption will be discovered and the backup will fail.
  2. Log block corrupted. The log reader will notice the block is corrupt when reading it, or the log replay on the secondary will fail.
  3. Log backup is corrupted before replay on a secondary. As long as the BACKUP LOG and RESTORE LOG statements both use the CHECKSUM option, this will be discovered and the restore will fail (see the sketch after this list).
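As a sketch of what that looks like for the log shipping case (the database and file names are hypothetical):

```sql
-- On the primary: WITH CHECKSUM validates the page checksums of any pages
-- included in the backup, and adds a checksum over the whole backup
BACKUP LOG YourDatabase
TO DISK = N'X:\LogBackups\YourDatabase_log.trn'
WITH CHECKSUM;

-- On the secondary: the backup checksum is re-verified during the restore
RESTORE LOG YourDatabase
FROM DISK = N'X:\LogBackups\YourDatabase_log.trn'
WITH CHECKSUM, NORECOVERY;
```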

Basically, a corruption caused by the I/O subsystem will not propagate, as long as page checksums are enabled.

THIS IS WHY YOU CAN’T OFFLOAD CONSISTENCY CHECKS TO A SECONDARY COPY!

Yes, I shouted that. Running DBCC CHECKDB on a secondary copy tells you nothing about the primary copy, or any other secondary copy. You have to consistency check *all* copies of the database.
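In practice that means scheduling something like the following against every copy, not just against one secondary (the database name is hypothetical):

```sql
-- Run on the primary AND on each secondary copy (or a restored copy of it)
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```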

Now let me worry you a bit…

There is a way for corruption to propagate to all copies – if a page gets physically corrupted in the buffer pool by bad memory or a SQL Server bug, it will then be written to disk with a valid page checksum. When it’s read back into memory, the page checksum will still be valid, and the page will still contain the corruption. If that corruption happens to be in a column value, and that column value is used to calculate *another* value, which is persisted in the database, that incorrect value will be logged, and replayed on all the secondary copies.
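To make that concrete, here’s a hypothetical sketch of that kind of dependency (the table and column names are made up):

```sql
-- A derived total is calculated from line-item prices and persisted in a second table
CREATE TABLE dbo.OrderLines  (OrderID int, Price money);
CREATE TABLE dbo.OrderTotals (OrderID int, Total money);

-- If a Price value has been corrupted in memory (under a valid page checksum),
-- the SUM below is wrong, the INSERT of the wrong total is logged as normal,
-- and that log record replays cleanly on every secondary copy
INSERT INTO dbo.OrderTotals (OrderID, Total)
SELECT OrderID, SUM(Price)
FROM dbo.OrderLines
GROUP BY OrderID;
```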

I call this ‘second-order corruption propagation’. I’ve never seen it in the wild, but it’s theoretically possible.

Is there anything you can do about the possibility of it? No. You might think of ECC memory chips and error 832 from SQL Server 2012 onward, but that’s only checking whether *unchanged* pages in the buffer pool have been corrupted by bad memory. Once a page is changed in the buffer pool, the 832 checks don’t occur for that page.

But the chances of this happening in such a way that it is able to propagate are vanishingly small, so I wouldn’t worry about it.

Bottom line: my answer when asked whether physical corruption can propagate is this: as long as you’re using page checksums and backup/restore checksums, no.

PS Logical corruption is a different matter. If SQL Server messes up something like an index rebuild or an indexed view, that logical corruption will absolutely flow through to the secondary copies. You have to run DBCC CHECKDB everywhere!
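Note that on databases at compatibility level 100 or higher, the deeper logical checks of indexed views (and XML and spatial indexes) are only performed when explicitly requested. A sketch, with a hypothetical database name:

```sql
-- Adds the extra logical checks of indexed views, XML indexes, and spatial indexes
DBCC CHECKDB (N'YourDatabase') WITH EXTENDED_LOGICAL_CHECKS, NO_INFOMSGS;
```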

18 thoughts on “The Curious Case of… whether corruption can propagate to secondary databases”

  1. How about the following scenario: the system runs an AlwaysOn availability group in an environment where, to force developers to design reliable systems, random servers are wiped and rebuilt from scratch every day (the setup is fully automated). The part of the setup for servers that are supposed to be members of that availability group goes roughly like this: if a machine starts with a SQL Server instance that has none of the expected DBs, or those DBs are not part of the availability group, it finds out which server is the current primary in that AG, triggers a DB backup, restores it on the local server, then joins the AG. I’m pretty sure that’s how corruption spread to the secondary replicas of our DBs at my previous job, but when I tried to explain that to our DB guys (who were not directly involved in maintaining those DBs) they said I was crazy.

  2. Excellent! Thanks for sharing this level of information about SQL!

    Because of all this, is it a good strategy to always restore the last full/diff backup taken before a successful CHECKDB?

    In a scenario where your database is corrupted now, any full/diff backup taken after a successful CHECKDB “can be corrupted”.

    But t-log backups “describe” the changes made to those pages, so an I/O problem that corrupted a data page probably will not be present in the t-log (if it didn’t raise any errors when the backup was made), and restoring the t-log will not replicate the corruption (excluding the cases you mentioned).

    Is all this the correct practice when restoring to recover a corrupted database?

    1. Not necessarily – the most recent backup with no corruption may be after the last successful CHECKDB – just depends when the corruption occurred.

      Yes – any data backup after a corruption will contain corruption.

      Yes – your statement about t-log backups is correct.

      Yes – restore the most recent full with no corruption, the most recent diff with no corruption (if one exists), and then the log backups (a sketch of that sequence is below).
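      A sketch of that restore sequence, with hypothetical database and backup file names:

      ```sql
      -- Most recent full backup with no corruption
      RESTORE DATABASE YourDatabase
      FROM DISK = N'X:\Backups\YourDatabase_full.bak'
      WITH NORECOVERY;

      -- Most recent differential with no corruption, if one exists
      RESTORE DATABASE YourDatabase
      FROM DISK = N'X:\Backups\YourDatabase_diff.bak'
      WITH NORECOVERY;

      -- Then each subsequent log backup, in order
      RESTORE LOG YourDatabase
      FROM DISK = N'X:\Backups\YourDatabase_log1.trn'
      WITH NORECOVERY;

      -- Finally, bring the database online
      RESTORE DATABASE YourDatabase WITH RECOVERY;
      ```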

  3. I have personally had corruption propagate from one server to another. This was under 4.21a – yes, I’m old – and our DEC ALPHA 2100s had mismatched processor speeds. I home-brewed my tranlog backup/restore process and corruption would occur on the primary and make its way into the tranlog, where it was then picked up and replayed on the backup. Of course, this was all before the current checksum setting and so on, but I have had this happen. It took 3 DEC senior engineers to figure out the problem, because the processors were all correctly part marked, but the clock speed on one CPU was wrong, and when more than 1 CPU was used, it seems this is where the corruption would occur.

  4. Is it ok (and does it make sense) to keep just the physical-only check on the primary (to check for physical corruption) and offload the full check to the secondary? Does a clean full check on secondary ensure there is no logical corruption on the primary as well (since logical corruption propagates from the primary to the secondary)? And from the other angle – does a logical corruption error on the secondary necessarily mean there’s logical corruption on the primary as well or could it be on the secondary only?

    Also, can you offload the full check to a restored backup as well (keep just the physical check on the actual “live” DB) and be certain that a clean check on the backup means there’s no logical corruption on the “live” DB?

    A specific scenario (to keep just the physical-only check on the primary and offload the full check to the secondary/backup) I had in mind is: A large database (5+ TB); during a full CHECKDB you’re hitting error 665. A physical-only CHECKDB goes through though.

    1. No – I wouldn’t do that, because of the possibility of a physical corruption in a page being written to disk on the primary with a good page checksum. You need to do a full CHECKDB on the primary, or on a restored copy of it, at some regular interval. As long as the backup was taken of the primary DB and the check comes back clean, you know that when the backup was taken, there was no corruption.
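      As a sketch of that split, with hypothetical names – a lightweight physical-only check can run frequently, but a periodic full check still has to run against the primary or against a restored copy of a backup of the primary:

      ```sql
      -- Frequent: physical-only checks (page checksums, torn pages, allocation consistency)
      DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;

      -- Periodic: the full logical + physical check, on the primary or a restored copy of it
      DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
      ```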

      1. Thanks for the reply.

        I have another question: Does the same rule (“As long as the backup was taken of the primary DB, and it comes back clean, you know that when the backup was taken, there was no corruption.”) apply if the backup tool is MS DPM (Data Protection Manager), using “snapshot” backups? The restore then re-creates the DB files.

          1. “The entire database” in what sense – as opposed to just some (but not all) filegroups?

            The “snapshot” backups only back up the differences in the DB file, afaik (so it can be really fast (minutes or even seconds) even on VLDBs). I’m sure my understanding is way oversimplified :).
            If DPM does that for all database files, is that what you meant as “the entire database”?

  5. Hi Paul. Thanks for this article. I have a customer who runs CHECKDB on their secondary but performs their backup on their primary, using the WITH CHECKSUM option.

    Your article here:
    https://www.sqlskills.com/blogs/paul/a-sql-server-dba-myth-a-day-2730-use-backup-with-checksum-to-replace-dbcc-checkdb/

    …explains why you can’t rely on CHECKSUM as an alternative to CHECKDB, and you describe 3 reasons:

    1 – Database upgraded from 2000 – NOT AN ISSUE FOR THEM
    2 – Frequency of full-backups – THEY TAKE WEEKLY FULL AND DAILY DIFFS
    3 – In memory corruptions – ???

    My question is – would these in-memory corruptions be propagated to the secondary, and therefore be caught by CHECKDB running on the secondary? If yes, does that mean that their current solution of backing up on the primary WITH CHECKSUM, and running CHECKDB on the secondary, is ‘safe’? I’m conscious that you very clearly state “Bottom line – you can’t avoid running consistency checks.”, however I’d like to better understand why. Would a better solution be to take the backup on the secondaries, so that at least they’re backing up the database that has had CHECKDB run against it?

    1. It depends on what the corruption is. If a corruption changes a column value, which is then logged, then it will propagate. If a corruption changes something and it isn’t logged, and the page is changed in some other way so it will be written back to disk as a dirty page, that corruption will not propagate and the page on disk will have a correct checksum. That’s why there’s no alternative to running checkdb on all copies in an AG. Yes, it is always better to do backups on a copy that has checkdb running on it, but then they can’t take diff backups.

