(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)
We’ve had a few new clients come to us recently after experiencing corruption, and they’ve been worried about whether physical corruption can propagate to secondary databases (like an availability group secondary or log shipping secondary) through the mechanism used to maintain the secondary database in sync with the primary database. I explained how it’s very rare for that to happen, but sometimes it does, and not in a way you’d think. Read on…
Physical corruptions are mostly caused by something in the I/O subsystem – all the hardware and software underneath SQL Server, including the OS and filter drivers. I literally mean *anything* under SQL Server’s buffer pool in the I/O stack. Corruptions can happen in the data files or the log file.
Secondary copies are maintained by shipping log blocks, containing log records, and replaying the log records. For availability groups and database mirroring, there’s a constant copy-from-primary-and-replay-on-secondary process going on, and for log shipping, it’s a back-up-log-copy-restore process – but it’s essentially the same thing: replaying log records.
Apart from the initial full backup that is taken of the primary database and used to create the secondary database, data file pages are never subsequently copied to a secondary database when using availability groups or mirroring. They may be copied during log shipping, if an index operation is performed in the bulk-logged recovery model. Note that if the initial full backup backs up a corrupt database, the secondary copies initialized from that full backup will of course be corrupt in the same way. But that’s not what we’re talking about here.
Let’s assume that the database in question was created with page checksums enabled, as it’s been the default for 15 years now.
For corruptions caused by the I/O subsystem, there are several cases to consider:
- Data file page corrupted. When the page is read back into memory, the page checksum will be discovered to be invalid and an 824 error will result. The only possible way a data file page can be copied to a secondary is in the log shipping case I explained above, and as long as the BACKUP LOG statement uses the CHECKSUM option, the corruption will be discovered and the backup will fail.
- Log block corrupted. The log reader will notice the block is corrupt when reading it, or the log replay on the secondary will fail.
- Log backup is corrupted before replay on a secondary. As long as the BACKUP LOG and RESTORE LOG statement both use the CHECKSUM option, this will be discovered and the restore will fail.
Basically, a corruption caused by the I/O subsystem will not propagate, as long as page checksums are enabled.
THIS IS WHY YOU CAN’T OFFLOAD CONSISTENCY CHECKS TO A SECONDARY COPY!
Yes, I shouted that. Running DBCC CHECKDB on a secondary copy tells you nothing about the primary copy, or any other secondary copy. You have to consistency check *all* copies of the database.
Now let me worry you a bit…
There is a way for corruption to propagate to all copies – if a page gets physically corrupted in the buffer pool by bad memory or a SQL Server bug, it will then be written to disk with a valid page checksum. When it’s read back in to memory, the page checksum will still be valid, and the page will still contain the corruption. If that corruption happens to be in a column value, and that column value is used to calculate *another* value, which is persisted in the database, that incorrect value will be logged, and replayed on all the secondary copies.
I call this ‘second-order corruption propagation’. I’ve never seen it in the wild, but it’s theoretically possible.
Is there anything you can do about the possibility of it? No. You might think of ECC memory chips and error 832 from SQL Server 2012 onward, but that’s only checking whether *unchanged* pages in the buffer pool have been corrupted by bad memory. Once a page is changed in the buffer pool, the 832 checks don’t occur for that page.
But the chances of this happening in such a way that it is able to propagate are vanishingly small, so I wouldn’t worry about it.
Bottom line: my answer when asked whether physical corruption can propagate is this: as long as you’re using page checkums and backup/restore checksums, no.
PS Logical corruption is a different matter. If SQL Server messes up something like an index rebuild or an indexed view, that logical corruption will absolutely flow through to the secondary copies. You have to run DBCC CHECKDB everywhere!