One of the hottest features in SQL Server 2005 is database mirroring, and it's helped many companies implement successful and relatively inexpensive high-availability strategies. In SQL Server 2008, Database Mirroring has been enhanced in several ways – one of which is the ability to automatically repair corrupt pages!
This feature is based on the fact that the principal and mirror databases are exactly the same. So, if a page becomes corrupt on the principal, SQL Server should be able to read the page from the mirror and use it to fix the principal. Similarly, if a page becomes corrupt on the mirror, the page can be read from the principal to fix up the mirror. Pretty cool, eh?
- The feature works for pages that have 824 errors, 823 errors where the OS returns a CRC error while reading the page (to prevent resource issues triggering a page repair), and pages that have 829 errors (where the page is marked as restore pending). See my previous post here for more details on page errors.
- Pages are fixed asynchronously.
- If the page is corrupt in the principal, the query that hit the corrupt page will fail. Once discovered, a page is marked as being 829 until its fixed. This prevents an issue where a transient disk error could allow a subsequent update to change the page after it's been queued for being repaired, and then the page is overwritten with a copy from the mirror, losing the update. Nasty.
- If the page is corrupt in the mirror (which is discovered when the page is read as part of the continual recovery of the log), the mirroring session is suspended. The mirror keeps track of all corrupt pages that need to be repaired with copies from the principal. Once all corrupt pages have been repaired, the mirroring session will be resumed automatically. This means that if a page is corrupt in both the mirror and the principal, manual intervention will be required to resolve the issue.
- The feature is available in Standard and Enterprise Editions.
- There is a new DMV – sys.dm_db_mirroring_auto_page_repair – that allows you to track corrupt pages in mirrored databases
- It covers all mirrored databases on a server.
- It provides info on the last 100 pages that were found in any mirrored database, as well as the status of the automatic page repair operation.
- This DMV isn't yet included in the SQL Server 2008 Books Online available on TechNet but will have the following info:
- Database ID the page is in
- The Page ID, split into file and page-in-file
- The error type – distinguishing between 823 errors, torn-page errors, page checksum failures, and all-other-824 errors
- The status of the page repair operation
- The time that the status was last updated
- If a page repair fails for any reason (e.g. the mirroring partner couldn't supply the page) then the repair will be marked as failed. If the page is then hit again (by a query on the principal or a recovery operation on the mirror) then it will be re-queued for repair.
- Not all pages can be repaired – the file header page, database boot page, and allocation bitmap pages (GAM, SGAM, PFS) cannot be repaired this way.
I've got a system with a few SQL Server 2008 instances running so I decided to give it a try. Here's what I did, using the TicketSalesDB from the Always-On DVDs and labs that SQLskills produces:
- Changed the database to use page checksums
- Rebuilt the clustered index of one of the tables (so the pages have page checksums on)
- Took the initial backup of the database that's needed for mirroring (so that it has no corruptions in)
- Used DBCC IND to find a page in the leaf level of the clustered index and corrupted the page
- Setup mirroring (but using the backup I took before introducing the corruption – so the mirror database will have a clean copy of the page I corrupted)
- Performed a query on the table with the corrupt page to force the page checksum failure and kick-off automatic page repair
And it worked! Going back into the instance with the principal database and querying the DMV gives:
C:\>sqlcmd /E /S.\KATMAI01
1> SELECT * FROM sys.dm_db_mirroring_auto_page_repair;
database_id file_id page_id error_type page_status modification_time
———– ———– ——————– ———- ———– ———————–
6 4 4256 -1 5 2007-09-27 17:23:20.067
(1 rows affected)
The page I corrupted was (4:4256) and page_status of 5 means the repair succeeded. Running the query again confirms that the corruption has been fixed. The page was also logged in the suspect_pages table in msdb:
1> SELECT * FROM msdb..suspect_pages;
database_id file_id page_id event_type error_count last_update_date
———– ———– ——————– ———– ———– ———————–
6 4 4256 5 1 2007-09-27 17:23:20.407
(1 rows affected)
In Books Online (both 2005 and 2008), event_type of 5 means that the page was repaired.
I checked the SQL Server error log for the principal and this is what I found (the bolding is mine):
2007-09-27 17:17:10.41 spid25s Database mirroring is active with database 'TicketSalesDB' as the principal copy. This is an informational message only. No user action is required.
2007-09-27 17:23:19.92 spid51 Error: 824, Severity: 24, State: 2.
2007-09-27 17:23:19.92 spid51 SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x55684fbe; actual: 0x16e84fbe). It occurred during a read of page (4:4256) in database ID 6 at offset 0x00000002140000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.6\MSSQL\Data\TicketSalesFG2005Q1.NDF'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2007-09-27 17:23:19.96 spid26s Database mirroring is attempting to repair physical page (4:4256) in database "TicketSalesDB" by requesting a copy from the partner.
2007-09-27 17:23:20.42 spid26s Database mirroring successfully repaired physical page (4:4256) in database "TicketSalesDB" by obtaining a copy from the partner.
I also checked the error log for the mirror and there's nothing relevant in there. I guess the same info would be output to the mirror database's error log if the mirror became corrupted. It's reasonably simple to check this using similar steps as above, but corrupting the database before the initial backup is taken, fixing the corruption again after the backup and before mirroring starts (so the corruption is on the mirror but not the principal), and then triggering an update on the corrupt page. When the update is replayed on the mirror, the corruption will be hit and the page repaired in the same way. I'll check later and blog if there's any difference.
SQL Server 2008 introduces an enhancement to database mirroring that can bi-directionally pull pages between the principal and mirror databases to fix page corruptions. One word of caution I'd give is that this feature doesn't mean you can ignore these errors when they occur – you still need to do root-cause analysis on the corruption and take steps to prevent them happening again before a corruption occurs that automatic page repair cannot fix (as I mentioned above).
Nevertheless, this is a tremendously useful feature that's going to save a lot of downtime. Cool!