PFS corruption after upgrading from SQL Server 2014

I’m seeing reports from a few people of DBCC CHECKDB reporting PFS corruption after an upgrade from SQL Server 2014 to SQL Server 2016 or later. The symptoms are that you run DBCC CHECKDB after the upgrade and get output similar to this:

Msg 8948, Level 16, State 6, Line 5
Database error: Page (3:3863) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.
Msg 8948, Level 16, State 6, Line 5
Database error: Page (3:3864) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.
CHECKDB found 2 allocation errors and 0 consistency errors not associated with any single object.
CHECKDB found 2 allocation errors and 0 consistency errors in database 'MyProdDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (MyProdDB).

I’ve discussed with the SQL Server team and this is a known bug in SQL Server 2014.

The problem can occur if an ALTER INDEX … REORGANIZE is performed in a transaction and then rolled back, one of the affected extents can have some of its pages marked with the wrong PFS status. This state is valid in SQL Server 2014, but if one of the upgrade steps happens to move one of these pages, DBCC CHECKDB on the new version will complain with the errors above.

Note: this is not a bug in DBCC CHECKDB :-)

The fix for this issue is to run DBCC CHECKDB (yourdb, REPAIR_ALLOW_DATA_LOSS) and that will fix the PFS state. From anecdotal evidence, you might need to run repair twice. Repair will simply fix the PFS status, not deallocate/delete anything.

If you experience this issue, the SQL Server team requests that you contact CSS so they know how many people are hitting the issue and they may ask for access to the database to aid with developing a fix.

I’ll update this post when I get more information – at present (9/26/18) there is no fix available apart from running repair.

 

Bug: database/server ‘shutdown due to error 3314’

[Edit 3/12/2019: it looks like this bug, or a similar one, is also in SQL Server 2016. I’ll post an update once I have it.]

Over the years I’ve discussed log space reservation, which is when SQL Server automatically reserves some free space in the transaction log so that in-flight transactions can always be rolled back without the log having to grow. This is because rolling back a transaction requires generating more log records, and so there needs to be guaranteed space for them. If this did not happen, the log could fill up, in-flight transactions would begin to roll back, a log grow attempt might fail, and the database then goes suspect or into recovery because the in-flight transactions are essentially stuck.

The algorithm is pretty conservative, and I fixed a few bugs in it before SQL Server 2005 shipped.

There hasn’t been a case of it failing to reserve enough space until SQL Server 2012, when a bug was introduced. That bug was discovered by someone I was working with in 2015 (which shows just how rare the circumstances are), and at the time it was thought that the bug was confined to the log of tempdb filling up, rollback failing, and the server shutting down.

However, just last week I was contacted by someone running SQL Server 2012 SP3 who’d seen similar symptoms but for a user database this time, and the user database went into recovery. An example of the error log messages is below (altered for anonymity):

During under of a logged operation in database 'mydb', an error occurred at log record ID (2445:89001:23). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
The log for database 'mydb' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
Error during rollback, shutting down database (location: 1)
Database mydb was shutdown due to error 3314 in routine 'XdesRMReadWrite::RollbackToLsn'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.
The transaction log for database 'mydb' is full due to 'ACTIVE_TRANSACTION'.
The transaction log for database 'mydb' is full due to 'ACTIVE_TRANSACTION'.
The transaction log for database 'mydb' is full due to 'ACTIVE_TRANSACTION'.
'D:\Logs\mydb.ldf: Operating system error 112(There is not enough space on the disk.) encountered.

I suggested that they’d hit the known bug, and they confirmed that with Microsoft.

And if you hit it for tempdb, the server will shut down, as tempdb being unavailable means SQL Server has no choice but to stop.

The bug is described in KB article 2963384 and is included in SQL Server 2012 SP4 and SQL Server 2014 SP1. If you’re running 2012 SP3 then you should install SP4, and if you’re running 2014 RTM then you should install the latest 2014 SP.

I didn’t blog about the bug back in 2015 as only one person had hit it and the circumstances seemed incredibly rare, but now that seems to not be the case.

Stay safe out there!

Bug: Error: 3449 and server restart during DBCC CHECKDB

This is a quick post to let you know about a bug that a few people are hitting when running DBCC CHECKDB. The symptoms are a series of errors in the error log plus SQL Server forcibly shuts itself down and restarts. I’ve heard of people hitting the bug on SQL Server 2014 and SQL Server 2012 SP1.

[Update 2/24/15] Microsoft has confirmed that it’s a bug in 2014 and 2012 and they’re planning a fix for the next CU of both.

[Update 3/18/15] The fix for SQL Server 2012 is in CU5 of SP2 and is described in KB 3044958.

The error log symptoms look something like I show below:

2014-11-18 09:04:15.69 spid64      The operating system returned error 665(The requested operation could not be completed due to a file system limitation) to SQL Server during a write at offset 0x00001EE6FD8000 in file 'C:\SQLskills\\Company_file2.ndf:MSSQL_DBCC23'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2014-11-18 09:04:15.72 spid64      Error: 3314, Severity: 17, State: 3.
2014-11-18 09:04:15.72 spid64      During undoing of a logged operation in database 'Company', an error occurred at log record ID (887654:3321:14). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
2014-11-18 09:04:34.38 spid64      Error: 831, Severity: 20, State: 1.
2014-11-18 09:04:34.38 spid64      Unable to deallocate a kept page.
2014-11-18 09:04:34.40 spid64      Error: 3449, Severity: 21, State: 1.
2014-11-18 09:04:34.40 spid64      SQL Server must shut down in order to recover a database (database ID 23). The database is either a user database that could not be shut down or a system database. Restart SQL Server. If the database fails to recover after another startup, repair or restore the database.

The 665 error is from the snapshot file that DBCC CHECKDB creates hitting an NTFS limitation on the number of file fragments in a sparse file. This causes the snapshot creation to fail. The failure causes the undo of a log record in the snapshot to fail (remember that a database snapshot undergoes crash recovery to make it transactionally consistent). This failure then leads to SQL Server thinking it has to forcibly restart to recover the snapshot database, which is should never do for a snapshot – and that’s the bug.

I’ll update this post as soon as I hear about the builds that the fix is in.

This is a rare bug to hit, but it’s a regression (from builds people are reporting), and you can help yourself to avoid it by:

  • Creating your own database snapshot, on a volume without file-system free space fragmentation, and running DBCC CHECKDB against the snapshot
  • OR, trying to run DBCC CHECKDB when there isn’t a significant amount of change occurring in the database, so the database snapshot doesn’t have to become very large
  • OR, using the backup-copy-restore-check method of running DBCC CHECKDB on a restored backup of the database on another server

Thanks