The Curious Case of… the 8060-byte row size limit

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or was asked in a random question from the community.)

Last week I had a question from someone who wondered about the 8060-byte row size limit and why data file pages sometimes show more than 8,060 bytes free when looked at using DBCC PAGE.
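If you want to look at a page yourself, here’s a minimal sketch, using a hypothetical database and table name (and bear in mind that DBCC PAGE and sys.dm_db_database_page_allocations are both undocumented, so only play with them on a test system):

-- Send DBCC PAGE output to the client
DBCC TRACEON (3604);
GO

-- Find some data pages for the table (DBCC IND is the older alternative)
SELECT allocated_page_file_id, allocated_page_page_id, page_type_desc
FROM sys.dm_db_database_page_allocations (
    DB_ID (N'MyTestDB'), OBJECT_ID (N'MyTestTable'), NULL, NULL, N'DETAILED')
WHERE page_type_desc = N'DATA_PAGE';
GO

-- Dump one of them (the file ID and page ID here are placeholders for values
-- from the query above); print option 3 shows the page header, including
-- m_freeCnt (the free-byte count the question was about), plus each record
-- and its size
DBCC PAGE (N'MyTestDB', 1, 256, 3);
GO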

Before I explain, let me make it clear that the 8,060-byte limit is just for the part of the record that’s stored on the data or index page in the ‘in-row data’ allocation unit; a record can have many LOB columns that are stored off-row in a ‘LOB data’ allocation unit (e.g. varchar(max) or FILESTREAM columns) and/or non-LOB, variable-length columns that have been pushed off-row to be stored in a ‘row overflow’ allocation unit (e.g. varchar(1-8000), nvarchar(1-4000), or sql_variant columns), and so the actual row size is essentially unlimited.
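As an aside, if you’re curious how much of a table lives in each of these allocation unit types, the documented sys.dm_db_partition_stats DMV breaks it out nicely; here’s a quick sketch, with a hypothetical table name:

-- Page counts per allocation unit type for one table
SELECT
    OBJECT_NAME (object_id) AS TableName,
    index_id,
    in_row_data_page_count,
    row_overflow_used_page_count,
    lob_used_page_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID (N'MyTestTable');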

I could rephrase the original question as this: given that a page is 8,192 bytes, and only 96 bytes are used for the page header, why is the in-row record size limited to 8,060 bytes? What about the other 36 bytes of available space?

The answer is that an 8,060-byte row can actually take up a lot more space.

Firstly, there’s the 2-byte row offset that’s stored at the end of the page and tells the Storage Engine where the row starts on the page. So that’s 8,062 bytes used, 34 bytes remaining.

Secondly, if the row is on a heap data page, and it’s a forwarded record, there’s a 10-byte back-pointer (to the original location of the record) tacked on to the end of the record. This is used in case the record has to move again, and then the forwarding stub record (in the original location of the record) is updated to point to the new location of the forwarded record. So in that case, that’s 8,072 bytes used, 24 bytes remaining.

Thirdly, if the row has a prior version, because it’s been updated since either read committed snapshot isolation or snapshot isolation was enabled for the database, it will have a 14-byte versioning tag tacked on to the end of the record, containing the location of the previous version of the record in the version store in tempdb and the versioning timestamp when the current version of the record was created. So in that case, that’s 8,076 bytes used, 20 bytes remaining.
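As a quick aside, you can easily check whether a database has either form of versioning enabled (and so whether its updated records will be picking up the 14-byte tag); a quick sketch, with a hypothetical database name:

SELECT name, is_read_committed_snapshot_on, snapshot_isolation_state_desc
FROM sys.databases
WHERE name = N'MyTestDB';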

Lastly, it may be a heap forwarded record that also has a prior version, so has the 10-byte back-pointer and the 14-byte versioning tag tacked on the end. So in that case, that’s 8,086 bytes used, 10 bytes remaining.

And those 10 bytes are available for any future features to use. Some of them are probably used by the on-page versioning that Accelerated Database Recovery does in SQL Server 2019, but I haven’t investigated that in any depth yet.

So there you have it: an 8,060-byte record may actually need 8,086 bytes or more of physical storage because of various Storage Engine features.
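And if you want to see the limit itself being enforced, here’s a tiny sketch with a hypothetical table. Because these are fixed-width columns, the row could never fit in-row and the columns can’t be pushed off-row, so the CREATE statement itself should fail with an error complaining that the row size exceeds the maximum allowable 8,060 bytes.

CREATE TABLE dbo.BigRowTest (
    c1 CHAR (5000) NOT NULL,
    c2 CHAR (4000) NOT NULL  -- 9,000+ bytes of fixed-width data in one row
);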

The Curious Case of… whether corruption can propagate to secondary databases

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or was asked in a random question from the community.)

We’ve had a few new clients come to us recently after experiencing corruption, and they’ve been worried about whether physical corruption can propagate to secondary databases (like an availability group secondary or log shipping secondary) through the mechanism used to maintain the secondary database in sync with the primary database. I explained how it’s very rare for that to happen, but sometimes it does, and not in a way you’d think. Read on…

Physical corruptions are mostly caused by something in the I/O subsystem – all the hardware and software underneath SQL Server, including the OS and filter drivers. I literally mean *anything* under SQL Server’s buffer pool in the I/O stack. Corruptions can happen in the data files or the log file.

Secondary copies are maintained by shipping log blocks, containing log records, and replaying the log records. For availability groups and database mirroring, there’s a constant copy-from-primary-and-replay-on-secondary process going on, and for log shipping, it’s a back-up-log-copy-restore process – but it’s essentially the same thing: replaying log records.

Apart from the initial full backup that is taken of the primary database and used to create the secondary database, data file pages are never subsequently copied to a secondary database when using availability groups or mirroring. They may be copied during log shipping, if an index operation is performed in the bulk-logged recovery model. Note that if the initial full backup backs up a corrupt database, the secondary copies initialized from that full backup will of course be corrupt in the same way. But that’s not what we’re talking about here.

Let’s assume that the database in question was created with page checksums enabled, as it’s been the default for 15 years now.
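It’s worth verifying rather than assuming, though – a database that was created on SQL Server 2000 and upgraded, or had the setting changed at some point, may not be using checksums. A quick check and fix (the database name is hypothetical):

-- Which databases aren't using CHECKSUM page verification?
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE page_verify_option_desc <> N'CHECKSUM';

ALTER DATABASE [MyProdDB] SET PAGE_VERIFY CHECKSUM;

Note that changing the setting only adds checksums to pages as they’re subsequently written, not retroactively.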

For corruptions caused by the I/O subsystem, there are several cases to consider:

  1. Data file page corrupted. When the page is read back into memory, the page checksum will be discovered to be invalid and an 824 error will result. The only possible way a data file page can be copied to a secondary is in the log shipping case I explained above, and as long as the BACKUP LOG statement uses the CHECKSUM option, the corruption will be discovered and the backup will fail.
  2. Log block corrupted. The log reader will notice the block is corrupt when reading it, or the log replay on the secondary will fail.
  3. Log backup is corrupted before replay on a secondary. As long as the BACKUP LOG and RESTORE LOG statements both use the CHECKSUM option, this will be discovered and the restore will fail.

Basically, a corruption caused by the I/O subsystem will not propagate, as long as page checksums are enabled.
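For completeness, here’s roughly what those CHECKSUM options look like; the database name, backup path, and file name are all hypothetical:

-- On the primary / log shipping source
BACKUP LOG [MyProdDB]
TO DISK = N'\\backupshare\MyProdDB_log.trn'
WITH CHECKSUM, INIT;

-- On the log shipping secondary
RESTORE LOG [MyProdDB]
FROM DISK = N'\\backupshare\MyProdDB_log.trn'
WITH CHECKSUM, NORECOVERY;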

THIS IS WHY YOU CAN’T OFFLOAD CONSISTENCY CHECKS TO A SECONDARY COPY!

Yes, I shouted that. Running DBCC CHECKDB on a secondary copy tells you nothing about the primary copy, or any other secondary copy. You have to consistency check *all* copies of the database.
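Something along these lines, scheduled against every copy – the primary and all the secondaries (the database name is hypothetical):

DBCC CHECKDB (N'MyProdDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;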

Now let me worry you a bit…

There is a way for corruption to propagate to all copies – if a page gets physically corrupted in the buffer pool by bad memory or a SQL Server bug, it will then be written to disk with a valid page checksum. When it’s read back in to memory, the page checksum will still be valid, and the page will still contain the corruption. If that corruption happens to be in a column value, and that column value is used to calculate *another* value, which is persisted in the database, that incorrect value will be logged, and replayed on all the secondary copies.
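To make the kind of dependency I mean concrete, here’s a hypothetical schema sketch:

CREATE TABLE dbo.Orders (
    OrderID      INT IDENTITY PRIMARY KEY,
    Price        DECIMAL (10, 2) NOT NULL,
    -- If Price is corrupted in the buffer pool and the row is later updated,
    -- the recalculated PriceWithTax is derived from the corrupt value, and
    -- that's what gets persisted, logged, and replayed on the secondaries -
    -- with perfectly valid page checksums throughout
    PriceWithTax AS (Price * 1.08) PERSISTED
);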

I call this ‘second-order corruption propagation’. I’ve never seen it in the wild, but it’s theoretically possible.

Is there anything you can do about the possibility of it? No. You might think of ECC memory chips and error 832 from SQL Server 2012 onward, but that’s only checking whether *unchanged* pages in the buffer pool have been corrupted by bad memory. Once a page is changed in the buffer pool, the 832 checks don’t occur for that page.

But the chances of this happening in such a way that it is able to propagate are vanishingly small, so I wouldn’t worry about it.

Bottom line: my answer when asked whether physical corruption can propagate is this: as long as you’re using page checksums and backup/restore checksums, no.

PS Logical corruption is a different matter. If SQL Server messes up something like an index rebuild or an indexed view, that logical corruption will absolutely flow through to the secondary copies. You have to run DBCC CHECKDB everywhere!

The Curious Case of… restoring a table into a different database

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or was asked in a random question from the community.)

On an email distribution list I’m on, someone wondered today about the possibility of restoring a table into a different database that already has data in it. It’s an interesting thought exercise, so I wrote up a response for the list but then figured it would make a neat blog post.

Restoring a table into a different database that already has data in it is actually a lot harder than it may seem.

There are a number of problems, including:

  1. What if some of the allocation unit IDs used by the table are already in use in the new database?
  2. What if some of the page IDs used by the table are already in use in the new database?
  3. What if the table is involved in foreign-key constraints?
  4. What if the table contains FILESTREAM data and some of the FILESTREAM GUIDs are already in use?
  5. It would have to pull out any UDTs and other metadata related to the table

I can think of ways around these, but #2 is very challenging, as it would involve changing:

  • Page locations themselves, so updating the page ID in the header
  • Page linkages in the doubly-linked list at each index level
  • Page linkages in index non-leaf pages
  • Page linkages in heap forwarding pointers, and the back-pointer in each forwarded record
  • Page linkages to off-row text pages
  • Page linkages within text trees
  • IAM page contents
  • Keeping a mapping of old page ID to new page ID so the transaction log can be applied, and IAM page updates would be especially difficult, as would LOP_MODIFY_COLUMNS (and anything touching the page header) as it would have to figure out whether a page pointer is being set or changed
  • Doing all this in such a way that if the restore is interrupted, the database isn’t irretrievably corrupt

Restore of a single table into a new database would be a lot easier, but still challenging if you want to reduce disk space, unless you restore into the sparse equivalent of a database snapshot (otherwise if you have a table that’s, say, 10GB, but one of its pages is at offset 400GB in a data file, you’d need 400GB of space for the restored file).

Bottom line: I wouldn’t hold your breath for either of these being implemented in SQL Server!

PS: Brent pointed out that non-SQL Server solutions like LiteSpeed do single table restore into a different database, which I didn’t know – I thought they only did that into the same database, and from reading the docs, they don’t do what I said above either, as it would be unfeasible. They extract the table pages from the backup into a separate location, and then use the table metadata to essentially ‘select’ from the temporary location and insert into a new SQL Server table, which avoids all the messy page ID manipulation. Pretty cool!
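For anyone without a third-party tool, the closest native-tools equivalent I can think of is to restore the backup as a differently-named database and copy the one table across – much cruder, as it needs space and time for the whole database restore, but the final step is essentially the same ‘select and insert’ idea. All the names and paths below are hypothetical:

RESTORE DATABASE [MyProdDB_Copy]
FROM DISK = N'D:\Backups\MyProdDB_full.bak'
WITH
    MOVE N'MyProdDB' TO N'D:\Data\MyProdDB_Copy.mdf',
    MOVE N'MyProdDB_log' TO N'D:\Data\MyProdDB_Copy_log.ldf',
    RECOVERY;

INSERT INTO [TargetDB].dbo.MyTable
SELECT * FROM [MyProdDB_Copy].dbo.MyTable;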