Inside the Storage Engine: Anatomy of a record

This week I’m going to post a bunch of info on the basic structures used to store data and track allocations in SQL Server. A bunch of this was posted back when I started blogging at TechEd 2006 but I want to consolidate/clarify info and add more about using DBCC PAGE to examine the various structures.

So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that…

Data records

  • Data records are stored on data pages.
  • Data records store rows from a heap or the leaf level of a clustered index.
  • A data record always stores all columns from a table row – either by-value or by-reference.
    • If any columns are for LOB data types (text, ntext, image, and the new LOB types in SQL Server 2005 – varchar(max), nvarchar(max), varbinary(max), XML), then there’s a pointer stored in the data record which points to a text record on a different page (the root of a loose tree that stores the LOB value). Exceptions to this are when the schema has been set to store LOB columns ‘in-row’ when possible. This is when a LOB value is small enough to fit within the size limits of a data record. This is a performance benefit as selecting the LOB column does not require an extra IO to read the text record.
    • In SQL Server 2005, non-LOB variable length columns (e.g. varchar, sqlvariant) may also be stored ‘off-row’ as part of the row-overflow feature of having table rows longer than 8060 bytes. In this case the storage format is the same as for LOB values – a pointer in the data record pointing to a text record.
  • There is a difference to how the columns are laid out between heaps and clustered indexes
    • In a heap, the columns are laid out in the order specified in the CREATE TABLE statement (obviously divided into fixed-length and variable-length)
    • In a clustered index, the cluster keys become the first physical columns in the record, followed by the rest in the order specified in the CREATE TABLE statement

Forwarded/Forwarding records

  • These are technically data records and are only present in a heap.
  • A forwarded record is a data record in a heap that was updated and was too large to fit in-place on its original page and so has been moved to another page. It contains a back-pointer to the forwarding record.
  • A forwarding record is left in its place and points to the new location of the record. It’s sometimes known as a forwarding-stub, as all it contains is the location of the real data record.
  • This is done to avoid having to update any non-clustered index records that point back directly to the original physical location of the record.
  • Although this optimizes non-clustered index maintenance during updates, it can cause additional I/Os during SELECTs. This is because the non-clustered index record points to the old location of the index, so an extra IO might be needed to read the real location of the data row. This is fuel for the heap vs clustered index debate, in favor of clustered indexes.

Index records

  • Index records are stored on index pages.
  • There are two types of index records (which differ only in what columns they store):
    1. Those that store non-clustered index rows at the leaf level of a non-clustered index
    2. Those that comprise the b-tree that make up clustered and non-clustered indexes (i.e. in index pages above the leaf level of a clustered or non-clustered index)
  • I’ll explain more about the differences between these in a later post as it can be quite complicated (especially the differences between SQL Server 2000 and 2005) and is worth doing in separate posts.
  • Index records typically do not contain all the column values in a table (although some do – called covering indexes).
  • In SQL Server 2005, non-clustered index records can include LOB values as included columns (with the storage details exactly the same as for data records) and also can have row-overflow data that is pushed off-row (again, in exactly the same way as for data records).

Text records

  • Text records are stored on text pages.
  • There are various types of text records that comprise the tree structure that stores LOB values, stored on two types of text page. I’ll explain how they work and are linked together in a future post.
  • They are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.

Ghost records

  • These are records that have been logically deleted but not physically deleted from a page. The reasons for this are complicated, but basically having ghost records simplifies key-range locking and transaction rollback.
  • The record is marked with a bit that indicates it’s a ghost record and cannot be physically deleted until the transaction that caused it to be ghosted commits. Once this is done, it is deleted by an asynchronous background process (called the ghost-cleanup task) or it is converted back to a real record by an insert of a record with the exact same set of keys.

Other record types

  • There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and per-database boot page). Again, I’ll go into these in later posts (there’s a big queue of posts building up :-))

Record structure

All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.

The record structure for non-compressed records is as follows:

  • Record header
    • 4 bytes long
    • Two bytes of record metadata (record type)
    • Two bytes pointing forward in the record to the null bitmap
  • Fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
  • Null bitmap
    • Two bytes for count of columns in the record
    • Variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
    • This allows an optimization when reading columns that are NULL
  • Variable-length column offset array
    • Two bytes for the count of variable-length columns
    • Two bytes per variable length column, giving the offset to the end of the column value
  • Versioning tag
    • This is in SQL Server 2005 only and is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb

NULL bitmap optimization

So why is the null bitmap an optimization?

Firstly, having a null bitmap removes the need for storing special NULL values for fixed-length datatypes. Without the null bitmap, how can you tell whether a column is NULL? For fixed-length columns you’d need to define a special NULL value, which limits the effective range of the datatype being stored. For varchar columns, the value could be a zero-length empty string, so just checking the length doesn’t work – you’d need the special value again. For all other variable-length data types you can just check the length. So, we need the NULL bitmap.

Secondly, it saves CPU cycles. If there was no null bitmap, then there are extra instructions executed for fixed-length and variable-length columns.

For fixed-length:

  1. Read in the stored column value (possibly taking a CPU data cache miss)
  2. Load the pre-defined NULL value for that datatype (possibly taking a CPU data cache miss, but only for the first read in the case of a multiple row select)
  3. Do a comparison between the two values

For variable-length:

  1. Calculate the offset of the variable length array
  2. Read the number of variable length columns (possibly taking a CPU data cache miss)
  3. Calculate the position in the variable length offset array to read
  4. Read the column offset from it (possibly taking a CPU data cache miss)
  5. Read the next one too (possibly taking another CPU data cache miss, if the offset in step 4 was on the boundary of a cache line size)
  6. Compare them to see if they’re the same

But with a null bitmap, all you have to do is:

  1. Read the null bitmap offset (possibly taking a CPU data cache miss)
  2. Calculate the additional offset of the NULL bit you want to read
  3. Read it (possibly taking a CPU data cache miss)

So, its about even for a lookup of a single fixed-length column, but for variable-length columns, and for multiple row selects, there’s a clear advantage to having the NULL bitmap.

Using DBCC IND and DBCC PAGE to examine a row in detail

Let’s create an example table to look at:

USE [master];

IF DATABASEPROPERTY (N'recordanatomy', 'Version') > 0 DROP DATABASE [RecordAnatomy];
CREATE DATABASE [RecordAnatomy];

USE [RecordAnatomy];

CREATE TABLE [example] ([destination] VARCHAR(100), [activity] VARCHAR(100), [duration] INT);

INSERT INTO [example] VALUES ('Banff', 'sightseeing', 5);
INSERT INTO [example] VALUES ('Chicago', 'sailing', 4);

And we can use DBCC IND again to find the page to look at:

DBCC IND ('recordanatomy', 'example', 1);

Remember we need the trace-flag to make the DBCC PAGE output go to the console instead of the error log. The output will contain something like the following:

Slot 0 Offset 0x60 Length 33

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C76C060

00000000:   30000800 05000000 0300f802 00160021 †0..............!
00000010:   0042616e 66667369 67687473 6565696e †.Banffsightseein
00000020:   67†††††††††††††††††††††††††††††††††††g

Slot 0 Column 0 Offset 0x11 Length 5

destination = Banff

Slot 0 Column 1 Offset 0x16 Length 11

activity = sightseeing

Slot 0 Column 2 Offset 0x4 Length 4

duration = 5

Let’s use the record structure I listed above to go through this record and see how things are stored.

  • Byte 0 is the TagA byte of the record metadata.
    • It’s 0x30, which corresponds to 0x10 (bit 4) and 0x20 (bit 5). Bit 4 means the record has a null bitmap and bit 5 means the record has variable length columns. If 0x40 (bit 6) was also set, that would indicate that the record has a versioning tag. If 0x80 (bit 7) was also set, that would indicate that byte 1 has a value in it.
    • Bits 1-3 of byte 0 give the record type. The possible values are:
      • 0 = primary record. A data record in a heap that hasn’t been forwarded or a data record at the leaf level of a clustered index.
      • 1 = forwarded record
      • 2 = forwarding record
      • 3 = index record
      • 4 = blob fragment
      • 5 = ghost index record
      • 6 = ghost data record
      • 7 = ghost version record. A special 15-byte record containing a single byte record header plus a 14-byte versioning tag that is used in some circumstances (like ghosting a versioned blob record)
    • In our example, none of these bits are set which means the record is a primary record. If the record was an index record, byte 0 would have the value 0x36. Remember that the record type starts on bit 1, not bit 0, and so the record type value from the enumeration above needs to be shifted left a bit (multiplied by two) to get its value in the byte.
  • Byte 1 is the TagB byte of the record metadata. It can either be 0x00 or 0x01>. If it is 0x01, that means the record type is ghost forwarded record. In this case it’s 0x00, which is what we expect given the TagA byte value.
  • Bytes 2 and 3 are the offset of the null bitmap in the record. This is 0x0008 (multi-byte values are stored as least-significant byte first). This means that there’s a 4-byte fixed length portion of the record starting at byte 4. We expect this because we know the table schema.
  • Bytes 4 to 7 are the fixed length portion. Again, because we know the table schema, we know to interpret these bytes as a 4-byte integer. Without that knowledge, you’d have to guess. The value therefore is 0x00000005, which is what we’d expect to see as the value of the duration column.
  • Bytes 8 and 9 are the count of columns in the record. This is 0x0003 which is correct. Given that there are only 3 columns, the null bitmap of one bit per column will fit in a single byte.
  • Byte 10 is the null bitmap. The value is 0xF8. We need to convert it to binary to make sense of the value. 0xF8 = 11111000 in binary. This makes sense – bits 0-2 represent columns 1-3 and they’re all 0, meaning the columns aren’t null. Bits 3-7 represent non-existent columns and they’re set to 1 for clarity.
  • Bytes 11 and 12 are the count of variable length columns in the record. That value is 0x0002, which we again know to be correct. This means there will be two two-byte entries in the variable length column offset array. These will be bytes 13-14 and 15-16, having values of 0x0016 and 0x0021 respectively. Remember that variable length column offset array entries point to the end of the column value – this is done so that we know how long each column is without having to store their length as well.
  • So, the final offset is bytes 15 and 16, which means the offset of the start of the first variable length column must be byte 17 (or 0x11 in hex), which agrees with the DBCC PAGE dump. The offset of the end of the first variable length column is 0x0016, so the first value is from byte 17 to byte 21 inclusive. This value is 0x42616E6666. We know from the table metadata that this is the first varchar column, destination. Converting to ASCII gives us the column value ‘Banff’. Using similar logic, the second value is from byte 22 to byte 32 inclusive and has the value ‘sightseeing’. Both of these match the data we’re expecting.

And that’s it.

Search Engine Q&A #2: Moving a database while Database Mirroring is running

This was a question from the MSDN Disaster Recovery forum I started while I was at Microsoft.

I have a 600 gig database that has a mirror. I need to move the databases from local drives to a SAN. Can anyone recommend a document that lists the steps to go through to move both the principle and mirror to the SAN with no down time? or minimal down time?

As far as I know, there isn’t any such document so I had a crack at coming up with a list of operations. Here’s what I had:

  1. Take a full backup of the principal on node A
  2. Restore it on the SAN on node B using WITH NORECOVERY, remembering to use WITH MOVE to place the files correctly, and with a different database name than the current mirror
  3. Take the required log backup on the principal and restore on the database copy on the SAN on node B
  4. Break the mirroring partnership
  5. Drop the current mirror database on node B
  6. Rename the database on the SAN on node B to be the mirror database — THIS DOESN”T WORK!
  7. Setup the mirroring partnership to point to the newly restored database on the SAN on node B
  8. Start mirroring and the new mirror will catch-up
  9. Failover to the mirror on node B, which becomes the new principal
  10. Follow the same procedure to move the new mirror on node A onto its SAN
  11. Failback if you want to

And I promised to try it out to make sure I had it right so in this blog post I’m going to walk through the steps of doing this. It turns out that the steps above are slightly incorrect. Step 6 above doesn’t work because the database is in recovery (so is inaccessible) and there’s a short-cut when moving the database on the first node to avoid having to take and copy more backups. Let’s see how it works and I’ll post the corrected sequence at the end.

As I did in yesterday’s mirroring post, I’m going to use the TicketSalesDB database from our Always-On DVDs. It’s only a few hundred MB instead of 600GB but the principal is the same (no pun intended :-)). I’ve got mirroring running between two nodes, SQLDEV01 (the principal) and SQLDEV02 (the mirror), both of which are running 2005 SP2 and I’ve got a simulated workload inserting rows into the database. I don’t actually have a SAN laying around so I’m cheating and I have directories called C:\SQLDEV01SAN and C:\SQLDEV02SAN instead. It’s the location change that’s the interesting part, not where the actual location is.

Step 1

On SQLDEV01, take a full backup and a log backup:


BACKUP LOG [TicketSalesDB] TO DISK = N'C:\SQLskills\TicketSalesDB_Log.bak' WITH INIT;

Step 2

On SQLDEV01, break the mirroring partnership:


And just check that it’s gone:

SELECT [mirroring_state_desc] FROM sys.database_mirroring WHERE [database_id] = DB_ID (N'TicketSalesDB');

Step 3

On SQLDEV02, drop the mirror database – this wouldn’t work unless mirroring was no longer running:


Step 4

Copy the backups to SQLDEV02 and restore them on the SAN and remembering to use WITH NORECOVERY:

RESTORE DATABASE [TicketSalesDB] FROM DISK = N'C:\SQLskills\TicketSalesDB.bak'
    MOVE N'TicketSalesDBData' TO N'C:\SQLDEV02SAN\TicketSalesDBData.MDF',
    MOVE N'TicketSalesFG2005Q1' TO N'C:\SQLDEV02SAN\TicketSalesFG2005Q1.NDF',
    MOVE N'TicketSalesFG2005Q2' TO N'C:\SQLDEV02SAN\TicketSalesFG2005Q2.NDF',
    MOVE N'TicketSalesFG2005Q3' TO N'C:\SQLDEV02SAN\TicketSalesFG2005Q3.NDF',
    MOVE N'TicketSalesFG2005Q4' TO N'C:\SQLDEV02SAN\TicketSalesFG2005Q4.NDF',
    MOVE N'TicketSalesDBLog' TO N'C:\SQLDEV02SAN\TicketSalesDBLog.LDF',

Step 5

On SQLDEV02, set the mirroring partner to be SQLDEV01:


Step 6

On SQLDEV01, start mirroring:


And check that it’s running:

SELECT [mirroring_state_desc] FROM sys.database_mirroring WHERE [database_id] = DB_ID (N'TicketSalesDB');

This time it returns:


Step 7

Now we need to failover so that we can move the database on SQLDEV01 onto its SAN. Before we do that, let’s make sure that SQLDEV01 is the principal:

SELECT [mirroring_role_desc] FROM sys.database_mirroring WHERE [database_id] = DB_ID (N'TicketSalesDB');

which returns:


Now force the failover:


And query the DMV again to make sure. This time the mirroring_state_desc returned is:



Now, I did all of this while my workload was running and it automatically failed over to SQLDEV02, with the database now hosted on the SAN. To do the same move on SQLDEV01, we don’t need to go through the backup and copy process again – we can just use the original backups we took in step 1.

Step 8

We need to break the mirroring partnership again, this time executing on SQLDEV02, the new principal:


On SQLDEV01, we can now drop the database and restore the original backups onto the SAN:


RESTORE DATABASE [TicketSalesDB] FROM DISK = N'C:\SQLskills\TicketSalesDB.bak'
    MOVE N'TicketSalesDBData' TO N'C:\SQLDEV01SAN\TicketSalesDBData.MDF',
    MOVE N'TicketSalesFG2005Q1' TO N'C:\SQLDEV01SAN\TicketSalesFG2005Q1.NDF',
    MOVE N'TicketSalesFG2005Q2' TO N'C:\SQLDEV01SAN\TicketSalesFG2005Q2.NDF',
    MOVE N'TicketSalesFG2005Q3' TO N'C:\SQLDEV01SAN\TicketSalesFG2005Q3.NDF',
    MOVE N'TicketSalesFG2005Q4' TO N'C:\SQLDEV01SAN\TicketSalesFG2005Q4.NDF',
    MOVE N'TicketSalesDBLog' TO N'C:\SQLDEV01SAN\TicketSalesDBLog.LDF',

RESTORE LOG [TicketSalesDB] FROM DISK = N'C:\SQLskills\TicketSalesDB_Log.bak' WITH NORECOVERY;

And setup mirroring again. On SQLDEV01:


And on SQLDEV02:


And we’re running again.

Step 9

Now all we need to do is fail the workload back to SQLDEV01 by executing this on SQLDEV02:



So – the corrected sequence for moving a database while mirroring is running is the following:


  1. Take a full backup of the principal database on node A, and the required log backup
  2. Break the mirroring partnership
  3. Drop the current mirror database on node B
  4. Copy the backups to node B and restore it on the SAN on node B using WITH NORECOVERY, remembering to use WITH MOVE to place the files correctly
  5. Setup the mirroring partnership to point to the newly restored database on the SAN on node B
  6. Start mirroring and the new mirror will catch-up
  7. Failover to the mirror on node B, which becomes the new principal
  8. Follow the same procedure to move the new mirror on node A onto its SAN, but using the original backups from step 1
  9. Failback

Hope this helps.

SQL Server 2008: Automatic Page Repair with Database Mirroring

One of the hottest features in SQL Server 2005 is database mirroring, and it’s helped many companies implement successful and relatively inexpensive high-availability strategies. In SQL Server 2008, Database Mirroring has been enhanced in several ways – one of which is the ability to automatically repair corrupt pages!

This feature is based on the fact that the principal and mirror databases are exactly the same. So, if a page becomes corrupt on the principal, SQL Server should be able to read the page from the mirror and use it to fix the principal. Similarly, if a page becomes corrupt on the mirror, the page can be read from the principal to fix up the mirror. Pretty cool, eh?


  • The feature works for pages that have 824 errors, 823 errors where the OS returns a CRC error while reading the page (to prevent resource issues triggering a page repair), and pages that have 829 errors (where the page is marked as restore pending). See my previous post here for more details on page errors.
  • Pages are fixed asynchronously.
    • If the page is corrupt in the principal, the query that hit the corrupt page will fail. Once discovered, a page is marked as being 829 until its fixed. This prevents an issue where a transient disk error could allow a subsequent update to change the page after it’s been queued for being repaired, and then the page is overwritten with a copy from the mirror, losing the update. Nasty.
    • If the page is corrupt in the mirror (which is discovered when the page is read as part of the continual recovery of the log), the mirroring session is suspended. The mirror keeps track of all corrupt pages that need to be repaired with copies from the principal. Once all corrupt pages have been repaired, the mirroring session will be resumed automatically. This means that if a page is corrupt in both the mirror and the principal, manual intervention will be required to resolve the issue.
  • The feature is available in Standard and Enterprise Editions.
  • There is a new DMV – sys.dm_db_mirroring_auto_page_repair – that allows you to track corrupt pages in mirrored databases
    • It covers all mirrored databases on a server.
    • It provides info on the last 100 pages that were found in any mirrored database, as well as the status of the automatic page repair operation.
    • This DMV isn’t yet included in the SQL Server 2008 Books Online available on TechNet but will have the following info:
      • Database ID the page is in
      • The Page ID, split into file and page-in-file
      • The error type – distinguishing between 823 errors, torn-page errors, page checksum failures, and all-other-824 errors
      • The status of the page repair operation
      • The time that the status was last updated
  • If a page repair fails for any reason (e.g. the mirroring partner couldn’t supply the page) then the repair will be marked as failed. If the page is then hit again (by a query on the principal or a recovery operation on the mirror) then it will be re-queued for repair.
  • Not all pages can be repaired – the file header page, database boot page, and allocation bitmap pages (GAM, SGAM, PFS) cannot be repaired this way.

In Action

I’ve got a system with a few SQL Server 2008 instances running so I decided to give it a try. Here’s what I did, using the TicketSalesDB from the Always-On DVDs and labs that SQLskills produces:

  • Changed the database to use page checksums
  • Rebuilt the clustered index of one of the tables (so the pages have page checksums on)
  • Took the initial backup of the database that’s needed for mirroring (so that it has no corruptions in)
  • Used DBCC IND to find a page in the leaf level of the clustered index and corrupted the page
  • Setup mirroring (but using the backup I took before introducing the corruption – so the mirror database will have a clean copy of the page I corrupted)
  • Performed a query on the table with the corrupt page to force the page checksum failure and kick-off automatic page repair

And it worked! Going back into the instance with the principal database and querying the DMV gives:

SELECT * FROM sys.dm_db_mirroring_auto_page_repair;
database_id file_id     page_id              error_type page_status modification_time
----------- ----------- -------------------- ---------- ----------- -----------------------
6           4           4256                 -1         5           2007-09-27 17:23:20.067

The page I corrupted was (4:4256) and page_status of 5 means the repair succeeded. Running the query again confirms that the corruption has been fixed. The page was also logged in the suspect_pages table in msdb:

SELECT * FROM msdb..suspect_pages;
database_id file_id     page_id              event_type  error_count last_update_date
----------- ----------- -------------------- ----------- ----------- -----------------------
6           4           4256                 5           1           2007-09-27 17:23:20.407

In Books Online (both 2005 and 2008), event_type of 5 means that the page was repaired.

I checked the SQL Server error log for the principal and this is what I found:

2007-09-27 17:17:10.41 spid25s     Database mirroring is active with database 'TicketSalesDB' as the principal copy. This is an informational message only. No user action is required.
2007-09-27 17:23:19.92 spid51      Error: 824, Severity: 24, State: 2.
2007-09-27 17:23:19.92 spid51      SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x55684fbe; actual: 0x16e84fbe). It occurred during a read of page (4:4256) in database ID 6 at offset 0x00000002140000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.6\MSSQL\Data\TicketSalesFG2005Q1.NDF'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2007-09-27 17:23:19.96 spid26s     Database mirroring is attempting to repair physical page (4:4256) in database 'TicketSalesDB' by requesting a copy from the partner.
2007-09-27 17:23:20.42 spid26s     Database mirroring successfully repaired physical page (4:4256) in database 'TicketSalesDB' by obtaining a copy from the partner.

I also checked the error log for the mirror and there’s nothing relevant in there. I guess the same info would be output to the mirror database’s error log if the mirror became corrupted. It’s reasonably simple to check this using similar steps as above, but corrupting the database before the initial backup is taken, fixing the corruption again after the backup and before mirroring starts (so the corruption is on the mirror but not the principal), and then triggering an update on the corrupt page. When the update is replayed on the mirror, the corruption will be hit and the page repaired in the same way. I’ll check later and blog if there’s any difference.


SQL Server 2008 introduces an enhancement to database mirroring that can bi-directionally pull pages between the principal and mirror databases to fix page corruptions. One word of caution I’d give is that this feature doesn’t mean you can ignore these errors when they occur – you still need to do root-cause analysis on the corruption and take steps to prevent them happening again before a corruption occurs that automatic page repair cannot fix (as I mentioned above).

Nevertheless, this is a tremendously useful feature that’s going to save a lot of downtime. Cool!