Inside the Storage Engine: Anatomy of a record

This week I’m going to post a bunch of info on the basic structures used to store data and track allocations in SQL Server. A bunch of this was posted back when I started blogging at TechEd 2006 but I want to consolidate/clarify info and add more about using DBCC PAGE to examine the various structures.

So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that…

Data records

  • Data records are stored on data pages.
  • Data records store rows from a heap or the leaf level of a clustered index.
  • A data record always stores all columns from a table row – either by-value or by-reference.
    • If any columns are for LOB data types (text, ntext, image, and the new LOB types in SQL Server 2005 – varchar(max), nvarchar(max), varbinary(max), XML), then there’s a pointer stored in the data record which points to a text record on a different page (the root of a loose tree that stores the LOB value). Exceptions to this are when the schema has been set to store LOB columns ‘in-row’ when possible. This is when a LOB value is small enough to fit within the size limits of a data record. This is a performance benefit as selecting the LOB column does not require an extra IO to read the text record.
    • In SQL Server 2005, non-LOB variable length columns (e.g. varchar, sqlvariant) may also be stored ‘off-row’ as part of the row-overflow feature of having table rows longer than 8060 bytes. In this case the storage format is the same as for LOB values – a pointer in the data record pointing to a text record.
  • There is a difference to how the columns are laid out between heaps and clustered indexes
    • In a heap, the columns are laid out in the order specified in the CREATE TABLE statement (obviously divided into fixed-length and variable-length)
    • In a clustered index, the cluster keys become the first physical columns in the record, followed by the rest in the order specified in the CREATE TABLE statement

Forwarded/Forwarding records

  • These are technically data records and are only present in a heap.
  • A forwarded record is a data record in a heap that was updated and was too large to fit in-place on its original page and so has been moved to another page. It contains a back-pointer to the forwarding record.
  • A forwarding record is left in its place and points to the new location of the record. It’s sometimes known as a forwarding-stub, as all it contains is the location of the real data record.
  • This is done to avoid having to update any non-clustered index records that point back directly to the original physical location of the record.
  • Although this optimizes non-clustered index maintenance during updates, it can cause additional I/Os during SELECTs. This is because the non-clustered index record points to the old location of the index, so an extra IO might be needed to read the real location of the data row. This is fuel for the heap vs clustered index debate, in favor of clustered indexes.

Index records

  • Index records are stored on index pages.
  • There are two types of index records (which differ only in what columns they store):
    1. Those that store non-clustered index rows at the leaf level of a non-clustered index
    2. Those that comprise the b-tree that make up clustered and non-clustered indexes (i.e. in index pages above the leaf level of a clustered or non-clustered index)
  • I’ll explain more about the differences between these in a later post as it can be quite complicated (especially the differences between SQL Server 2000 and 2005) and is worth doing in separate posts.
  • Index records typically do not contain all the column values in a table (although some do – called covering indexes).
  • In SQL Server 2005, non-clustered index records can include LOB values as included columns (with the storage details exactly the same as for data records) and also can have row-overflow data that is pushed off-row (again, in exactly the same way as for data records).

Text records

  • Text records are stored on text pages.
  • There are various types of text records that comprise the tree structure that stores LOB values, stored on two types of text page. I’ll explain how they work and are linked together in a future post.
  • They are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.

Ghost records

  • These are records that have been logically deleted but not physically deleted from a page. The reasons for this are complicated, but basically having ghost records simplifies key-range locking and transaction rollback.
  • The record is marked with a bit that indicates it’s a ghost record and cannot be physically deleted until the transaction that caused it to be ghosted commits. Once this is done, it is deleted by an asynchronous background process (called the ghost-cleanup task) or it is converted back to a real record by an insert of a record with the exact same set of keys.

Other record types

  • There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and per-database boot page). Again, I’ll go into these in later posts (there’s a big queue of posts building up :-))

Record structure

All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.

The record structure for non-compressed records is as follows:

  • Record header
    • 4 bytes long
    • Two bytes of record metadata (record type)
    • Two bytes pointing forward in the record to the null bitmap
  • Fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
  • Null bitmap
    • Two bytes for count of columns in the record
    • Variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
    • This allows an optimization when reading columns that are NULL
  • Variable-length column offset array
    • Two bytes for the count of variable-length columns
    • Two bytes per variable length column, giving the offset to the end of the column value
  • Versioning tag
    • This is in SQL Server 2005 only and is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb

NULL bitmap optimization

So why is the null bitmap an optimization?

Firstly, having a null bitmap removes the need for storing special NULL values for fixed-length datatypes. Without the null bitmap, how can you tell whether a column is NULL? For fixed-length columns you’d need to define a special NULL value, which limits the effective range of the datatype being stored. For varchar columns, the value could be a zero-length empty string, so just checking the length doesn’t work – you’d need the special value again. For all other variable-length data types you can just check the length. So, we need the NULL bitmap.

Secondly, it saves CPU cycles. If there was no null bitmap, then there are extra instructions executed for fixed-length and variable-length columns.

For fixed-length:

  1. Read in the stored column value (possibly taking a CPU data cache miss)
  2. Load the pre-defined NULL value for that datatype (possibly taking a CPU data cache miss, but only for the first read in the case of a multiple row select)
  3. Do a comparison between the two values

For variable-length:

  1. Calculate the offset of the variable length array
  2. Read the number of variable length columns (possibly taking a CPU data cache miss)
  3. Calculate the position in the variable length offset array to read
  4. Read the column offset from it (possibly taking a CPU data cache miss)
  5. Read the next one too (possibly taking another CPU data cache miss, if the offset in step 4 was on the boundary of a cache line size)
  6. Compare them to see if they’re the same

But with a null bitmap, all you have to do is:

  1. Read the null bitmap offset (possibly taking a CPU data cache miss)
  2. Calculate the additional offset of the NULL bit you want to read
  3. Read it (possibly taking a CPU data cache miss)

So, its about even for a lookup of a single fixed-length column, but for variable-length columns, and for multiple row selects, there’s a clear advantage to having the NULL bitmap.

Using DBCC IND and DBCC PAGE to examine a row in detail

Let’s create an example table to look at:

USE [master];

IF DATABASEPROPERTY (N'recordanatomy', 'Version') > 0 DROP DATABASE [RecordAnatomy];
CREATE DATABASE [RecordAnatomy];

USE [RecordAnatomy];

CREATE TABLE [example] ([destination] VARCHAR(100), [activity] VARCHAR(100), [duration] INT);

INSERT INTO [example] VALUES ('Banff', 'sightseeing', 5);
INSERT INTO [example] VALUES ('Chicago', 'sailing', 4);

And we can use DBCC IND again to find the page to look at:

DBCC IND ('recordanatomy', 'example', 1);

Remember we need the trace-flag to make the DBCC PAGE output go to the console instead of the error log. The output will contain something like the following:

Slot 0 Offset 0x60 Length 33

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C76C060

00000000:   30000800 05000000 0300f802 00160021 †0..............!
00000010:   0042616e 66667369 67687473 6565696e †.Banffsightseein
00000020:   67†††††††††††††††††††††††††††††††††††g

Slot 0 Column 0 Offset 0x11 Length 5

destination = Banff

Slot 0 Column 1 Offset 0x16 Length 11

activity = sightseeing

Slot 0 Column 2 Offset 0x4 Length 4

duration = 5

Let’s use the record structure I listed above to go through this record and see how things are stored.

  • Byte 0 is the TagA byte of the record metadata.
    • It’s 0x30, which corresponds to 0x10 (bit 4) and 0x20 (bit 5). Bit 4 means the record has a null bitmap and bit 5 means the record has variable length columns. If 0x40 (bit 6) was also set, that would indicate that the record has a versioning tag. If 0x80 (bit 7) was also set, that would indicate that byte 1 has a value in it.
    • Bits 1-3 of byte 0 give the record type. The possible values are:
      • 0 = primary record. A data record in a heap that hasn’t been forwarded or a data record at the leaf level of a clustered index.
      • 1 = forwarded record
      • 2 = forwarding record
      • 3 = index record
      • 4 = blob fragment
      • 5 = ghost index record
      • 6 = ghost data record
      • 7 = ghost version record. A special 15-byte record containing a single byte record header plus a 14-byte versioning tag that is used in some circumstances (like ghosting a versioned blob record)
    • In our example, none of these bits are set which means the record is a primary record. If the record was an index record, byte 0 would have the value 0x36. Remember that the record type starts on bit 1, not bit 0, and so the record type value from the enumeration above needs to be shifted left a bit (multiplied by two) to get its value in the byte.
  • Byte 1 is the TagB byte of the record metadata. It can either be 0x00 or 0x01>. If it is 0x01, that means the record type is ghost forwarded record. In this case it’s 0x00, which is what we expect given the TagA byte value.
  • Bytes 2 and 3 are the offset of the null bitmap in the record. This is 0x0008 (multi-byte values are stored as least-significant byte first). This means that there’s a 4-byte fixed length portion of the record starting at byte 4. We expect this because we know the table schema.
  • Bytes 4 to 7 are the fixed length portion. Again, because we know the table schema, we know to interpret these bytes as a 4-byte integer. Without that knowledge, you’d have to guess. The value therefore is 0x00000005, which is what we’d expect to see as the value of the duration column.
  • Bytes 8 and 9 are the count of columns in the record. This is 0x0003 which is correct. Given that there are only 3 columns, the null bitmap of one bit per column will fit in a single byte.
  • Byte 10 is the null bitmap. The value is 0xF8. We need to convert it to binary to make sense of the value. 0xF8 = 11111000 in binary. This makes sense – bits 0-2 represent columns 1-3 and they’re all 0, meaning the columns aren’t null. Bits 3-7 represent non-existent columns and they’re set to 1 for clarity.
  • Bytes 11 and 12 are the count of variable length columns in the record. That value is 0x0002, which we again know to be correct. This means there will be two two-byte entries in the variable length column offset array. These will be bytes 13-14 and 15-16, having values of 0x0016 and 0x0021 respectively. Remember that variable length column offset array entries point to the end of the column value – this is done so that we know how long each column is without having to store their length as well.
  • So, the final offset is bytes 15 and 16, which means the offset of the start of the first variable length column must be byte 17 (or 0x11 in hex), which agrees with the DBCC PAGE dump. The offset of the end of the first variable length column is 0x0016, so the first value is from byte 17 to byte 21 inclusive. This value is 0x42616E6666. We know from the table metadata that this is the first varchar column, destination. Converting to ASCII gives us the column value ‘Banff’. Using similar logic, the second value is from byte 22 to byte 32 inclusive and has the value ‘sightseeing’. Both of these match the data we’re expecting.

And that’s it.

19 thoughts on “Inside the Storage Engine: Anatomy of a record

  1. in "Record Structure" you write: "versioning tag – this is in SQL Server 2005 only and is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb"
    I guess these 14 bytes are just consumed if row versioning is turned on for the database and the table?

  2. i have come accross instances where the slot array first entry is "60 01" or some times even "60 02"(hex editor) which should be 60 00 (as viewed in dbcc page dump).
    this offcourse means ,96 offset is where i can find my first record in the page. But as viewing it hex editor it gives some different offset(60 01).
    There are instances where while reading a record , in the " offset of the variable length array" section, it gives similar byte mismatch problem (In hex editor) this result it change in value of the offset an hence a particular variable length value is not traceable.

    I have’nt been able to figure out this.Can you please help me understand this.

  3. It totally depends where the first slot actually is on the page. There’s no guarantee that the first slot on the page will always be at offset 96 – what if that record was deleted, for instance?

  4. Hi Paul,
    i created a table with three columns, col1,col2 and col3 (all int). I am made col1 and col2 as my composite primary key.
    What i noticed is that, in record structure of this table, values of col2 and col3 came before than col1. Can You explain me this.


  5. I have a severely damaged database where several raid stripes have been zeroed out, damaging data and IAM chains for data and LOB. System files are accessible.
    For a table for which DBCC IND reveals a pointed IAM page zeroed out, is there any possibility to get a data record using SQL Server by providing its table name or object_id and its physical address (partition_number, file_number, page_number, slot_number)?
    What is of interest is to get the record by direct access to the pointed position, preventing the access to the IAM chains that are corrupted, and if used the command ends with error and no data is returned.

  6. Thank you for quick answer.
    I already used (using your posts) %%PHYSLOC%% in cursor to extract record by record data from tables. It worked for Tables where IAM and LOB chains were complete, but had dammaged data pages. Could keep control in script past data pages that were damaged.
    For all tables that had IAM or LOB chain damaged (a damaged page in chain), could not get any record %%PHYSLOC%%.
    I got record’s %%PHYSLOC%% with DBCC IND, and DBCC PAGE, but then could not get the record. It seems that SQL Server gives up on a table with damaged allocation chains, or I cannot keep control in script.
    The effort to fix pages inside SQL Server so it can provide data seems to be a never ending path of unforeseen issues, so I switched to extract data directly from files.
    Thank you again for all your posts that reveal bits of information about SQL Server internals.

  7. I was looking at three rows in one table, and could not figure out why the record lengths were what they were according to this anatomy of a record dissection for 2 of the three rows. However, after looking into it, I found that fixed width columns were added after the creation of the rows, and only the third row had been updated since that time. My thought is that the Alter Table Add Column statement has been optimized so that it does not go through each row and add in the new columns, and they are only added after the row is otherwise updated.

    Can you confirm this? I can’t seem to find an explanation of Alter Table Add Column’s affect on a row level.

  8. Sir Paul Randal,
    Could you tell me where table’s schema information (id and its transformed text e.g dbo, user-defined schema) is stored ? I have a really hard time trying to learn about this. Thank you Sir!

  9. Is there any affect, subtle or otherwise, in the order of fields of different type in a table. E.g.: if you have a large table and don’t have your strings (nvarchar) or bit or integers grouped together could there be a performance hit?

  10. Hi Paul,

    In the post, you have stated that “Remember that null bitmap entries point to the end of the column value – this is done so that we know how long each column is without having to store their length as well.”

    Does null bitmap point to the end of the column value or variable length column offset array point to the end of the column value?

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.