This week I'm going to post a bunch of info on the basic structures used to store data and track allocations in SQL Server. A bunch of this was posted back when I started blogging at TechEd 2006 but I want to consolidate/clarify info and add more about using DBCC PAGE to examine the various structures.
So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that...
Data records
- Data records are stored on data pages.
- Data records store rows from a heap or the leaf level of a clustered index.
- A data record always stores all columns from a table row - either by-value or by-reference.
- If any columns are for LOB data types (text, ntext, image, and the new LOB types in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), XML), then there's a pointer stored in the data record which points to a text record on a different page (the root of a loose tree that stores the LOB value). Exceptions to this are when the schema has been set to store LOB columns 'in-row' when possible. This is when a LOB value is small enough to fit within the size limits of a data record. This is a performance benefit as selecting the LOB column does not require an extra IO to read the text record.
- In SQL Server 2005, non-LOB variable length columns (e.g. varchar, sqlvariant) may also be stored 'off-row' as part of the row-overflow feature of having table rows longer than 8060 bytes. In this case the storage format is the same as for LOB values - a pointer in the data record pointing to a text record.
- There is a difference to how the columns are laid out between heaps and clustered indexes - I'll cover that in a later post.
Forwarded/Forwarding records
- These are technically data records and are only present in a heap.
- A forwarded record is a data record in a heap that was updated and was too large to fit in-place on its original page and so has been moved to another page. It contains a back-pointer to the forwarding record.
- A forwarding record is left in its place and points to the new location of the record. It's sometimes known as a forwarding-stub, as all it contains is the location of the real data record.
- This is done to avoid having to update any non-clustered index records that point back directly to the original physical location of the record.
- Although this optimizes non-clustered index maintenance during updates, it can cause additional IOs during SELECTs. This is because the non-clustered index record points to the old location of the index, so an extra IO might be needed to read the real location of the data row. This is fuel for the heap vs clustered index debate, in favor of clustered indexes.
Index records
- Index records are stored on index pages.
- There are two types of index records (which differ only in what columns they store):
- Those that store non-clustered index rows at the leaf level of a non-clustered index
- Those that comprise the b-tree that make up clustered and non-clustered indexes (i.e. in index pages above the leaf level of a clustered or non-clustered index)
- I'll explain more about the differences between these in a later post as it can be quite complicated (especially the differences between SQL Server 2000 and 2005) and is worth doing in separate posts.
- Index records typically do not contain all the column values in a table (although some do - called covering indexes).
- In SQL Server 2005, non-clustered index records can include LOB values as included columns (with the storage details exactly the same as for data records) and also can have row-overflow data that is pushed off-row (again, in exactly the same way as for data records).
Text records
- Text records are stored on text pages.
- There are various types of text records that comprise the tree structure that stores LOB values, stored on two types of text page. I'll explain how they work and are linked together in a future post.
- They are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.
Ghost records
- These are records that have been logically deleted but not physically deleted from a page. The reasons for this are complicated, but basically having ghost records simplfies key-range locking and transaction rollback.
- The record is marked with a bit that indicates it's a ghost record and cannot be physically deleted until the transaction that caused it to be ghosted commits. Once this is done, it is deleted by an asynchronous background proces (called the ghost-cleanup task) or it is converted back to a real record by an insert of a record with the exact same set of keys.
Other record types
- There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and database boot page). Again, I'll go into these in later posts (there's a big queue of posts building up :-))
Record structure
All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.
The record structure is as follows:
- record header
- 4 bytes long
- two bytes of record metadata (record type)
- two bytes pointing forward in the record to the NULL bitmap
- fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
- NULL bitmap
- two bytes for count of columns in the record
- variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
- this allows an optimization when reading columns that are NULL
- variable-length column offset array
- two bytes for the count of variable-length columns
- two bytes per variable length column, giving the offset to the end of the column value
- versioning tag
- this is in SQL Server 2005 only and is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb
NULL bitmap optimization
So why is the NULL bitmap an optimization?
Firstly, having a null bitmap removes the need for storing special 'NULL' values for fixed-length datatypes. Without the null bitmap, how can you tell whether a column is NULL? For fixed-length columns you'd need to define a special 'NULL' value, which limits the effective range of the datatype being stored. For varchar columns, the value could be a zero-length empty string, so just checking the length doesn't work - you'd need the special value again. For all other variable-length data types you can just check the length. So, we nede the NULL bitmap.
Secondly, it saves CPU cycles. If there was no NULL bitmap, then there are extra instructions executed for fixed-length and variable-length columns.
For fixed-length:
- read in the stored column value (possibly taking a cpu data cache miss)
- load the pre-defined NULL value for that datatype (possibly taking a cpu data cache miss, but only for the first read in the case of a multiple row select)
- do a comparison between the two values
For variable-length:
- calculate the offset of the variable length array
- read the number of variable length columns (possibly taking a cpu data cache miss)
- calculate the position in the variable length offset array to read
- read the column offset from it (possibly taking a cpu data cache miss)
- read the next one too (possibly taking another cpu data cache miss, if the offset in step 4 was on the boundary of a cache line size)
- compare them to see if they're the same
But with a NULL bitmap, all you have to do is:
- read the NULL bitmap offset (possibly taking a cpu data cache miss)
- calculate the additional offset of the NULL bit you want to read
- read it (possibly taking a cpu data cache miss)
So, its about even for a lookup of a single fixed-length column, but for variable-length columns, and for multiple row selects, there's a clear advantage to having the NULL bitmap.
Using DBCC IND and DBCC PAGE to examine a row in detail
Let's create an example table to look at:
USE MASTER;
GO
IF DATABASEPROPERTY (N'recordanatomy', 'Version') > 0 DROP DATABASE recordanatomy;
GO
CREATE DATABASE recordanatomy;
GO
USE recordanatomy;
GO
CREATE TABLE example (destination VARCHAR(100), activity VARCHAR(100), duration INT);
GO
INSERT INTO example VALUES ('Banff', 'sightseeing', 5);
INSERT INTO example VALUES ('Chicago', 'sailing', 4);
GO
And we can use DBCC IND again to find the page to look at: