Thursday, October 04, 2007

Are you coming to SQL Connections in November? If you are, and you're coming to either of our interactive sessions, we invite you to send us questions in advance that we will pick from and answer during the sessions. The two sessions are:

  • SDB351: Follow the Rabbit - Interactive Q&A on Availability

In this session, Kimberly Tripp and Paul Randal will have only 5-10 slides. The focus of this session is on mixing availability technologies to create the best overall architecture to minimize downtime and data loss. In general, we're going to focus on best practices and then open up to your questions so that you can drive the discussion! This session might not seem as structured as other sessions, but you'll be surprised at how informative and fun it is! Grab your lunch and come back - we'll probably still be hanging out!

  • SDB349: Follow the Rabbit - Interactive Q&A on the Storage Engine and the Relational Engine

In this session, Kimberly Tripp and Paul Randal will have only 5-10 slides. Each slide cover topics for discussion as well as the reason(s) for why something might be behaving badly and/or things to try to solve your problems. In general we're going to focus on best practices and then open up to your questions so that you can drive the discussion! Paul will focus on the SE (Storage Engine) and internals and Kimberly will focus on the RE (Relational Engine) and query tuning/performance. This session might not seem as structured as other sessions, but you'll be surprised at how informative and fun it is!

Please send your questions to questions@sqlskills.com with a subject line of 'Follow the Rabbit'. If we read out your question and answer it during the session, we'll give you a prize - one of our DVDs full of hands-on labs to learn from (the ones we give out to pre-con and post-con attendees) - we'll definitely do a few per session. Thanks!

Paul and Kimberly

Thursday, October 04, 2007 1:19:57 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

I mentioned this in my Anatomy of a page post - its a common misconception that records in an index are ALWAYS stored in the same physical order as the logical order defined by the index key. Here's proof for you that this is incorrect (as well as introducing you to the other dump styles for DBCC PAGE).

I'm going to create a table with a clustered index on an integer column and keep the table to a single page for simplicity:

USE MASTER;

GO

IF DATABASEPROPERTY (N'rowordertest', 'Version') > 0 DROP DATABASE rowordertest;

GO

CREATE DATABASE rowordertest;

GO

USE rowordertest;

GO

CREATE TABLE t1 (c1 INT, c2 VARCHAR (10));

CREATE CLUSTERED INDEX t1c1 ON t1 (c1);

GO

Now I'm going to insert a few rows into the table, with c1 from 2 to 5 - conspicuously not inserting c1 = 1:

INSERT INTO t1 VALUES (2, REPLICATE ('b', 10));

INSERT INTO t1 VALUES (3, REPLICATE ('c', 10));

INSERT INTO t1 VALUES (4, REPLICATE ('d', 10));

INSERT INTO t1 VALUES (5, REPLICATE ('e', 10));

GO

Now, using DBCC IND  we see that the data page is (1:143) and dumping that with DBCC PAGE gives the following (skipping the header output):

DBCC IND ('rowordertest', 't1', 1);

GO

DBCC TRACEON (3604);

GO

DBCC PAGE ('rowordertest', 1, 143, 3);

GO

Slot 0 Offset 0x60 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5BA3C060

00000000:   30000800 02000000 0300f802 0011001b †0...............
00000010:   00626262 62626262 626262†††††††††††††.bbbbbbbbbb
UNIQUIFIER = [NULL]  

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 2

Slot 0 Column 2 Offset 0x11 Length 10

c2 = bbbbbbbbbb

<snip slots 1 and 2>                   

Slot 3 Offset 0xb1 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5BA3C0B1

00000000:   30000800 05000000 0300f802 0011001b †0...............
00000010:   00656565 65656565 656565†††††††††††††.eeeeeeeeee
UNIQUIFIER = [NULL]

Slot 3 Column 1 Offset 0x4 Length 4

c1 = 5

Slot 3 Column 2 Offset 0x11 Length 10

c2 = eeeeeeeeee

DBCC PAGE with dump-style 3 always outputs the records on a page in their logical order (because that's how the slot array is ordered). Notice that the record with c1 = 2 is stored at offset 0x60 in the page and the last record on the page with c1 = 5 is stored at offset 0xb1. Now we'll insert a record with c1 =1. This will become the first logical record in the index, but will it cause the page to be shuffled so the records can all be stored in logical order?

INSERT INTO t1 VALUES (1, REPLICATE ('a', 10));

GO

DBCC PAGE ('rowordertest', 1, 143, 3);

GO

Slot 0 Offset 0xcc Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x61FCC0CC

00000000:   30000800 01000000 0300f802 0011001b †0...............
00000010:   00616161 61616161 616161†††††††††††††.aaaaaaaaaa
UNIQUIFIER = [NULL]

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1

Slot 0 Column 2 Offset 0x11 Length 10

c2 = aaaaaaaaaa

Slot 1 Offset 0x60 Length 27

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

<snip>

The answer is no. Even though the record with c1 =1 is output by DBCC PAGE first, look at its offset within the page - 0xCC - clearly the last record on the page and stored in a different physical order than the logical order defined by the index key. Further proof can be obtained by looking at a raw hex dump of the page using dump style 2 with DBCC PAGE:

DBCC PAGE ('rowordertest', 1, 143, 2);

GO

<snip>

6204C000:   01010400 00400001 00000000 00000800 †.....@..........
6204C010:   00000000 00000500 44000000 0f1fe700 †........D.......
6204C020:   8f000000 01000000 13000000 60000000 †............`...
6204C030:   16000000 00000000 00000000 00000000 †................
6204C040:   01000000 00000000 00000000 00000000 †................
6204C050:   00000000 00000000 00000000 00000000 †................
6204C060:   30000800 02000000 0300f802 0011001b †0...............
6204C070:   00626262 62626262 62626230 00080003 †.bbbbbbbbbb0....
6204C080:   00000003 00f80200 11001b00 63636363 †............cccc
6204C090:   63636363 63633000 08000400 00000300 †cccccc0.........
6204C0A0:   f8020011 001b0064 64646464 64646464 †.......ddddddddd
6204C0B0:   64300008 00050000 000300f8 02001100 †d0..............
6204C0C0:   1b006565 65656565 65656565 30000800 †..eeeeeeeeee0...
6204C0D0:   01000000 0300f802 0011001b 00616161 †.............aaa
6204C0E0:   61616161 61616100 00000000 00000000 †aaaaaaa.........
6204C0F0:   00000000 00000000 00000000 00000000 †................

<snip>

You can clearly see that the last row I inserted, with c1 = 1 and the replicated 'a's is stored after the other records on the page, even though its key is logically before the others.

And just to nail the point home, doing a dump style 1 with DBCC PAGE will dump out the slot array for us:

DBCC PAGE ('rowordertest', 1, 143, 1);

GO

<snip>

OFFSET TABLE:

Row - Offset
4 (0x4) - 177 (0xb1)
3 (0x3) - 150 (0x96)
2 (0x2) - 123 (0x7b)
1 (0x1) - 96 (0x60)
0 (0x0) - 204 (0xcc)

<snip>

The slot array grows backwards, which is why its dumped in what looks like reverse logical order. You can see that slot 0, which represents the first logical record on the page, is stored at an offset greater than the others.

Thursday, October 04, 2007 7:29:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, October 03, 2007

This is a combo from some previously posted material, with some more DBCC PAGE output thrown in.

IAM pages

An IAM (Index Allocation Map) page tracks approximately 4GB worth of space in a single file, aligned on a 4GB boundary. These 4GB chunks are called 'GAM intervals'. An IAM page tracks which extents within that specific GAM interval belongs to a single entity (I'm chosing my words carefully here and not using any word that has SQL Server connotations like 'object').

An IAM page can only track the space for a single GAM interval in a single file so If the database has multiple files, or some files are more then 4GB, and the entity has space allocated from multiple files or multiple GAM intervals within a file, then you can see how multiple IAM pages are needed for each entity to track all the space that its using. If an entity requires multiple IAM pages to track all its extents, then they IAM page have to be linked together. That's where an IAM chain comes in. More on these below.

Each IAM page has two records, an IAM page header and a bitmap. Let's look at one with DBCC PAGE. I'm using the database from the page split post. Doing a DBCC IND on the table we created gives us:

SEQA3.jpg

By looking at the PageType column, we can see that there's an IAM page (a page with type 10 - see the post on Anatomy of a page for more details) with page ID (1:152):

DBCC TRACEON (3604);

GO

DBCC PAGE ('pagesplittest', 1, 152, 3);

GO

m_pageId = (1:152)                   m_headerVersion = 1                  m_type = 10
m_typeFlagBits = 0x0                 m_level = 0                          m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 68     m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594042384384
Metadata: PartitionId = 72057594038386688                                 Metadata: IndexId = 1
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 90                         m_slotCnt = 2                        m_freeCnt = 6
m_freeData = 8182                    m_reservedCnt = 0                    m_lsn = (18:116:13)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = -1947725876

Allocation Status

GAM (1:2) = ALLOCATED                SGAM (1:3) = ALLOCATED
PFS (1:1) = 0x70 IAM_PG MIXED_EXT ALLOCATED   0_PCT_FULL                  DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

IAM: Header @0x620CC064 Slot 0, Offset 96

sequenceNumber = 0                   status = 0x0                         objectId = 0
indexId = 0                          page_count = 0                       start_pg = (1:0)


IAM: Single Page Allocations @0x620CC08E

Slot 0 = (1:143)                     Slot 1 = (1:153)                     Slot 2 = (1:154)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)


IAM: Extent Alloc Status Slot 1 @0x620CC0C2

(1:0)        - (1:272)      = NOT ALLOCATED

Some things to note about the page header itself:

  • The page has type 10, as we'd expect
  • The previous and next page pointers are NULL, because there aren't any other IAM pages in this IAM chain
  • The slot count is 2 - one for the IAM header record and one for the bitmap itself
  • The page is almost entirely full

The IAM page header has the following fields:

  • sequenceNumber
    • This is the position of the IAM page in the IAM chain. This increases by one for each page added to the IAM chain.
  • status
    • This is unused.
  • objectId
  • indexId
    • On SQL Server 2000 and before, these are the object  and index IDS that the IAM page is part of. On SQL Server 2005 and later they are unused.
  • page_count
    • This is unused  - it used to be the number of page IDs that are being tracked in the single page allocation array.
  • start_pg
    • This is the GAM interval that the page maps. It stores the first page ID in the mapped interval.
  • Single Page Allocations array
    • These are the pages that have been allocated from mixed extents. This array is only used in the first IAM page in the chain (as the whole IAM chain only need to track at most 8 single-page allocations).

The bitmap occupies the rest of the IAM page and has a bit for each extent in the GAM interval. The bit is set if the extent is allocated to the entity, and clear if it is not. Obviously two IAM pages that map the same GAM interval for different entities cannot both have the same bit set - this is checked by DBCC CHECKDB. In the output from DBCC PAGE above, you can see that there are no extents allocated to the table. You'll notice that the output only goes up to the extent starting at page 272 in the file - this is because the data file is only that big. I inserted a bunch more rows into the table and did another DBCC PAGE of the IAM page. This time the DBCC PAGE output contains:

IAM: Single Page Allocations @0x620CC08E

Slot 0 = (1:143)                     Slot 1 = (1:153)                     Slot 2 = (1:154)
Slot 3 = (1:155)                     Slot 4 = (1:156)                     Slot 5 = (1:157)
Slot 6 = (1:158)                     Slot 7 = (1:159)


IAM: Extent Alloc Status Slot 1 @0x620CC0C2

(1:0)        - (1:152)      = NOT ALLOCATED
(1:160)      - (1:296)      =     ALLOCATED
(1:304)      - (1:400)      = NOT ALLOCATED

You can see that the entire single-page allocation array is full and then allocations switched to dedicated extents. The first available extent must have been the one starting at page 160 and all extents up to an including the one starting at page 296 are now allocated. Note also that the file must have grown because the output now goes up to page 400 in the file.

A couple more things to note about IAM pages:

  • There are themselves single-page allocations from mixed extents and are not tracked anywhere
  • They can be allocated from any file to track extents in any other file

IAM chains

If we continued to grow the file and fill up the table then eventually we'd need another IAM page to map the next GAM interval. This is where an IAM chain comes in. It's a linked-list of IAM pages that track the space allocated to a single entity. The linked-list is not sorted at all - IAM pages are appended to it in the order that they're needed. The IAM pages within the list are numbered, again, in the order that they were appended to the list.

Definition of 'entity' - what uses an IAM chain? This is vastly different in SQL Server 2000 and 2005.

In SQL Server 2000, a single IAM chain is used for each:

  • Heap or clustered index
    • A table can only have one or the other, not both. These have index IDs of 0 and 1 respectively.
  • Non-clustered index
    • These have index IDs from 2 to 250 (i.e. you can only have 249 of them)
  • Table's complete LOB storage
    • For LOB columns (text, ntext, image) in the heap or clustered index. This is sometimes called the 'text index' and has a fixed index ID of 255.

This gives a maximum of 251 IAM chains per object in SQL Server 2000 and before. I usually generalize and say that in SQL Server 2000, there's one IAM chain per index (which fits nicely if you remember that IAM stands for Index Allocation Map).

Allocation units (SQL Server 2005 and later)

Now in SQL Server 2005 and later, things have changed a lot. IAM chains and IAM pages are exactly the same, but what they correspond to is different. A table can now have up to 750000 IAM chains! There are now three things that IAM chains map space allocations for:

  1. heaps and b-trees (a b-tree is the internal structure used to store an index)
  2.  LOB data
  3. row-overflow data

and we now call these units of space allocation tracking allocation units. The internal names for these three types of allocation unit are (respectively):

  1. hobt allocation unit (Heap Or B-Tree, pronounced 'hobbit', yes, as in Lord Of The Rings)
  2. LOB allocation unit
  3. SLOB allocation unit (Small-LOB or Short-LOB)

and the external names are, respectively:

  1. IN_ROW_DATA allocation unit
  2. LOB_DATA allocation unit
  3. ROW_OVERFLOW_DATA allocation unit

They couldn't really continue to be called IAM chains, because they're no longer tracking space allocation for an index. However, they're chain of IAM pages is still called an IAM chain, and the unit of tracking is now called an allocation unit. Apart from that, there's no difference.

Let's have a quick look at three new features in SQL Server 2005 that made these changes necessary and boosted the number of potential IAM chains per table.

Included Columns
This is the ability for non-clustered indexes to include non-key columns at the leaf-level. This is useful for three reasons:

  1. Iit allows a non-clustered index to truly cover a query where the query results include more than 16 columns or the combination of column lengths in the query results is greater than 900 bytes (remember that a non-clustered index key is limited to 16 columns and 900 bytes).
  2. It allows columns to be include in the non-clustered index that have data types that cannot be part of an index key (e.g. varchar(max) or XML).
  3. It allows a non-clustered index to cover a query without having to have all the columns included in the index key. As the index key is included in rows at all levels of the b-tree, this allows the index to be smaller.

An example of space saving: imagine a 100 million row index, with a key length of 900 bytes, but only the first two integer keys are really needed as the index key, the other 4 fixed-length columns could be stored in the index as included columns. With the 900 byte key, 8 rows can fit per database page (i.e. the fanout is 8). This means there will be 12500000 pages at the leaf level, 1562500 pages at the next level up in the b-tree and so on, giving a total of 12500000 + 1562500 + 195313 + 24415 + 3052 + 382 + 48 + 6 + 1 = 14285717 pages (including 1785717 to store the upper levels of the b-tree).

If we go with the included columns method then the key size shrinks to 8 bytes, and with the row overhead we can get the row length in the upper levels of the b-tree down to 15 bytes (giving a fanout of approx. 537). Note that the fanout at the leaf-level is still going to be 8,  because the amount of data stored in each row at the leaf-level is the same. So, this means there will be 12500000 pages at the leaf level, 23278 pages at the next level up and so on, giving a total of 12500000 + 23278 + 44 + 1 = 12523323 pages (including 23323 to store the upper levels of the b-tree). Compared to the full-size 900-byte key, this is a 12% saving of 1762394 pages, or 13.6GB! Obviously this is a contrived case but you can see how the savings can occur.

The main reason for adding this feature it to enable true covering queries. A covering query is one where the query optimizer knows it can get all the query results from the non-clustered index and so the query can be satisfied without having to incur the extra IOs of looking up data in the base table - a significant performance saving.

Now that non-clustered indexes can have included columns, and those columns can be LOB data types (but only the new ones in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), and XML). This means that having a single LOB allocation unit (as in the case of the single text index in SQL Server 2000) isn't possible any more because each index may have its own set of LOBs. Now, you may ask why there isn't just a single set of LOBs with multiple references from various indexes plus the base table. We considered that during SQL Server 2005 development but it would have made things a lot more complicated.

So, with this new feature, each index needs two allocation units - one for the data or index rows (the hobt allocation unit) and one for any LOB data.

Large Rows

One of the things that has plagued schema designers for a long time is the 8060 byte limit on table row sizes so this restriction was removed in SQL Server 2005. The way this is done is to allow variable-length columns (e.g. varchar, sqlvariant) to get pushed off-row when the row size gets too big to fit on a single page.

But where do these column values get pushed to? They're effectively turned into mini LOB columns. The column value in the row is replaced with a 16-byte pointer to the off-row column value, which is stored as if its a LOB value in a seperate allocation unit - the row-overflow (or SLOB) allocation unit. These values are stored in text pages in exactly the same way as regular LOB values are, just using a separate allocation unit. The SLOB allocation unit is only created when the first column value is pushed off-row.

This feature works for non-clustered indexes too - if you consider the ability to have included columns in non-clustered indexes then you could easily have non-clustered index rows that won't fit on a page. It would have been short-sighted of to get rid of the 900-byte limit and replace it with an 8060-byte limit by not extending the row-overflow feature to non-clustered indexes too.

Now with the addition of this new feature, each index can have up to three allocation units - hobt, LOB, and SLOB. Even with this, that only makes a maximum of 750 IAM chains per table (remember an IAM chain now maps the storage allocations for an allocation unit, so 250 indexes * 3 allocation units = 750 IAM chains). But I mentioned 750 thousand IAM chains per table earlier - where do all the rest come from?

Partitioning

This is what gives us the 1000x multiplier. As you may already know, partitioning is the new feature that allows tables and indexes to be split into a series of ranges, with each range stored separately (most commonly in seperate filegroups). Partitioning is a topic for a separate post.

If each range or partition of the table or index is stored seperately, then each is going to need its own hobt allocation unit. Of course, the LOB values associated with each partition need to be stored with it, and so each partition also needs a LOB allocation unit. Also, the row-overflow feature is per-row, and so rows in each partition will overflow into SLOB allocation units just as for un-partitioned tables and indexes. Thus each partition of a table or index can have up to three allocation units (and hence three IAM chains).

Still, where does that 1000 come in? Each table or index can have up to 1000 partitions. This gives us 250 indexes x 1000 partitions x 3 allocation units = 750000 IAM chains. Realistically this probably won't happen, but it is possible.

Wednesday, October 03, 2007 4:42:23 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

In SQL Server 2005, queries over partitioned tables use a single-thread per partition. This can cause performance problems under certain circumstances:

  1. On systems with many CPUs, if the table schema has less partitions than there are CPUs, then not all the CPUs will be used to process the query. Some examples:
    1. On a 32-way box, a query over a 12-partition table (e.g. a sales table partitioned by month) will only use 12 threads (one on each of 12 CPUs). This means 20 CPUs are potentially idle.
    2. On a 4-way box, a query over a 12-partition table only accesses a single partition so will only use one thread (on one CPU). This means 3 CPUs are potentially idle.
  2. On tables that have skewed data such that one partition is much larger than another, the length of time the query takes to complete will be bounded by the single thread processing the largest partition.

As part of the set of improvements in SQL Server 2008 for data warehousing there will be an option to change the threading behavior for queries over partitioned tables. The new, alternative model is that all available threads process part of each partition and then move into the next partition. This allows all available CPUs to take part in processing the query, which should lead to a drop in the query completion time.

The only time this model won't work is if the data is not in the buffer pool and is not spread out evenly across the available drives. For example, if an entire partition is stored on a single drive, then multiple threads will be scanning different portions of the drive, causing the disk head to thrash and IO throughput to drop sharply compared with a single thread driving the IO. For this reason, the option to use the new model will be off by default, to avoid surprising people with sudden bad performance after upgrading.

This should be available in the next CTP and then I'll post again with some example datasets and queries to see what the potential benefits and drawbacks are.

Wednesday, October 03, 2007 2:40:26 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 

This one's a quickie.

In the previous post I explained about database pages - their structure and some page types. Now I'd like to explain how pages are grouped into units called extents. An extent is a group of eight physically consecutive pages in a data file. Extents are always aligned on 64KB boundaries (i.e. 8-page boundaries), starting at the beginning of the file. Extents, and all their properties, are exactly the same in SQL Server 2000 and 2005.

There are two types of extents: mixed extents and dedicated (or uniform) extents.

Mixed extents

The first 8 pages that are allocated to any IAM chain (either for an index in SQL Server 2000 or allocation unit in SQL Server 2005) are single-page allocations, which are called mixed pages. This means that only a single page is allocated to the IAM chain at a time, rather than a whole extent. The rationale behind this is to allow very small tables to take up the minimum amount of space. Once an IAM chain has passed the 8-page threshold and switched to allocating dedicated extents, it will never go back to allocating mixed pages again.

These mixed pages are allocated from mixed extents that are not allocated to any particular IAM chain. The extent is tracked as being allocated globally (by a GAM page), so no IAM chain can allocate it. If the mixed extent has any unallocated pages in it, it will also be tracked by an SGAM page. When a mixed page is required to be allocated, the SGAM pages are checked to see if any such extents are available. If not, a new mixed extent is allocated, a page is allocated from it and then the extent is tracked by the relevant SGAM page until all its pages are allocated.

As the mixed extent is not allocated to any particular IAM chain, this means that it may hold pages allocated to possibly 8 separate IAM chains. IAM pages themselves are always mixed pages, regardless of how many pages have been allocated to an IAM chain. This means a mixed extent may hold a variety of page types too,including IAM, data, index, or text pages.

Dedicated/Uniform extents

Once the 8-page threshold is passed, all further allocations to an IAM chain are from dedicated extents. This means that an extent at a time is allocated to an IAM chain and marked as such in one of the IAM pages in the IAM chain - whichever one maps the GAM interval that the extent is in. The extent is also tracked as being allocated by the relevant GAM page so no other IAM chain can allocate it.

All pages from a dedicated extent must be allocated to the same IAM chain. They do not all need to be the same type, however. For instance, a clustered index has a mixture of data and index pages. Also, when an extent is allocated to an IAM chain the pages in the extent are not all allocated at once (except for certain bulk operations). The pages are usually allocated as needed. The allocation state of each page is tracked using a PFS page.

When all the pages from a dedicated extent are deallocated, the extent itself is deallocated from its owning IAM chain and is available again for allocation to any of IAM chain, or to become a mixed extent.

Tracking changes to extents for backup

Changes to extents are tracked in two places:

  • Any extent that has been changed since the last full or differential backup will be tracked in the relevant differential bitmap page. This is how a differential backup knows which extents to backup instead of the whole database. All diff map pages are reset when the next backup is taken.
  • Any extent that has been changed by a bulk-logged operation since the last full, differential, or log backup will be tracked in the relevant minimally-logged bitmap page. Any log backup that's taken after a bulk-logged operation will also include all extents tracked this way. All ML map pages are reset when the next backup is taken.

More on PFS, GAM, SGAM, ML, and DIFF pages in a not-too-distant future post.

Wednesday, October 03, 2007 2:13:38 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

Next up in the Inside the Storage Engine series is a discussion of page structure. Pages exist to store records. A database page is an 8192-byte (8KB) chunk of a database data file. They are aligned on 8KB boundaries within the data files, starting at byte-offset 0 in the file. Here's a picture of the basic structure:

page.gif

Header

The page header is 96 bytes long. What I'd like to do in this section is take an example page header dump from DBCC PAGE and explain what all the fields mean. I'm using the database from the page split post and I've snipped off the rest of the DBCC PAGE output.

DBCC TRACEON (3604)

DBCC PAGE ('pagesplittest', 1, 143, 1);

GO

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 68     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042384384                                
Metadata: PartitionId = 72057594038386688                                 Metadata: IndexId = 1
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (1:154)
pminlen = 8                          m_slotCnt = 4                        m_freeCnt = 4420
m_freeData = 4681                    m_reservedCnt = 0                    m_lsn = (18:116:25)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 1333613242
             

Here's what all the fields mean (note that the fields aren't quite stored in this order on the page):

  • m_pageId
    • This identifies the file number the page is part of and the position within the file. In this example, (1:143) means page 143 in file 1.
  • m_headerVersion
    • This is the page header version. Since version 7.0 this value has always been 1.
  • m_type
    • This is the page type. The values you're likely to see are:
      • 1 - data page. This holds data records in a heap or clustered index leaf-level.
      • 2 - index page. This holds index records in the upper levels of a clustered index and all levels of non-clustered indexes.
      • 3 - text mix page. A text page that holds small chunks of LOB values plus internal parts of text tree. These can be shared between LOB values in the same partition of an index or heap.
      • 4 - text tree page. A text page that holds large chunks of LOB values from a single column value.
      • 7 - sort page. A page that stores intermediate results during a sort operation.
      • 8 - GAM page. Holds global allocation information about extents in a GAM interval (every data file is split into 4GB chunks - the number of extents that can be represented in a bitmap on a single database page). Basically whether an extent is allocated or not. GAM = Global Allocation Map. The first one is page 2 in each file. More on these in a later post.
      • 9 - SGAM page. Holds global allocation information about extents in a GAM interval. Basically whether an extent is available for allocating mixed-pages. SGAM = Shared GAM. the first one is page 3 in each file. More on these in a later post.
      • 10 - IAM page. Holds allocation information about which extents within a GAM interval are allocated to an index or allocation unit, in SQL Server 2000 and 2005 respectively. IAM = Index Allocation Map. More on these in a later post.
      • 11 - PFS page. Holds allocation and free space information about pages within a PFS interval (every data file is also split into approx 64MB chunks - the number of pages that can be represented in a byte-map on a single database page. PFS = Page Free Space. The first one is page 1 in each file. More on these in a later post.
      • 13 - boot page. Holds information about the database. There's only one of these in the database. It's page 9 in file 1.
      • 15 - file header page. Holds information about the file. There's one per file and it's page 0 in the file.
      • 16 - diff map page. Holds information about which extents in a GAM interval have changed since the last full or differential backup. The first one is page 6 in each file.
      • 17 - ML map page. Holds information about which extents in a GAM interval have changed while in bulk-logged mode since the last backup. This is what allows you to switch to bulk-logged mode for bulk-loads and index rebuilds without worrying about breaking a backup chain. The first one is page 7 in each file.
  • m_typeFlagBits
    • This is mostly unused. For data and index pages it will always be 4. For all other pages it will always be 0 - except PFS pages. If a PFS page has m_typeFlagBits of 1, that means that at least one of the pages in the PFS interval mapped by the PFS page has at least one ghost record.
  • m_level
    • This is the level that the page is part of in the b-tree.
    • Levels are numbered from 0 at the leaf-level and increase to the single-page root level (i.e. the top of the b-tree).
    • In SQL Server 2000, the leaf level of a clustered index (with data pages) was level 0, and the next level up (with index pages) was also level 0. The level then increased to the root. So to determine whether a page was truly at the leaf level in SQL Server 2000, you need to look at the m_type as well as the m_level.
    • For all page types apart from index pages, the level is always 0.
  • m_flagBits
    • This stores a number of different flags that describe the page. For example, 0x200 means that the page has a page checksum on it (as our example page does) and 0x100 means the page has torn-page protection on it.
    • Some bits are no longer used in SQL Server 2005.
  • m_objId
  • m_indexId
    • In SQL Server 2000, these identified the actual relational object and index IDs to which the page is allocated. In SQL Server 2005 this is no longer the case. The allocation metadata totally changed so these instead identify what's called the allocation unit that the page belongs to (I'll do another post that describes these later today).
  • m_prevPage
  • m_nextPage
    • These are pointers to the previous and next pages at this level of the b-tree and store 6-byte page IDs.
    • The pages in each level of an index are joined in a doubly-linked list according to the logical order (as defined by the index keys) of the index. The pointers do not necessarily point to the immediately adjacent physical pages in the file (because of fragmentation).
    • The pages on the left-hand side of a b-tree level will have the m_prevPage pointer be NULL, and those on the right-hand side will have the m_nextPage be NULL.
    • In a heap, or if an index only has a single page, these pointers will both be NULL for all pages.
  • pminlen
    • This is the size of the fixed-length portion of the records on the page.
  • m_slotCnt
    • This is the count of records on the page.
  • m_freeCnt
    • This is the number of bytes of free space in the page.
  • m_freeData
    • This is the offset from the start of the page to the first byte after the end of the last record on the page. It doesn't matter if there is free space nearer to the start of the page.
  • m_reservedCnt
    • This is the number of bytes of free space that has been reserved by active transactions that freed up space on the page. It prevents the free space from being used up and allows the transactions to roll-back correctly. There's a very complicated algorithm for changing this value.
  • m_lsn
    • This is the Log Sequence Number of the last log record that changed the page.
  • m_xactReserved
    • This is the amount that was last added to the m_reservedCnt field.
  • m_xdesId
    • This is the internal ID of the most recent transaction that added to the m_reservedCnt field.
  • m_ghostRecCnt
    • The is the count of ghost records on the page.
  • m_tornBits
    • This holds either the page checksum or the bits that were displaced by the torn-page protection bits - depending on what form of page protection is turnde on for the database.

Note that I didn't include the fields starting with Metadata:. That's because they're not part of a page header. During SQL Server 2005 development I did some major work rewriting the guts of DBCC PAGE and to save everyone using it from having to do all the system table lookups to determine what the actual object and index IDs are, I changed DBCC PAGE to do them internally and output the results.

Records

See this blog post for details.

Slot Array

It's a very common misconception that records within a page are always stored in logical order. This is not true. There is another misconception that all the free-space in a page is always maintained in one contiguous chunk. This also is not true. (Yes, the image above shows the free space in one chunk and that very often  is the case for pages that are being filled gradually.)

If a record is deleted from a page, everything remaining on the page is not suddenly compacted - inserters pay the cost of compaction when its necessary, not deleters.

Consider a completely full page - this means that record deletions cause free space holes within the page. If a new record needs to be inserted onto the page, and one of the holes is big enough to squeeze the record into, why go to the bother of comapcting it? Just stick the record in and carry on. What if the record should logically have come at the end of all other records on the page, but we've just inserted it in the middle - doesn't that screw things up somewhat?

No, because the slot array is ordered and gets reshuffled as records are inserted and deleted from pages. As long as the first slot array entry points to the logically first record on the page, everything's fine. Each slot entry is just a two-byte pointer into the page - so its far more efficient to manipulate the slot array than it is to manipulate a bunch of records on the page. Only when we know there's enough free space contained within the page to fit in a record, but its spread about the page do we compact the records on the page to make the free space into a contiguous chunk.

One interesting fact is that the slot array grows backwards from the end of the page, so the free space is squeezed from the top by new rows, and from the bottom by the slot array.

Wednesday, October 03, 2007 10:49:00 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5]  | 

ITForum.gifITForum.gif

 

 

 

November's going to be a busy month for Kimberly and I! Hot on the heels of SQL Connections in Las Vegas is the yearly pilgrimage to Barcelona for TechEd IT Forum and this year we're packing in a ton of sessions in between us.

 

Here's what we're doing:

Sessions

  • (DAT205) The Next Release of Microsoft SQL Server: Manageability Overview

The next release of SQL Server will contain exciting new manageability features targeted at reducing total cost of ownership. Come learn more about what's in store in areas like policy-driven administration and performance data collection and analysis. The session focuses on the database engine.

  • (DAT301) SQL Server Indexing - Unravelling the Unknown

Knowing tips and tricks to indexing is extremely helpful and will help you to solve "known" query problems. But what's lurking in the unknown? Is SQL Server using your indexes? Or, do you have a bunch of indexes sitting around wasting space and negatively impacting performance? Finally, SQL Server 2005 has an answer! SQL Server 2005 DMVs (Dynamic Management Views) can provide you with valuable information about your current indexing strategies, what should be removed, and even what's missing. Do you know how to find this information, leverage it, and then programmatically respond to it? Come to this session to learn a few tips and tricks as well as how to figure out the unknown indexing problems!

  • (DAT305) Secrets to Fast Detection and Recovery from Database Corruptions

How can you tell whether your data is corrupt? If you have corruption, how do you work out what’s wrong with the database? How do you ensure you have a valid backup? If you don’t have a valid backup, how and what do you repair? If you do have a backup, how do you work out whether you should restore or repair? And at what granularity? How do you go about determining what went wrong in the first place? It’s all about limiting downtime and data-loss when a corruption occurs - from knowing the tools to understanding the choices to planning a successful strategy. Some of the features discussed:

  • Torn-page detection and page checksums
  • IO read-retry
  • Backup checksums
  • Consistency checks (DBCC CHECKDB and related commands)
  • Database repairs

Facing database corruption is almost inevitable in every DBAs career - make sure you're prepared when it happens to you.

Chalk-talks

  • (DAT01-IS) SQL Server Upgrade Best Practices, Tips, and Tricks

Even though SQL Server 2005 has been out for a while, many companies are only just getting ready to upgrade. Come to this session to learn best practices, tips, and tricks distilled from two years of customer experiences. We'll also touch on some issues you'll face going to SQL Server 2008 when it's released next year. Come to this session to learn and share - bring your questions and experiences!

  • (DAT02-IS) SQL Server 2005 Database Mirroring: Setup to Implementation to Monitoring

Database Mirroring is one of the most exciting technologies in SQL Server 2005. With more and more people including it in their Disaster Recovery Strategies, it's important to know when to implement Database Mirroring as well as the implications of the architecture you choose. In this session there will be no slides just demos that will explore how Database Mirroring works in its various configurations and how that may effect your performance. Join us and see database mirroring in action and get your questions answered!

  • (DAT07-IS) DBCC Internals

All DBAs should have heard of (and used) DBCC – especially for consistency checking. Get down deeper than 400-level with this session on how some of the most important DBCC commands work. Topics covered include CHECKDB, SHRINKFILE, INDEXDEFRAG, and more.

Instructor-led labs

  • (DAT03-ILL) SQL Server Always On Technologies Instructor-Led Lab: Part 1 - Database Mirroring

See Database Mirroring in action! From implementation to monitoring to failover, database mirroring provides an ideal solution for many disaster recovery scenarios and this session will prepare you to handle them with minimal downtime or data loss.

  • (DAT04-ILL) SQL Server Always On Technologies Instructor-Led Lab: Part 2 - Database Snapshots

Database Snapshots are useful in many situations: database maintenance, data recovery, and point-in-time data access. You can even create a snapshot on a mirror database to get better return-on-investment (ROI) on your high-avaialbility (HA) investments. In this session we will explain how database snapshots work as well as go through several exercises, including working with multiple database snapshots and creating database snapshots on a mirror database.

  • (DAT05-ILL) SQL Server Always On Technologies Instructor-Led Lab: Part 3 - Online Operations

The bane of any DBA's life is to have to take data offline to perform maintenance or recover from a disaster. The various Online Operations in SQL Server 2005 alleviate much of this frustration. This session will show you how to move a table online for better isolation and control, partition a table online, access a database that's partially damaged, and perform online piecemeal restore.

Lunchtime Demos

  • (DAT01-PD) Database Recovery Techniques

In this fast-paced demo session nasty things will be done repeatedly to database. Then the methods and approaches to recovery will be shown. Not for the faint-hearted!

So, just like SQL Connections the week before, serious amounts of info, tips, and tricks for you to geek-out on with us and take home to use. We hope to see a lot of you there!

Paul (and Kimberly)

Wednesday, October 03, 2007 7:58:18 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Tuesday, October 02, 2007

This is a follow-on article from two posts:

People get themselves into situations where they have no backups (or damaged backups) and the data or log files are damaged such that the only way to access the database is with EMERGENCY mode. In these situations, prior to SQL Server 2005, there was no documented or supported way to fix a database while in EMERGENCY mode - the only guidance could be found on the Internet or from calling Product Support and paying for help. The sequence of events was:

  1. Hack the system tables to get the database into 'emergency' mode.
  2. Use the undocumented and unsupported DBCC REBUILD_LOG command to build a new transaction log.
  3. Run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option to fix up corruptions in the data files - both those that may have caused the issue, and those caused by rebuilding the transaction log (e.g. because an active transaction altering the database structure was lost).
  4. Figure out what data was lost or is transactionally inconsistent (e.g. because a transaction altering multiple tables was lost) as far as your business logic is concerned
  5. Take the database out of emergency mode
  6. And then all the other stuff like root-cause analysis and getting a better backup strategy

I decided to add a new feature to SQL Server 2005 called EMERGENCY mode repair that will do steps 2 and 3 as an atomic operation. The reasons for this were:

  • Much of the advice of how to do this on the Internet missed steps out (particularly missing step 3!)
  • The DBCC REBUILD_LOG command was unsupported and undocumented and we didn't like advising customers to use it
  • Adding a documented last-resort method of recovering from this situation would reduce calls to Product Support - saving time and money for customers and Microsoft.

So, when in EMERGENCY mode, you can use DBCC CHECKDB to bring the database back online again. The only repair option allowed in EMERGENCY mode is REPAIR_ALLOW_DATA_LOSS and it does a lot more than usual:

  • Forces recovery to run on the transaction log (if it exists). You can think of this as 'recovery with CONTINUE_AFTER_ERROR' - see this post for more details on the real CONTINUE_AFTER_ERROR option for BACKUP and RESTORE. The idea behind this is that the database is already inconsistent because either the transaction log is corrupt or something in the database is corrupt in such a way that recovery cannot complete. So, given that the database is inconsistent and we're about to rebuild the transaction log, it makes sense to salvage as much transactional information as possible from the log before we throw it away and build a new one.
  • Rebuild the transaction log - but only if the transaction log is corrupt.
  • Run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option.
  • Set the database state to ONLINE.

It's a one-way operation and can't be rolled back. I always advise taking a copy of the database files before doing this in case something goes wrong or there are unrepairable errors. And if it does? Probably time to update your resume for not having a water-tight backup and disaster-recovery strategies in place. Saying that, I've never seen it fail. I can think of some pathalogical cases where it would fail though (involving the file system itself having problems) but that's really unlikely.

Let's walk-through an example of using it. I'm assuming there's a database called emergencydemo that's in the same state as at the end of the Search Engine Q&A #4 blog post - the database has no log file, is in EMERGENCY mode and the salaries table is corrupt.

First off I'll try bringing the database online, just to see what happens:

ALTER DATABASE emergencydemo SET ONLINE;

GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF" may be incorrect.

The log cannot be rebuilt because the database was not cleanly shut down.

Msg 945, Level 14, State 2, Line 1

Database 'emergencydemo' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Failed to restart the current database. The current database is switched to master.

Msg 5069, Level 16, State 1, Line 1

ALTER DATABASE statement failed.

The first message makes sense - the database knows it needs to be recovered because it wasn't cleanly shut down, but the log file simply isn't there. The second message is from the new feature in 2005 that will automatically create a log file if one is missing on startup or attach - as long as the database was cleanly shut down. The 945 and 5069 errors are self-explanatory but notice that the database gets switched to master underneath us. This has bitten me several times in the past.

Well, I expected that not to work. Let's run EMERGENCY mode repair:

DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;

GO

Msg 945, Level 14, State 2, Line 1

Database 'emergencydemo' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Hmm - looks like the failed ALTER DATABASE statement did change the state - but what to?

SELECT state_desc FROM sys.databases WHERE name='emergencydemo';

GO

state_desc
------------------------------------------------------------
RECOVERY_PENDING

That makes sense I guess. Ok - back to emergency mode and run repair:

ALTER DATABASE emergencydemo SET EMERGENCY;

GO

DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;

GO

Msg 7919, Level 16, State 3, Line 1

Repair statement not processed. Database needs to be in single user mode.

EMERGENCY mode is not SINGLE_USER mode - a database needs to be in SINGLE_USER mode for repair to run and EMERGENCY mode allows multiple connections from members of the sysadmin role. You can set SINGLE_USER mode as well as EMERGENCY mode - however, the sys.databases field state_desc will still just say EMERGENCY.

ALTER DATABASE emergencydemo SET SINGLE_USER;

GO

DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;

GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF" may be incorrect.

The log cannot be rebuilt because the database was not cleanly shut down.

Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

This time it worked. First of all we get the same error as if we tried to bring the database online - that's from the code that's trying to run 'recovery with CONTINUE_AFTER_ERROR' on the transaction log. Next we get a nice long warning that the transaction log has been rebuilt and the consequences of doing that (basically that you need to start a new log backup chain by taking a full backup). If there had been any corruptions we'd have seen the usual output from DBCC CHECKDB about what errors it found and fixed. There's also a bunch of stuff in the error log

2007-10-02 17:21:20.95 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:20.96 spid51      Error: 17207, Severity: 16, State: 1.
2007-10-02 17:21:20.96 spid51      FileMgr::StartLogFiles: Operating system error 2(The system cannot find the file specified.) occurred while creating or opening file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF'. Diagnose and correct the operating system error, and retry the operation.
2007-10-02 17:21:20.96 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:21.10 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:21.18 spid51      Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.
2007-10-02 17:21:21.18 spid51      Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.
2007-10-02 17:21:21.99 spid51      EMERGENCY MODE DBCC CHECKDB (emergencydemo, repair_allow_data_loss) WITH no_infomsgs executed by ROADRUNNERPR\paul found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 1 seconds.

Note that the usual error log entry from running DBCC CHECKDB is preceded by 'EMERGENCY MODE' this time.

Checking the database state:

SELECT state_desc FROM sys.databases WHERE name='emergencydemo';

GO

state_desc
------------------------------------------------------------
ONLINE

we find that it's been brought back online again because everything worked. It's still SINGLE_USER though so let's make it MULTI_USER and see what happened to our table:

ALTER DATABASE emergencydemo SET MULTI_USER;

GO

USE EMERGENCYDEMO;

GO

SELECT * FROM salaries;

GO

FirstName            LastName             Salary
-------------------- -------------------- -----------
John                 Williamson           10000
Stephen              Brown                0
Jack                 Bauer                10000

(3 row(s) affected)

And of course its still corrupt - because even though the transaction log was rebuilt and repaired, the original transaction that changed the salary to 0 never got a chance to rollback becuase I deleted the transaction log (in the previous post).

Now remember, you should only use this as a last resort, but if you do get yourself into trouble, you know there's a command that should be able to help you.

Tuesday, October 02, 2007 3:26:42 PM (Pacific Standard Time, UTC-08:00)  #    Comments [7]  | 

SQL Server supports lock escalation - when the server decides to move from a large number of row or page locks on an object to a table-level lock. Sunil Agarwal posted a great description of lock escalation in SQL Server 2005 on the Storage Engine blog so I won't repeat it all here.

The problem with lock escalation is that it can be tricky to manage on systems that have conflicting requirements.

Disabling lock escalation

For example, if a table needs to support large batch updates with concurrent user queries, then having the batch update cause an escalation to a table-level exclusive lock prevents the user queries from running. There are a couple of documented trace flags that can be used to disable lock escalation:

  • 1211 - disables lock escalation totally and will allow lock memory to grow to 60% of dynamically allocated memory (non-AWE memory for 32-bit and regullar memory for 64-bit) and will then further locking will fail with an out-of-memory error
  • 1224 - disables lock escalation until 40% of memory is used and then re-enables escalation

The problem with these two trace flags are that they are instance-wide and turning them on can cause huge performance issues if a poorly-written application takes too many locks. It's not possible to disable lock escalation for a single table - until now!

SQL Server 2008 includes the ability to disable lock escalation per-table!! This is a fantastic step forward in concurrency management.

Changing the escalation mechanism

To extend the example above, what about if the table has multiple partitions? With the batch update only affecting a single partition and concurrent user queries going against other partitions, the escalation policy in SQL Server 2005 means that the batch update will escalate to a table-level exclusive lock and freeze out the user queries, even though they're going against different partitions. The only recourse is to disable lock escalation - until now!

SQL Server 2008 includes the ability to specify partition-level lock escalation instead of table-level lock escalation. And this is per-table! Very cool.

Summary

SQL Server 2008 will have ALTER TABLE syntax to specify per-table lock escalation management. The options will be:

  • Automatic determination of the level to escalate to. If the table is partitioned, locks will be escalated to the partition-level.
  • Table-level lock escalation (even if the table is partitioned).
  • Disable lock escalation

Once this feature is available in a CTP I'll blog about the syntax and supporting infrastructure, along with some examples.

Tuesday, October 02, 2007 2:06:34 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

By far the most common search engine query leading to the blog is about fixing a suspect or unrecovered database. The very best way to do this is to use your backups, and to have a backup strategy that allows you to recover in the smallest amount of time and with no data loss. But what if you don't have a backup for some reason? Well, it depends what's damaged in the database and when the damage is noticed.

There are three states the database can be in when its damaged:

  1. ONLINE
    • If it's one of the data files that is damaged and the damage is hit during a query or some other operation then the database will remain online and accessible.
  2. RECOVERY PENDING
    • The database will be in this state if SQL Server knows that recovery needs to be run on the database but something is preventing recovery from starting. This is different from SUSPECT because there's nothing to say that recovery is going to fail - it just hasn't started yet.
    • An example of this is when the database wasn't cleanly shut down (i.e. there was at least one uncommitted transaction active at the time the database was shut down) and the log file has been deleted.
  3. SUSPECT
    • The database will be in this state if the transaction log is damaged and it prevents recovery or a transaction rollback from completing.
    • Some examples of this are:
      • When the database wasn't cleanly shut down and recovery tries to read a corrupt data page or comes across a corrupt log record.
      • A regular transaction rolls back and tries to read a corrupt data page or comes across a corrupt log record.

You can check the state of a database in the sys.databases catalog view:

SELECT state_desc FROM sys.databases WHERE name = 'master';

GO

or by using the DATABASEPROPERTYEX function:

SELECT DATABASEPROPERTYEX ('master', 'STATUS');

GO

Beware however, as DATABASEPROPERTYEX returns SUSPECT when the database is RECOVERY PENDING, as I'll show you below.

So the state the database is in determines what you can do if you don't have a backup. The easiest case is when it's still ONLINE. In this case you can probably run repair to remove the damage, most likely with some data loss (see my previous post on REPAIR_ALLOW_DATA_LOSS for more details), and then take steps to prevent the damage occuring again. If repair can't fix all the errors then your only option without a backup is to extract as much data as you can into a new database.

The other two database states are more difficult and are what's causing people to search for help. In this case the database isn't accessible at all, because recovery hasn't run or completed and so the database is in an inconsistent state. It could just be logically inconsistent (e.g. a transaction modifying data hasn't recovered) or worse it could structurally inconsistent (e.g. a system transaction modifying index linkages has't recovered). Either way, SQL Server wants to prevent you from getting into the database because it doesn't know what state the data and structures in the database are in. But if you don't have a backup, you need to get into the database, no matter what state things are in.

You can do this using EMERGENCY mode. In versions prior to SQL Server 2005, EMERGENCY mode wasn't documented and you had to hack the sysdatabases table to get a database into it (worse still, the exact hack to use was changed from version to version in a bid to obfuscate things). In 2005 though, EMERGENCY mode was documented and proper syntax added to support it. Members of the sysadmin role can put the database into EMERGENCY mode using:

ALTER DATABASE foo SET EMERGENCY;

GO

Once in EMERGENCY mode, the database is accessible only by members of the sy