Monday, October 01, 2007

Ok - I'm on a roll today so to finish off I'd like to repost some info about database mirroring failover types (including how you may not actually get a failover when you expect it) and a tip for how to avoid unwanted failovers when combining clustering and mirroring. Mirroring failovers have cropped up several times in my blog's search engine logs so this is an easy one to answer. This is based on a TechEd post from June on the Storage Engine blog.

Clustering, mirroring and altering the partner timeout

The first question is about combining failover clustering and database mirroring. A customer wanted to put the instance hosting the principal database on a failover cluster and have the mirror database on an unclustered server. The issue he saw with the setup was when something happened on the active cluster server and it failed over. The failover would take 60-90 seconds, which means the principal database would not be available during that time.

The problem stems from the way mirroring figures out whether to initiate a failover. The mechanism it uses is to have all SQL Server instances within a mirroring session send out 'pings' on each of its open mirroring connections every second. If an instance does not receive a ping from one of the other instances in the session within a defined time-out period (called the partner time-out), it considers the other instance as unavailable and takes the appropriate action. The default time-out period is ten seconds.

In the case above, when the cluster is failing over and the principal instance is unavailable for 60-90 seconds, the other instances in the mirroring session will time-out after ten seconds and initiate a mirroring failover. Although this is the correct behavior of mirroring, it is undesirable behavior for this customer. The solution to the problem is to change the partner time-out value - a facility that isn't well known.

To change the partner time-out value for a mirroring session, use the following code:

ALTER DATABASE mydatabase SET PARTNER TIMEOUT 90;

GO

The value at the end of the statement is the new partner timeout value in seconds. There are a couple of things to be aware of if you're going to do this:

  • You can only issue this statement on the principal server.
  • Be very careful not to set the time-out value too low otherwise you run the risk of triggering failovers becuase of false-failures - especially on heavily-loaded systems that may not be able to respond within the time-out period. In fact, if you specify a time-out value of 4 seconds or lower, SQL Server will automatically set the time-out to 5 seconds to help avoid such problems.

Database mirroring failover types

What are the different kinds of failures that can trigger mirroring failovers, and how quickly does the failover happen after the problem occurs? As with most questions I get, I can use my favorite answer of "It depends!" :-) Let's look at some examples of failures and see how quickly the failover occurs, in decreasing order of speed.

  • Fastest: The fastest possible failover occurs when the SQL Server instance crashes (so the mirroring connection endpoint no longer exists) but the operating system is still running. When the next ping comes from a partner instance, the OS knows that the network port (that was being used by the mirroring connection endpoint) is no longer being listened to and returns a failure. This immediately triggers a failover.
  • Fast: The next fastest failover occurs when the machine hosting the SQL Server instance crashes or shuts down (e.g. power supply cord pulled out of the back). In this case, nothing happens until the partner time-out period has expired (as there's no OS running to return the immediate failure) and then a failover will occur. By default this will be in ten seconds, but you can change this as I explained above.
  • Slow: A slow failover is when something happens to the server but it takes a while for the database to go offline. An example of this would be someone pulling out the transaction log drive on the principal server. Writes to the transaction log will start to queue up. After 20 seconds SQL Server will issue an IO warning but it isn't until 40 seconds has passed that SQL Server issues an IO failure and the database goes suspect. It's not until this point that the failover occurs - even though the partner timeout value is 10 seconds!
  • Maybe fast or not at all: This discussion all started by someone asking about how quickly a failover happens if a page checksum failure is detected. My answer was - it depends! If the page checksum failure happens during a query, then all that happens is that the checksum failure is reported, the query rolls back, and the connection is broken. It's only if the page checksum failure happens during a transaction rollback that a failover will be triggered. This is because a transaction rollback failure means the database is in a transactionally inconsistent state and has to be set to suspect mode - which triggers a mirroring failover.

So, don't assume that just because mirroring is setup that every failure will trigger a fast failover.

Monday, October 01, 2007 3:31:44 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

Part of the improvements to database mirroring in SQL Server 2008 are the addition of 10 new performance counters under the Database Mirroring Performance Object. These add some cool troubleshooting capabilities.

SQL Server 2005 provided the following 11 counters (from Books Online):

Name Description

Bytes Received/Sec

Number of bytes received per second.

Bytes Sent/sec

Number of bytes sent per second.

Log Bytes Received/sec

Number of bytes of log received per second.

Log Bytes Sent/sec

Number of bytes of log sent per second.

Log Send Queue

Total number of bytes of log that have not yet been sent to the mirror server.

Pages Sent/sec

Number of pages sent per second.

Receives/sec

Number of mirroring messages received per second.

Redo Bytes/sec

Number of bytes of log rolled forward on the mirror database per second.

Redo Queue

Total number of bytes of hardened log that currently remain to be applied to the mirror database to roll it forward. This is sent to the Principal from the Mirror.

Sends/sec

Number of mirroring messages sent per second.

Transaction Delay

Delay in waiting for unterminated commit acknowledgement.

SQL Server 2008 now provides 21 counters, with the new ones highlighted in red. This info is taken from the 2008 July CTP 08Books Online that is downloadable here.

Name Description

Bytes Received/sec

Number of bytes received per second.

Bytes Sent/sec

Number of bytes sent per second.

Log Bytes Received/sec

Number of bytes of log received per second.

Log Bytes Redone from Cache/sec

Number of redone log bytes that were obtained from the mirroring log cache, in the last second.

This counter is used on only the mirror server. On the principal server the value is always 0.

Log Bytes Sent from Cache/sec

Number of sent log bytes that were obtained from the mirroring log cache, in the last second.

This counter is used on only the principal server. On the mirror server the value is always 0.

Log Bytes Sent/sec

Number of bytes of log sent per second.

Log Compressed Bytes Rcvd/sec

Number of compressed bytes of log received, in the last second.

Log Compressed Bytes Sent/sec

Number of compressed bytes of log sent, in the last second.

Log Harden Time (ms)

Milliseconds that log blocks waited to be hardened to disk, in the last second.

Log Remaining for Undo KB

Total kilobytes of log that remain to be scanned by the new mirror server after failover.

This counter is used on only the mirror server during the undo phase. After the undo phase completes, the counter is reset to 0. On the principal server the value is always 0.

Log Scanned for Undo KB

Total kilobytes of log that have been scanned by the new mirror server since failover.

This counter is used on only the mirror server during the undo phase. After the undo phase completes, the counter is reset to 0. On the principal server the value is always 0.

Log Send Flow Control Time (ms)

Milliseconds that log stream messages waited for send flow control, in the last second.

Sending log data and metadata to the mirroring partner is the most data-intensive operation in database mirroring and might monopolize the database mirroring and Service Broker send buffers. Use this counter to monitor the use of this buffer by the database mirroring session.

Log Send Queue KB

Total number of kilobytes of log that have not yet been sent to the mirror server.

Mirrored Write Transactions/sec

Number of transactions that wrote to the mirrored database and waited for the log to be sent to the mirror in order to commit, in the last second.

This counter is incremented only when the principal server is actively sending log records to the mirror server.

Pages Sent/sec

Number of pages sent per second.

Receives/sec

Number of mirroring messages received per second.

Redo Bytes/sec

Number of bytes of log rolled forward on the mirror database per second.

Redo Queue KB

Total number of kilobytes of hardened log that currently remain to be applied to the mirror database to roll it forward. This is sent to the Principal from the Mirror.

Send/Receive Ack Time

Milliseconds that messages waited for acknowledgement from the partner, in the last second.

This counter is helpful in troubleshooting a problem that might be caused by a network bottleneck, such as unexplained failovers, a large send queue, or high transaction latency. In such cases, you can analyze the value of this counter to determine whether the network is causing the problem.

Sends/sec

Number of mirroring messages sent per second.

Transaction Delay

Delay in waiting for unterminated commit acknowledgement.

Here's a little more explanation and what you can use these new performance counters to troubleshoot:

  • Log Bytes Redone from Cache/sec
    • This measures how much of the transaction log in the redo queue is being read by the log redo task from the mirror's in-memory transaction log cache. Reading from the cache is a lot faster than having to read from the mirror's actual transaction log. Even though the log gets hardened on the mirror database's log disk, it does not need to be removed from the cache until the cache fills up with new transaction log from the principal.
    • You could think of this as a cache hit ratio measure for the redo queue.
    • If this number is lower than usual, it means that transaction log is arriving from the principal faster than the log redo task can roll forward the transaction log in the redo queue.
  • Log Bytes Sent from Cache/sec
    • This is similar to the counter above. It measures how much of the transaction log being sent from the principal to the mirror is being read from the principal's in-memory transaction log cache. Sending from the cache is a lot faster than having to go to the transaction log itself and read from disk.
    • You could think of this as a cache hit ratio for the send queue.
    • If this number is lower than usual it means that the transaction log is being generated on the principal faster than it can be sent to the mirror.
  • Log Compressed Bytes Rcvd/sec
  • Log Compressed Bytes Sent/sec
    • These are simple counters and can be used with the Log Bytes Sent/sec and Rcvd/sec counters to determine the compression ratio. Log stream compression is another enhancement in SQL Server 2008 that I'll cover in a future post.
  • Log Harden Time (ms)
    • This measures the delay between the mirror server receiving a chunk of transaction log and it being hardened on the mirror database's log disk (i.e. the delay before the chunk of transaction log becomes part of the redo queue on the mirror server).
    • If this number is higher than normal it means the mirror database's log disk is more heavily loaded and may be  becoming saturated.
  • Log Remaining for Undo KB
  • Log Scanned for Undo KB
    • The Books Online entries for these counters are self-explanatory.
    • These counters give a way to monitor the undo phase after a failover occurs.
  • Log Send Flow Control Time (ms)
    • This measures how long a mirroring connection had to wait before it could us the mirroring flow control buffer.
    • If this number is higher than normal it means there is contention for the buffer, most likely because there are too many Database Mirroring partnerships running from a single instance.
  • Mirrored Write Transactions/sec
    • As Books Online mentions, this counts the number of transactions in the principal database that had to wait for a commit record to harden in the mirror database's transaction log.
    • If this value is lower than normal (for the same application workload) it means there is a bottleneck somewhere in the system.
  • Send/Receive Ack Time
    • As Books Online mentions, this can be used to measure network latency between the principal and mirror servers.
    • If this value is larger than normal it means that there is a network bottleneck between the principal and mirror servers.

Hopefully Microsoft will publish a whitepaper or some troubleshooting scenarios showing these counters being used.

Monday, October 01, 2007 12:31:38 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Sunday, September 30, 2007

Time for the first post in the Inside the Storage Engine series. I'm going to focus on SQL Server 2005 in this series and I'll point out major differences between 2005 and previous versions. Please drop me a line if there's something you'd like to see explained and demo'd.

Before jumping into how things work, I'd like to go over two commands I'll be using a lot - DBCC PAGE and DBCC IND. These are both undocumented and unsupported commands, but are very safe to use as they're used extensively inside and outside Microsoft when troubleshooting. Nevertheless, use at your own risk. They're quite well known in the SQL community and I and others have publicized them before (I even demo'd them last year at ITForum in Spain).

To illustrate their use, I'm going to use a simple script I wrote to prove that page splits never roll back. I was having a discussion with someone a while ago about this question and the answer is always no. A page split occurs when an insert or update has to happen at a certain point in an index page, and there's no room on the page to accomomodate the new or updated record. Page splits are done internally as separate 'system' transactions. Once a system transaction commits, it cannot be rolled back - even if the user transaction it was part of rolls back.

So, let's run through the script. First thing to do is create a database containing a table with an index (as page splits only happen in indexes).

USE MASTER;

GO

IF DATABASEPROPERTY (N'pagesplittest', 'Version') > 0 DROP DATABASE pagesplittest;

GO

CREATE DATABASE pagesplittest;

GO

USE pagesplittest;

GO

CREATE TABLE t1 (c1 INT, c2 VARCHAR (1000));

CREATE CLUSTERED INDEX t1c1 ON t1 (c1);

GO

Now I'm going to fill up a page in the index, but leave a gap in the c1 values so that I can force a page split by inserting the missing key value.

INSERT INTO t1 VALUES (1, REPLICATE ('a', 900));

INSERT INTO t1 VALUES (2, REPLICATE ('b', 900));

INSERT INTO t1 VALUES (3, REPLICATE ('c', 900));

INSERT INTO t1 VALUES (4, REPLICATE ('d', 900));

-- leave a gap at 5

INSERT INTO t1 VALUES (6, REPLICATE ('f', 900));

INSERT INTO t1 VALUES (7, REPLICATE ('g', 900));

INSERT INTO t1 VALUES (8, REPLICATE ('h', 900));

INSERT INTO t1 VALUES (9, REPLICATE ('i', 900));

GO

I can find out what the first index page is using the DBCC IND command:

DBCC IND ('pagesplittest', 't1', 1);

GO

This command list all the pages that are allocated to an index. Here's the output in this case:

SEQA3.jpg

The columns mean:

  • PageFID - the file ID of the page
  • PagePID - the page number in the file
  • IAMFID - the file ID of the IAM page that maps this page (this will be NULL for IAM pages themselves as they're not self-referential)
  • IAMPID - the page number in the file of the IAM page that maps this page
  • ObjectID - the ID of the object this page is part of
  • IndexID - the ID of the index this page is part of
  • PartitionNumber - the partition number (as defined by the partitioning scheme for the index) of the partition this page is part of
  • PartitionID - the internal ID of the partition this page is part of
  • PageType - the page type. Some common ones are:
    • 1 - data page
    • 2 - index page
    • 3 and 4 - text pages
    • 8 - GAM page
    • 9 - SGAM page
    • 10 - IAM page
    • 11 - PFS page
  • IndexLevel - what level the page is at in the index (if at all). Remember that index levels go from 0 at the leaf to N at the root page (except in clustered indexes in SQL Server 2000 and 7.0 - where there's a 0 at the leaf level (data pages) and a 0 at the next level up (first level of index pages))
  • NextPageFID and NextPagePID - the page ID of the next page in the doubly-linked list of pages at this level of the index
  • PrevPageFID and PrevPagePID - the page ID of the previous page in the doubly-linked list of pages at this level of the index

So you can see we've got a single page clustered index with an IAM page. Note that the page IDs returned may differ on your server. Let's look at the data page:

DBCC TRACEON (3604);

GO

DBCC PAGE (pagesplittest, 1, 143, 3);

GO

The traceflag is to make the output of DBCC PAGE go to the console, rather than to the error log. The syntax for DBCC PAGE is:

dbcc page ( {'dbname' | dbid}, filenum, pagenum [, printopt={0|1|2|3} ])

The filenum and pagenum parameters are taken from the page IDs that come from various system tables and appear in DBCC or other system error messages. A page ID of, say, (1:143) has filenum = 1 and pagenum = 143.

The printopt parameter has the following meanings:

  • 0 - print just the page header
  • 1 - page header plus per-row hex dumps and a dump of the page slot array (unless its a page that doesn't have one, like allocation bitmaps)
  • 2 - page header plus whole page hex dump
  • 3 - page header plus detailed per-row interpretation

The per-row interpretation works for all page types, including the allocation bitmaps. In our case, we asked for a detailed output. I'll explain the various parts of the output in a post about the anatomy of a page. Here's the output from DBCC PAGE, with a bunch of the repeated per-row info removed for brevity:

PAGE: (1:143)


BUFFER:


BUF @0x02C49720

bpage = 0x05400000                   bhash = 0x00000000                   bpageno = (1:143)
bdbid = 8                            breferences = 0                      bUse1 = 22163
bstat = 0xc0010b                     blog = 0x32159bb                     bnext = 0x00000000

PAGE HEADER:


Page @0x05400000

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 68     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042384384                                
Metadata: PartitionId = 72057594038386688                                 Metadata: IndexId = 1
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 8                          m_slotCnt = 8                        m_freeCnt = 744
m_freeData = 7432                    m_reservedCnt = 0                    m_lsn = (18:113:2)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 0                      

Allocation Status

GAM (1:2) = ALLOCATED                SGAM (1:3) = ALLOCATED              
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                         DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED           

Slot 0 Offset 0x60 Length 917

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C24C060

00000000:   30000800 01000000 0300f802 00110095 †0...............        
00000010:   03616161 61616161 61616161 61616161 †.aaaaaaaaaaaaaaa        

<snip> I've removed this section to save space

00000380:   61616161 61616161 61616161 61616161 †aaaaaaaaaaaaaaaa        
00000390:   61616161 61††††††††††††††††††††††††††aaaaa                   
UNIQUIFIER = [NULL]                 

Slot 0 Column 1 Offset 0x4 Length 4

c1 = 1                              

Slot 0 Column 2 Offset 0x11 Length 900

c2 = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
a                                   

Slot 1 Offset 0x3f5 Length 917

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C24C3F5

<snip> And again...

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

As you can see from the output, each row is 917 bytes long and there's only 744 bytes free (look at the m_freecnt value in the PAGE HEADER section). This means that we can't insert another row on that page of the same length - there just isn't space - but that's what we're going to do! Remember that this page currently has nine rows on it. Let's force a page split:

BEGIN TRAN;

GO

INSERT INTO t1 VALUES (5, REPLICATE ('a', 900));

GO

Now we know there wasn't enough room so the page must have split. Let's check DBCC IND again to see if another page was allocated to the index - here's the output:

SEQA41.jpg

Two pages have been added - an index page and another data page. Before we added the extra row and caused the page split, the index only needed one page. Now that there are two data pages, there needs to be an index page to allow searches through the index b-tree. Let's take a look at the two data pages to see which rows are stored on which page. Doing DBCC PAGE on them shows that page (1:143) has 5 rows, with c1 values 1 through 5, and page (1:154) has the 4 rows with c1 values 6 through 9 (I'm not going to post all the DBCC PAGE output - that would make the post way too long and it gives you an incentive to try the commands out). This is what we'd expect, as the page split occurs at the insertion point, and the row being inserted is put onto the page that split.

Now let's rollback the user transaction and see what happens:

ROLLBACK TRAN;

GO

Running the DBCC PAGE commands again shows that the index structure remains the same as after the split. Page (1:154) has the 4 rows on it from the split and page (1:143) has the other rows but not the one we inserted in the explicit transaction.

So, proof that a page split is never rolled back. I'll be making much more use of these two DBCC commands in future posts and I'll do the page anatomy one later this week. Let me know if there's anything in particular you'd like to see described in this series.

Sunday, September 30, 2007 5:53:53 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

This week I'm going to post a bunch of info on the basic structures used to store data and track allocations in SQL Server. A bunch of this was posted back when I started blogging at TechEd 2006 but I want to consolidate/clarify info and add more about using DBCC PAGE to examine the various structures.

So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that...

Data records

  • Data records are stored on data pages.
  • Data records store rows from a heap or the leaf level of a clustered index.
  • A data record always stores all columns from a table row - either by-value or by-reference.
    • If any columns are for LOB data types (text, ntext, image, and the new LOB types in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), XML), then there's a pointer stored in the data record which points to a text record on a different page (the root of a loose tree that stores the LOB value). Exceptions to this are when the schema has been set to store LOB columns 'in-row' when possible. This is when a LOB value is small enough to fit within the size limits of a data record. This is a performance benefit as selecting the LOB column does not require an extra IO to read the text record.
    • In SQL Server 2005, non-LOB variable length columns (e.g. varchar, sqlvariant) may also be stored 'off-row' as part of the row-overflow feature of having table rows longer than 8060 bytes. In this case the storage format is the same as for LOB values - a pointer in the data record pointing to a text record.
  • There is a difference to how the columns are laid out between heaps and clustered indexes - I'll cover that in a later post.

Forwarded/Forwarding records

  • These are technically data records and are only present in a heap.
  • A forwarded record is a data record in a heap that was updated and was too large to fit in-place on its original page and so has been moved to another page. It contains a back-pointer to the forwarding record.
  • A forwarding record is left in its place and points to the new location of the record. It's sometimes known as a forwarding-stub, as all it contains is the location of the real data record.
  • This is done to avoid having to update any non-clustered index records that point back directly to the original physical location of the record.
  • Although this optimizes non-clustered index maintenance during updates, it can cause additional IOs during SELECTs. This is because the non-clustered index record points to the old location of the index, so an extra IO might be needed to read the real location of the data row. This is fuel for the heap vs clustered index debate, in favor of clustered indexes.

Index records

  • Index records are stored on index pages.
  • There are two types of index records (which differ only in what columns they store):
    1. Those that store non-clustered index rows at the leaf level of a non-clustered index
    2. Those that comprise the b-tree that make up clustered and non-clustered indexes (i.e. in index pages above the leaf level of a clustered or non-clustered index)
  • I'll explain more about the differences between these in a later post as it can be quite complicated (especially the differences between SQL Server 2000 and 2005) and is worth doing in separate posts.
  • Index records typically do not contain all the column values in a table (although some do - called covering indexes).
  • In SQL Server 2005, non-clustered index records can include LOB values as included columns (with the storage details exactly the same as for data records) and also can have row-overflow data that is pushed off-row (again, in exactly the same way as for data records).

Text records

  • Text records are stored on text pages.
  • There are various types of text records that comprise the tree structure that stores LOB values, stored on two types of text page. I'll explain how they work and are linked together in a future post.
  • They are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.

Ghost records

  • These are records that have been logically deleted but not physically deleted from a page. The reasons for this are complicated, but basically having ghost records simplfies key-range locking and transaction rollback.
  • The record is marked with a bit that indicates it's a ghost record and cannot be physically deleted until the transaction that caused it to be ghosted commits. Once this is done, it is deleted by an asynchronous background proces (called the ghost-cleanup task) or it is converted back to a real record by an insert of a record with the exact same set of keys.

Other record types

  • There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and database boot page). Again, I'll go into these in later posts (there's a big queue of posts building up :-))

Record structure

All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.

The record structure is as follows:

  • record header
    • 4 bytes long
    • two bytes of record metadata (record type)
    • two bytes pointing forward in the record to the NULL bitmap
  • fixed length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
  • NULL bitmap
    • two bytes for count of columns in the record
    • variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
    • this allows an optimization when reading columns that are NULL
  • variable-length column offset array
    • two bytes for the count of variable-length columns
    • two bytes per variable length column, giving the offset to the end of the column value
  • versioning tag
    • this is in SQL Server 2005 only and is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb

NULL bitmap optimization

So why is the NULL bitmap an optimization?

Firstly, having a null bitmap removes the need for storing special 'NULL' values for fixed-length datatypes. Without the null bitmap, how can you tell whether a column is NULL? For fixed-length columns you'd need to define a special 'NULL' value, which limits the effective range of the datatype being stored. For varchar columns, the value could be a zero-length empty string, so just checking the length doesn't work - you'd need the special value again. For all other variable-length data types you can just check the length. So, we nede the NULL bitmap.

Secondly, it saves CPU cycles. If there was no NULL bitmap, then there are extra instructions executed for fixed-length and variable-length columns.

For fixed-length:

  1. read in the stored column value (possibly taking a cpu data cache miss)
  2. load the pre-defined NULL value for that datatype (possibly taking a cpu data cache miss, but only for the first read in the case of a multiple row select)
  3. do a comparison between the two values

For variable-length:

  1. calculate the offset of the variable length array
  2. read the number of variable length columns (possibly taking a cpu data cache miss)
  3. calculate the position in the variable length offset array to read
  4. read the column offset from it (possibly taking a cpu data cache miss)
  5. read the next one too (possibly taking another cpu data cache miss, if the offset in step 4 was on the boundary of a cache line size)
  6. compare them to see if they're the same

But with a NULL bitmap, all you have to do is:

  1. read the NULL bitmap offset (possibly taking a cpu data cache miss)
  2. calculate the additional offset of the NULL bit you want to read
  3. read it (possibly taking a cpu data cache miss)

So, its about even for a lookup of a single fixed-length column, but for variable-length columns, and for multiple row selects, there's a clear advantage to having the NULL bitmap.

Using DBCC IND and DBCC PAGE to examine a row in detail

Let's create an example table to look at:

USE MASTER;

GO

IF DATABASEPROPERTY (N'recordanatomy', 'Version') > 0 DROP DATABASE recordanatomy;

GO

CREATE DATABASE recordanatomy;

GO

USE recordanatomy;

GO

CREATE TABLE example (

destination VARCHAR(100),

activity VARCHAR(100),

duration INT);

GO

INSERT INTO example VALUES ('Banff', 'sightseeing', 5);

INSERT INTO example VALUES ('Chicago', 'sailing', 4);

GO

And we can use DBCC IND again to find the page to look at:

DBCC IND ('recordanatomy', 'example', 1);

GO

The output tells us the data page is (1:143) so we can dump it with DBCC PAGE, using option 3 to get a fully interpreted dump of each record.

DBCC TRACEON (3604);

GO

DBCC PAGE ('recordanatomy', 1, 143, 3);

GO

Remember we need the trace-flag to make the DBCC PAGE output go to the console instead of the error log. The output will contain something like the following:

Slot 0 Offset 0x60 Length 33

Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS

Memory Dump @0x5C76C060

00000000:   30000800 05000000 0300f802 00160021 †0..............!        
00000010:   0042616e 66667369 67687473 6565696e †.Banffsightseein        
00000020:   67†††††††††††††††††††††††††††††††††††g                       

Slot 0 Column 0 Offset 0x11 Length 5

destination = Banff                 

Slot 0 Column 1 Offset 0x16 Length 11

activity = sightseeing              

Slot 0 Column 2 Offset 0x4 Length 4

duration = 5                        

Let's use the record structure I listed above to go through this record and see how things are stored.

  • Byte 0 is the TagA byte of the record metadata.
    • Its 0x30, which corresponds to 0x10 (bit 4) and 0x20 (bit 5). Bit 4 means the record has a NULL bitmap and bit 5 means the record has variable length columns. If 0x40 (bit 6) was also set, that would indicate that the record has a versioning tag. If 0x80 (bit 7) was also set, that would indicate that byte 1 has a value in it.
    • Bits 1-3 of byte 0 give the record type. The possible values are:
      • 0 = primary record. A data record in a heap that hasn't been forwarded or a data record at the leaf level of a clustered index.
      • 1 = forwarded record
      • 2 = forwarding record
      • 3 = index record
      • 4 = blob fragment
      • 5 = ghost index record
      • 6 = ghost data record
      • 7 = ghost version record. A special 15-byte record containing a single byte record header plus a 14-byte versioning tag that is used in some circumstances (like ghosting a versioned blob record)
    • In our example, none of these bits are set which means the record is a primary record. If the record was an index record, byte 0 would have the value 0x36. Remember that the record type starts on bit 1, not bit 0, and so the record type value from the enumeration above needs to be shifted left a bit (multiplied by two) to get its value in the byte.
  • Byte 1 is the TagB byte of the record metadata. It can either be 0x00 or 0x01. If it is 0x01, that means the record type is ghost forwarded record. In this case its 0x00, which is what we expect given the TagA byte value.
  • Bytes 2 and 3 are the offset of the NULL bitmap in the record. This is 0x0008 (DBCC PAGE presents multi-byte values in hex dumps as least-significant byte first). This means that there's a 4-byte fixed length portion of the record starting at byte 4. We expect this because we know the table schema.
  • Bytes 4 to 7 are the fixed length portion. Again, because we know the table schema, we know to interpret these bytes as a 4-byte integer. Without that knowledge, you'd have to guess. The value therefore is 0x00000005, which is what we'd expect to see as the value of the duration column.
  • Bytes 8 and 9 are the count of columns in the record. This is 0x0003 which is correct. Given that there are only 3 columns, the NULL bitmap of one bit per column will fit in a single byte.
  • Byte 10 is the NULL bitmap. The value is 0xF8. We need to convert it to binary to make sense of the value. 0xF8 = 11111000. This makes sense - bits 0-2 represent columns 1-3 and they're all 0, meaning the columns aren't NULL. Bits 3-7 represent non-existent columns and they're set to 1 for clarity.
  • Bytes 11 and 12 are the count of variable length columns in the record. That value is 0x0002, which we again know to be correct. This means there will be two two-byte entries in the variable length column offset array. These will be bytes 13-14 and 15-16, having values of 0x0016 and 0x0021 respectively. Remember that NULL bitmap entries point to the end of the column value - this is done so that we know how long each column is without having to store their length as well.
  • So, the final offset is bytes 15 and 16, which means the offset of the start of the first variable length column must be byte 17 (or 0x11 in hex), which agrees with the DBCC PAGE dump. The offset of the end of the first variable length column is 0x0016, so the first value is from byte 17 to byte 21 inclusive. This value is 0x42616E6666. We know from the table metadata that this is the first varchar column, destination. Converting to ASCII gives us the column value 'Banff'. Using similar logic, the second value is from byte 22 to byte 32 inclusive and has the value 'sightseeing'. Both of these match the data we're expecting.

And that's it.

Some of the features of SQL Server 2008 will introduce changes to the record structure - more on those when the features are available in CTPs.

Sunday, September 30, 2007 8:57:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Friday, September 28, 2007

This was a question from the MSDN Disaster Recovery forum I started while I was at Microsoft.

I have a 600 gig database that has a mirror.  I need to move the databases from local drives to a SAN.  Can anyone recommend a document that lists the steps to go through to move both the principle and mirror to the SAN with no down time? or minimal down time?

As far as I know, there isn't any such document so I had a crack at coming up with a list of operations. Here's what I had:

  1. Take a full backup of the principal on node A
  2. Restore it on the SAN on node B using WITH NORECOVERY, remembering to use WITH MOVE to place the files correctly, and with a different database name than the current mirror
  3. Take the required log backup on the principal and restore on the database copy on the SAN on node B
  4. Break the mirroring partnership
  5. Drop the current mirror database on node B
  6. Rename the database on the SAN on node B to be the mirror database -- THIS DOESN"T WORK!
  7. Setup the mirroring partnership to point to the newly restored database on the SAN on node B
  8. Start mirroring and the new mirror will catch-up
  9. Failover to the mirror on node B, which becomes the new principal
  10. Follow the same procedure to move the new mirror on node A onto its SAN
  11. Failback if you want to

And I promised to try it out to make sure I had it right so in this blog post I'm going to walk through the steps of doing this. It turns out that the steps above are slightly incorrect. Step 6 above doesn't work because the database is in recovery (so is inaccessible) and there's a short-cut when moving the database on the first node to avoid having to take and copy more backups. Let's see how it works and I'll post the corrected sequence at the end.

As I did in yesterday's mirroring post, I'm going to use the TicketSalesDB database from our Always-On DVDs. It's only a few hundred MB instead of 600GB but the principal is the same (no pun intended :-)). I've got mirroring running between two nodes, SQLDEV01 (the principal) and SQLDEV02 (the mirror), both of which are running 2005 SP2 and I've got a simulated workload inserting rows into the database. I don't actually have a SAN laying around so I'm cheating and I have directories called C:\SQLDEV01SAN and C:\SQLDEV02SAN instead. It's the location change that's the interesting part, not where the actual location is.

Step 1

On SQLDEV01, take a full backup and a log backup:

BACKUP DATABASE TicketSalesDB TO DISK = 'C:\SQLskills\TicketSalesDB.BAK' WITH INIT;

GO

BACKUP LOG TicketSalesDB TO DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH INIT;

GO

Step 2

On SQLDEV01, break the mirroring partnership:

ALTER DATABASE TicketSalesDB SET PARTNER OFF;

GO

And just check that it's gone:

SELECT mirroring_state_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

which returns:

NULL

Step 3

On SQLDEV02, drop the mirror database - this wouldn't work unless mirroring was no longer running:

DROP DATABASE TicketSalesDB;

GO

Step 4

Copy the backups to SQLDEV02 and restore them on the SAN and remembering to use WITH NORECOVERY:

RESTORE DATABASE TicketSalesDB FROM DISK='C:\SQLskills\TicketSalesDB.bak'

WITH

MOVE 'TicketSalesDBData' TO 'C:\SQLDEV02SAN\TicketSalesDBData.MDF',

MOVE 'TicketSalesFG2005Q1' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q1.NDF',

MOVE 'TicketSalesFG2005Q2' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q2.NDF',

MOVE 'TicketSalesFG2005Q3' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q3.NDF',

MOVE 'TicketSalesFG2005Q4' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q4.NDF',

MOVE 'TicketSalesDBLog' TO 'C:\SQLDEV02SAN\TicketSalesDBLog.LDF',

NORECOVERY;

GO

RESTORE LOG TicketSalesDB FROM DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH NORECOVERY;

GO

Step 5

On SQLDEV02, set the mirroring partner to be SQLDEV01:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV01:5091';

GO

Step 6

On SQLDEV01, start mirroring:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV02:5092';

GO

And check that it's running:

SELECT mirroring_state_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

This time it returns:

SYNCHRONIZED

Step 7

Now we need to failover so that we can move the database on SQLDEV01 onto its SAN. Before we do that, let's make sure that SQLDEV01 is the principal:

SELECT mirroring_role_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

which returns:

PRINCIPAL

Now force the failover:

ALTER DATABASE TicketSalesDB SET PARTNER FAILOVER;

GO

And query the DMV again to make sure. This time the mirroring_state_desc returned is:

MIRROR

Excellent!

Now, I did all of this while my workload was running and it automatically failed over to SQLDEV02, with the database now hosted on the SAN. To do the same move on SQLDEV01, we don't need to go through the backup and copy process again - we can just use the original backups we took in step 1.

Step 8

We need to break the mirroring partnership again, this time executing on SQLDEV02, the new principal:

ALTER DATABASE TicketSalesDB SET PARTNER OFF;

GO

On SQLDEV01, we can now drop the database and restore the original backups onto the SAN:

DROP DATABASE TicketSalesDB;

GO

RESTORE DATABASE TicketSalesDB FROM DISK='C:\SQLskills\TicketSalesDB.bak'

WITH

MOVE 'TicketSalesDBData' TO 'C:\SQLDEV01SAN\TicketSalesDBData.MDF',

MOVE 'TicketSalesFG2005Q1' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q1.NDF',

MOVE 'TicketSalesFG2005Q2' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q2.NDF',

MOVE 'TicketSalesFG2005Q3' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q3.NDF',

MOVE 'TicketSalesFG2005Q4' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q4.NDF',

MOVE 'TicketSalesDBLog' TO 'C:\SQLDEV01SAN\TicketSalesDBLog.LDF',

NORECOVERY;

GO

RESTORE LOG TicketSalesDB FROM DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH NORECOVERY;

GO

And setup mirroring again. On SQLDEV01:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV02:5092';

GO

And on SQLDEV02:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV01:5091';

GO

And we're running again.

Step 9

Now all we need to do is fail the workload back to SQLDEV01 by executing this on SQLDEV02:

ALTER DATABASE TicketSalesDB SET PARTNER FAILOVER;

GO

Summary

So - the corrected sequence for moving a database while mirroring is running is the following:

  1. Take a full backup of the principal database on node A, and the required log backup
  2. Break the mirroring partnership
  3. Drop the current mirror database on node B
  4. Copy the backups to node B and restore it on the SAN on node B using WITH NORECOVERY, remembering to use WITH MOVE to place the files correctly
  5. Setup the mirroring partnership to point to the newly restored database on the SAN on node B
  6. Start mirroring and the new mirror will catch-up
  7. Failover to the mirror on node B, which becomes the new principal
  8. Follow the same procedure to move the new mirror on node A onto its SAN, but using the original backups from step 1
  9. Failback

Hope this helps.

Friday, September 28, 2007 7:27:44 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Thursday, September 27, 2007

One of the hottest features in SQL Server 2005 is database mirroring, and it's helped many companies implement successful and relatively inexpensive high-availability strategies. In SQL Server 2008, Database Mirroring has been enhanced in several ways - one of which is the ability to automatically repair corrupt pages!

This feature is based on the fact that the principal and mirror databases are exactly the same. So, if a page becomes corrupt on the principal, SQL Server should be able to read the page from the mirror and use it to fix the principal. Similarly, if a page becomes corrupt on the mirror, the page can be read from the principal to fix up the mirror. Pretty cool, eh?

Details

  • The feature works for pages that have 824 errors, 823 errors where the OS returns a CRC error while reading the page (to prevent resource issues triggering a page repair), and pages that have 829 errors (where the page is marked as restore pending). See my previous post here for more details on page errors.
  • Pages are fixed asynchronously.
    • If the page is corrupt in the principal, the query that hit the corrupt page will fail. Once discovered, a page is marked as being 829 until its fixed. This prevents an issue where a transient disk error could allow a subsequent update to change the page after it's been queued for being repaired, and then the page is overwritten with a copy from the mirror, losing the update. Nasty.
    • If the page is corrupt in the mirror (which is discovered when the page is read as part of the continual recovery of the log), the mirroring session is suspended. The mirror keeps track of all corrupt pages that need to be repaired with copies from the principal. Once all corrupt pages have been repaired, the mirroring session will be resumed automatically. This means that if a page is corrupt in both the mirror and the principal, manual intervention will be required to resolve the issue.
  • The feature is only available in Enterprise Edition. This means that if the principal is on Enterprise Edition and the mirror is on Standard Edition, then corrupt pages on the principal can repaired from the mirror but not the other way around.
  • There is a new DMV - sys.dm_db_mirroring_auto_page_repair - that allows you to track corrupt pages in mirrored databases 
    • It covers all mirrored databases on a server.
    • It provides info on the last 100 pages that were found in any mirrored database, as well as the status of the automatic page repair operation.
    • This DMV isn't yet included in the SQL Server 2008 Books Online available on TechNet but will have the following info:
      • Database ID the page is in
      • The Page ID, split into file and page-in-file
      • The error type - distinguishing between 823 errors, torn-page errors, page checksum failures, and all-other-824 errors
      • The status of the page repair operation
      • The time that the status was last updated
  • If a page repair fails for any reason (e.g. the mirroring partner couldn't supply the page) then the repair will be marked as failed. If the page is then hit again (by a query on the principal or a recovery operation on the mirror) then it will be re-queued for repair.
  • Not all pages can be repaired - the file header page, database boot page, and allocation bitmap pages (GAM, SGAM, PFS) cannot be repaired this way.

In Action

I've got a system with a few SQL Server 2008 instances running so I decided to give it a try. Here's what I did, using the TicketSalesDB from the Always-On DVDs and labs that SQLskills produces:

  • Changed the database to use page checksums
  • Rebuilt the clustered index of one of the tables (so the pages have page checksums on)
  • Took the initial backup of the database that's needed for mirroring (so that it has no corruptions in)
  • Used DBCC IND to find a page in the leaf level of the clustered index and corrupted the page
  • Setup mirroring (but using the backup I took before introducing the corruption - so the mirror database will have a clean copy of the page I corrupted)
  • Performed a query on the table with the corrupt page to force the page checksum failure and kick-off automatic page repair

And it worked! Going back into the instance with the principal database and querying the DMV gives:

C:\>sqlcmd /E /S.\KATMAI01
1> SELECT * FROM sys.dm_db_mirroring_auto_page_repair;
2> GO
database_id file_id     page_id              error_type page_status modification_time
----------- ----------- -------------------- ---------- ----------- -----------------------
          6           4                 4256         -1           5 2007-09-27 17:23:20.067

(1 rows affected)
1>

The page I corrupted was (4:4256) and page_status of 5 means the repair succeeded. Running the query again confirms that the corruption has been fixed. The page was also logged in the suspect_pages table in msdb:

1> SELECT * FROM msdb..suspect_pages;
2> GO
database_id file_id     page_id              event_type  error_count last_update_date
----------- ----------- -------------------- ----------- ----------- -----------------------
          6           4                 4256           5           1 2007-09-27 17:23:20.407

(1 rows affected)
1>

In Books Online (both 2005 and 2008), event_type of 5 means that the page was repaired.

I checked the SQL Server error log for the principal and this is what I found (the bolding is mine):

2007-09-27 17:17:10.41 spid25s     Database mirroring is active with database 'TicketSalesDB' as the principal copy. This is an informational message only. No user action is required.
2007-09-27 17:23:19.92 spid51      Error: 824, Severity: 24, State: 2.
2007-09-27 17:23:19.92 spid51      SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x55684fbe; actual: 0x16e84fbe). It occurred during a read of page (4:4256) in database ID 6 at offset 0x00000002140000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.6\MSSQL\Data\TicketSalesFG2005Q1.NDF'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2007-09-27 17:23:19.96 spid26s     Database mirroring is attempting to repair physical page (4:4256) in database "TicketSalesDB" by requesting a copy from the partner.
2007-09-27 17:23:20.42 spid26s     Database mirroring successfully repaired physical page (4:4256) in database "TicketSalesDB" by obtaining a copy from the partner.

I also checked the error log for the mirror and there's nothing relevant in there. I guess the same info would be output to the mirror database's error log if the mirror became corrupted. It's reasonably simple to check this using similar steps as above, but corrupting the database before the initial backup is taken, fixing the corruption again after the backup and before mirroring starts (so the corruption is on the mirror but not the principal), and then triggering an update on the corrupt page. When the update is replayed on the mirror, the corruption will be hit and the page repaired in the same way. I'll check later and blog if there's any difference.

Summary

SQL Server 2008 introduces an enhancement to database mirroring that can bi-directionally pull pages between the principal and mirror databases to fix page corruptions. One word of caution I'd give is that this feature doesn't mean you can ignore these errors when they occur - you still need to do root-cause analysis on the corruption and take steps to prevent them happening again before a corruption occurs that automatic page repair cannot fix (as I mentioned above).

Nevertheless, this is a tremendously useful feature that's going to save a lot of downtime. Cool!

Thursday, September 27, 2007 2:53:30 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Sunday, September 23, 2007

One of the great things about the blog engine we use is that it shows all the search engine queries that led to someone clicking through to the site. I've been looking through the logs to see what kind of problems people are having that end up here. In this occasional series, I'm going to pick a search engine query and blog about it - hopefully helping out people who have the problem in future.

First up is running out of transaction log space. This happens when the transaction log fills up to the point where it has to grow but either autogrow is not turned on or the volume on which the transaction log is placed has no more space for the file to grow. So what causes the transaction log to fill up in the first place? It could be a number of different things. The two most common ones I've seen are:

  • The database is in full recovery mode with normal processing, a full database backup has been taken but no log backups have been taken. This commonly happens when an application goes into production and someone decides to take a backup, without realizing the consequences. Log truncation is not automatic in such a case - a log backup is required to allow the log to truncate.
  • The database is in any recovery mode but there's a very long-running, uncommitted transaction that prevents log truncation. (Even in full recovery mode with regular log backups, all the log records for the long-running transaction are required in case it rolls-back - and SQL Server can't selectively truncate log records from the log for some transactions but not others.)

Let's have a look at both of these in SQL Server 2005 and example solutions.

The log backup case...

First thing to do is create a database with a very small transaction log that we can easily fill up.

USE master;
GO
IF DATABASEPROPERTY (N'tinylogtest', 'Version') > 0 DROP DATABASE tinylogtest;
GO

CREATE DATABASE tinylogtest ON (
   
NAME = fgt_mdf,
   
FILENAME = 'c:\tinylogtest\tinylogtest.mdf',
   
SIZE = 2MB)
LOG ON (
   
NAME = fgt_log,
   
FILENAME = 'c:\tinylogtest\tinylogtest.ldf',
   
SIZE