Monday, November 19, 2007

With the class we taught yesterday in Zurich for Microsoft, Kimberly and I have just finished almost three weeks of continuous travelling and presenting at conferences. Now we're taking some time off to relax and recharge. My blog will be silent until Monday 26th November, when I'll start posting some cool articles with example scripts showing how to recover from various disasters. Thanks to everyone who's responded to the last few weeks worth of posts, and to those who've sent in questions regarding discussions we had at the various conferences. I will definitely reply to each of them, but it may not be until next week.

If you're in the US, have a great Thanksgiving! See you soon...

Monday, November 19, 2007 10:36:15 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Sunday, November 18, 2007

This is a question that comes up a lot - how to run consistency checks on a VLDB?

We're talking hundreds of GBs or 1 TB or more. These databases are now common on SQL Server 2000 and 2005 - at TechEd IT Forum this week (and at SQL Connections the week before) there are many customers with multi-TB databases. Any experienced DBA knows the value of running consistency checks, even when the system is behaving perfectly and the hardware is rock-solid. The two problems that people have with running a full CHECKDB on their VLDB are:

  • It takes a long time to run (based on many factors – see my previous post here for details).
  • It uses lots of resources – memory, CPU, IO bandwidth, tempdb space.

So it uses lots of resources for a long time. Even with a decent sized maintenance window, the CHECKDB may run over into normal operations. There's also the case of a system that's already pegged in more or more resource dimensions. Whatever the case, there are a number of options:

  • Don't run consistency checks
  • Run CHECKDB using the WITH PHYSICAL_ONLY option
  • Use SQL Server 2005's partitioning feature and devise a consistency checking plan around that
  • Figure out your own scheme to divide up the consistency checking work over several days
  • Offload the consistency checks to a separate system

Let's look at each in turn.

Don't run consistency checks

A lot of people end up doing this as they can’t figure out how to run consistency checks on their DB. Don't even think about using this option – there’s always a way to do it. If you absolutely cannot figure out a way to get consistency checks on your system, send me email and I'll help you – I’ve helped many customers with all the schemes below, both in and out of Microsoft.

Use WITH PHYSICAL_ONLY

A full CHECKDB does a lot of stuff - see previous posts in this series for more details. You can vastly reduce the run-time and resource usage of CHECKDB by using the WITH PHYSICAL_ONLY option. With this option, CHECKDB will:

  • Run the equivalent of DBCC CHECKALLOC (i.e. check all the allocation structures)
  • Read and audit every allocated page in the database

So it skips all the logical checks, inter-page checks, and things like DBCC CHECKCATALOG. The fact that all allocated pages are read means that:

  • Any pages that cannot be read at all (i.e. 823 errors) will be discovered
  • If page checksums are enabled in SQL Server 2005, any corruptions caused by the IO subsystem will be discovered as the page checksum will be checked as part of reading the page into the buffer pool

So there's a trade-off of consistency checking depth against run-time and resource usage - but this option will pick up problems caused by the IO subsystem as long as page checksums are enabled and present.

Use the SQL Server 2005 partitioning feature

One of the obvious ways to reduce the time/resources issue is to partition the load. If you're using the partitioning feature in SQL Server 2005 then you're already setup for this. Given that you've hopefully got your partitions stored on separate filegroups, you can use the DBCC CHECKFILEGROUP command.

Consider this example - you have the database partitioned by date such that the current month is on a read-write filegroup and the past 11 months are on 11 read-only filegroups (data from more than a year ago is on some offline storage medium). The prior months also have multiple backups on various media so are considered much 'safer' than the current month. It makes sense that you don't need to check the read-only filegroups as often as the current month's filegroup so an example consistency checking scheme could be:

  • Run a DBCC CHECKFILEGROUP on each read-only filegroup every week or two
  • Run a DBCC CHECKFILEGROUP on the read-write filegroup every day or two (depending on the stability of the hardware, the criticality of the data, and the frequency and comprehensiveness of your backup strategy).

I know of several companies who've made the decision to move to SQL Server 2005 in part because of this capability to easily divide up the consistency checking.

Beware that until SP2 of SQL Server 2005, DBCC CHECKFILEGROUP would not check a table at all if it was split over multiple filegroups. This is now fixed and DBCC CHECKFILEGROUP will check partitions on the specified filegroup even if the table is now completely contained on the filegroup.

Figure out your own way to partition the checks

If you're on SQL Server 2000, or you just haven't partitioned your database on SQL Server 2005, then there are ways you can split up the consistency checking workload so that it fits within a maintenance window. You basically need to simulate what CHECKDB does in a staggered manner. Here's one scheme that I've recommended to several customers:

  • Figure out your largest tables (by number of pages) and split the total number into 7 buckets, such that there are a roughly equal number of database pages in each bucket.
  • Take all the remaining tables in the database and divide them equally between the 7 buckets (using number of pages again)
  • On Sunday:
    • Run a DBCC CHECKALLOC
    • Run a DBCC CHECKCATALOG
    • Run a DBCC CHECKTABLE on each table in the first bucket
  • On Monday, Tuesday, Wednesday:
    • Run a DBCC CHECKTABLE on each table in the 2nd, 3rd, 4th buckets, respectively
  • On Thursday:
    • Run a DBCC CHECKALLOC
    • Run a DBCC CHECKTABLE on each table in the 5th bucket
  • On Friday and Saturday:
    • Run a DBCC CHECKTABLE on each table in the 6th and 7th buckets, respectively

In pre-RTM builds of SQL Server 2005, DBCC CHECKTABLE could not bind to the critical system tables, just like with T-SQL - but that's fixed so you can cover all system tables in SQL Server 2000 and 2005 using the method above.

There's one drawback to this method - a new internal database snapshot is created each time you start a new DBCC command, even for a DBCC CHECKTABLE. If the update workload on the database is significant, then there could be a lot of transaction log to recover each time the database snapshot is created - leading to a long total run-time. In this case, you may need to alter the number of buckets you use to make the total operation fit within your available window.

Use a separate system

This alternative is relatively simple - restore your backup (you are taking regular backups, right?) on another system and run a full CHECKDB on the restored database. This offloads the consistency checking burden from the production system and also allows you to check that your backups are valid. There are some drawbacks to this however:

  • You need to have sufficient disk space on the spare system to be able to restore the backup onto. If the production database is several TB, you need the same several TB on the spare box. This equates to a non-trivial amount of money - initial capital investment plus ongoing storage management costs. (Hopefully a future release will alleviate this – while at Microsoft I invented and patented a mechanism for consistency checking a database in a backup without restoring it.)
  • If the consistency checks find an error, you don't know for sure that the database is corrupt on the production system. It could be a problem with the spare box that's caused the corruption. The only way to know for sure is to run a consistency check on the production system. This is a small price to pay though, because most of the time the consistency checks on the spare system will be ok, so you know the production database was clean at the time the backup was taken.

Summary

You've got a bunch of choices to allow you to run consistency checks, so there's really no excuse for not knowing (within a reasonable timeframe) that something's gone wrong with your database.

Sunday, November 18, 2007 8:11:20 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

It's been a very tough couple of weeks for us with back-to-back conferences in Las Vegas and Barcelona. Now we're in Zurich for a few days before heading back to Redmond for the rest of the year. TechEd IT Forum was probably the most tiring conference I've ever done - we did 12 sessions between us in four days, co-presenting 10 of them, and with 5 back-to-back on Thursday. So Friday night was party-time, first with the traditional Speaker Dinner and then a bunch of us went out dancing until 3am. Sore heads abounded the next morning as we all headed to the airport for flights - I'm glad Kimberly and I only had a short hop to Zurich rather than our long flights back to the West Coast.

Here's a shot Kimberly took of me, Mark Russinovich, and John Craddock headed towards the nightclub. Look forward to seeing you in Barcelona again next year!

Sunday, November 18, 2007 7:05:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

On the last day of SQL Connections a couple of weeks back we did a 20 minute TV interview with Steve Wynkoop of SSWUG. Apart from the mandatory mention of our favorite game Blokus, we discussed a bunch of new features coming in SQL Server 2008 while I struggled not to say 'we' instead of 'they' to describe the SQL team. I'm still in recovery I guess...

Check it out at http://www.sswug.org/sswugtv/seeshow.asp?sid=P227.

Sunday, November 18, 2007 6:49:24 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, November 15, 2007

Here's a question that came up yesterday in our chalk-talk on database mirroring at TechEd IT Forum that Kimberly and I talked about this morning (here in Barcelona).

Q) I have a database mirroring session where the witness and mirror servers are in one physical location, and the principal server is in another. The mirroring session is running synchronously with the witness to allow automatic failover. A disaster happens to the site where the mirror and witness are, so the principal database is unavailable. I can't seem to access the principal at all to bring it back online by removing the witness and the mirror and witness won't be available for hours. What can I do?

A) The behavior you're seeing (the principal database becoming unavailable) is expected. In a mirroring configuration with a witness, the principal needs to have quorum with (i.e. be able to see) at least one of the other partners, either the mirror, the witness, or both. If it can't see either, it doesn't know whether the witness and mirror can still see each other and the mirror may have brought itself online as the new principal. (Kimberly likes to say that the principal thinks the witness and mirror are conspiring against it :-)) In this case though, the customer knows that the mirror and witness are actually down and so he wants to bring the principal database back online.

I repro'd this situation in a VPC with three SQL Server 2008 instances running mirroring between them (the behavior is exactly the same in 2008 and 2005). I did a net stop on the mirror and witness servers and the principal database went offline. Trying to get into the principal database results in the following error:

USE TicketSalesDB;

GO

 

Msg 955, Level 14, State 1, Line 1

Database TicketSalesDB is enabled for Database Mirroring, but the database lacks quorum: the database cannot be opened.  Check the partner and witness connections if configured.

This is what I'd expect. The customer tried to remove the witness so let's try that:

ALTER DATABASE TicketSalesDB SET WITNESS OFF;
GO

Msg 1431, Level 16, State 4, Line 1
Neither the partner nor the witness server instance for database "TicketSalesDB" is available. Reissue the command when at least one of the instances becomes available.

That doesn't work either because removing the witness needs to happen on one of the partners as well as the principal. The only way to get out of this situation is to break the mirroring partnership completely.

ALTER DATABASE TicketSalesDB SET PARTNER OFF;
GO
USE TicketSalesDB;
GO

Command(s) completed successfully.

Thursday, November 15, 2007 7:31:48 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 

In my previous post on interpreting CHECKDB output, plus in my DBCC Internals session at TechEd IT Forum yesterday, I mentioned there are some things that CHECKDB can’t repair. In this post I want to go into a bit more detail – based on a post from my old Storage Engine blog.

Before anyone takes this the wrong way - what do I mean by "can't be repaired"? Remember that that purpose of repair is to make the database structurally consistent, and that to do this usually means deleting the corrupt data/structure (that's why the option to do this was aptly named REPAIR_ALLOW_DATA_LOSS – see this post for more explanation on why repair can be bad). A corruption is deemed unrepairable when it doesn't make sense to repair it given the damage the repair would cause, or the corruption is so rare and so complicated to repair correctly that it's not worth the engineering effort to provide a repair. Remember also that recovery from corruptions should be based on a sound backup strategy, not on running repair, so making this trade-off in functionality makes sense.

Here's a few of the more common unrepairable corruptions that people run into along with the reasons they can't be repaired by DBCC.

PFS page header corruption

An example of this is on SQL Server 2005:

Msg 8946, Level 16, State 12, Line 1
Table error: Allocation page (1:13280496) has invalid PFS_PAGE page header values.
Type is 0. Check type, alloc unit ID and page ID on the page.

CHECKDB uses the PFS pages to determine which pages are allocated - and so which pages to read to drive the various consistency checks. The only repair for a PFS page is to reconstruct it - they can't simply be deleted as they're a fixed part of the fabric of the database. PFS pages cannot be rebuilt because there is no infallible way to determine which pages are allocated or not. There are various algorithms I've experimented with to rebuild them, with optimistic or pessimistic setting of page allocation statuses and then re-running the various consistency checks to try to sort out the incorrect choices, but they all require very long run-times. Given the frequency with which these corruptions are seen, and the engineering effort required to come up with an (imperfect) solution, I made the choice to leave this as unrepairable, and I don’t think that will change in future.

Critical system table clustered-index leaf-page corruption

An example of this is on SQL Server 2000:

Server: Msg 8966, Level 16, State 1, Line 1
Could not read and latch page (1:18645) with latch type SH. sysindexes failed.

And on SQL Server 2005:

Msg 7985, Level 16, State 2, Server SUNART, Line 1
System table pre-checks: Object ID 4. Could not read and latch page (1:51) with
latch type SH. Check statement terminated due to unrepairable error.

In a previous post in the series I described why how and why we do special checks of the clustered indexes of the critical system tables. If any of the pages at the leaf-level of these indexes are corrupt, we cannot repair them. Repairing would mean deallocating the page, wiping out the most important metadata for potentially hundreds of user tables and so effectively deleting all of these tables. That's obviously an unpalatable repair for anyone to allow and so CHECKDB doesn't do it.

Column value corruption

Here's an example of this on SQL Server 2005:

Msg 2570, Level 16, State 3, Line 1
Page (1:152), slot 0 in object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type "In-row data"). Column "c1" value is out of range for data type "datetime".  Update column to a legal value.

This is where a column has a stored value that is outside the valid range for the column type. There are a couple of repairs we could do for this:

  1. delete the entire record
  2. insert a dummy value

#1 isn't very palatable because then data is lost and it’s not a structural problem in the database so doesn't have to be repaired. #2 is dangerous - what value should be chosen as the dummy value? Any value put in may adversely affect business logic, or fire a trigger, or have some unwelcome meaning in the context of the table - even a NULL. Given these problems, I chose to allow people to fix the corrupt values themselves.

Metadata corruption

Here's an example of this on SQL Server 2005:

Msg 3854, Level 16, State 1, Line 2
Attribute (referenced_major_id=2089058478) of row (class=0,object_id=2105058535,column_id=0,referenced_major_id=2089058478,referenced_minor_id=0) in sys.sql_dependencies has a matching row (object_id=2089058478) in sys.objects (type=SN) that is invalid.

This example is relatively benign. There are other examples that will cause CHECKDB to terminate - not as bad as the critical system table corruption example above, but enough that CHECKDB doesn't trust the metadata enough to use it to drive consistency checks. Repairing metadata corruption has the same problems as repairing critical system table corruption - any repair means deleting metadata about one or more tables, and hence deleting the tables themselves. It's far better to leave the corruption unrepaired so that as much data as possible can be extracted from the remaining tables.

Summary

Repair can't fix everything. You may end up having to perform manual and time-consuming data extraction from the corrupt database and losing lots of data because of, say, a critical system table corruption. Bottom line (as usual) - make sure you have valid backups so you don't get into this state!

Thursday, November 15, 2007 6:28:50 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, November 14, 2007

This is a subject I posted about last year on my old blog but it came up at SQL Connections last week several times so I want to repost it for those who’ve just started following my blog.

There's only one time when you should be trying to work out how long a CHECKDB is going to take - when you're planning your regular database maintenance. If you're faced with a corrupt (or suspected corrupt) database and you're only just starting to think about how long a CHECKDB is going to take - you've made a mistake while planning your disaster recovery strategy. You always need to know how long CHECKDB takes (on average) to run for your database so:

  • You can tell whether a particular run of CHECKDB is taking longer than usual - a sign that it's found some corruption
  • You know how long it will take to get results in a disaster recovery situation

At every conference I go to, someone asks me how long CHECKDB will take to run on their database. There are several ways I could answer this:

  • The unhelpful answer - I've got no idea.
  • The almost-helpful answer - how long did it take to run last time and are the conditions exactly the same?
  • The answer I usually give - it depends.

Now, many people would see the third answer as being somewhat equivalent to the first answer - unhelpful. The problem is that there are many factors which influence how long CHECKDB will take to run. Let me explain the ten most important factors so you get an idea why this is actually a helpful answer. These aren't in any particular order of importance.

1) The size of the database

Pretty obvious... CHECKDB has to read every allocated page in the database so the bigger it is, the longer it will take to read all the pages.

2) Concurrent IO load on the server

At the simplest level, what is CHECKDB going to do? It reads every allocated page in the database. That's a lot of IO. CHECKDB takes great pains to do the most efficient IO it can and read the database pages in their physical order with plenty of readahead so that the disk heads move smoothly across the disks (rather than jumping around randomly and incurring disk head seek delays). If there's no concurrent IO load on the server, then the IOs will be as efficient as CHECKDB can make them. However, introducing any additional IO from SQL Server means that the disk heads will be jumping around - slowing down the CHECKDB IOs. If the IO subsystem is at capacity already from CHECKDB's IO demands, any additional IO is going to reduce the IO bandwidth available to CHECKDB - slowing it down.

3) Concurrent CPU activity on the server

At the next level of simplicity, CHECKDB is going to process every page it reads in some way. Depending on the various options you've specified and the database schema (details below), that's going to use a lot of CPU - it's possible that the server may be pegged at 100% CPU when CHECKDB is running. If there's any additional workload on the server, that's going to take CPU cycles away from CHECKDB and it going to slow it down.

Basically what points #2 and #3 are saying is that CHECKDB is very resource intensive! It’s probably one of the most resource intensive things you can ask SQL Server to do and so it's usually a good idea to not run it during peak workload times, as you'll not only cause CHECKDB to take longer to run, you will slowdown the concurrent workload, possibly unacceptably.

4) Concurrent update activity on the database

This is relevant for both SQL 2000 and SQL 2005, but for different reasons.

In SQL 2000, CHECKDB gets its consistent view of the database from transaction log analysis of concurrent DML transactions (see here for details). The more concurrent DML there is while CHECKDB is running, the more transaction log will be generated - and so the longer it will take for CHECKDB to analyze that transaction log. It's possible that on a large multi-CPU box with a ton of concurrent DML and CHECKDB limited to a single CPU that this phase of CHECKDB could take several times longer than the reading and processing of the database pages! (I've seen this in real-life several times.)

In SQL 2005, CHECKDB gets its consistent view of the database from a database snapshot, which is stored on the same disk volumes as the database itself. If there are a lot of changes in the database while CHECKDB is running, the changed pages are pushed to the snapshot so that it remains consistent. As the snapshot files are stored in the same location as the database files, every time a page is pushed to the snapshot, the disk head has to move, which interrupts the efficient IO described in #2. Also, whenever CHECKDB goes to read a page and it needs to read the page from the snapshot files instead of the database files, that's another disk head move, and another efficient IO interruption. The more concurrent changes to the database, the more interruptions to efficient IO and the slower that CHECKDB runs.

5) Throughput capabilities of the IO subsystem

This one's simple. CHECKDB is going to do a boat-load of IOs and it could even end up being IO-bound (meaning that the CPUs are idle periodically waiting for IOs to complete) depending on the options specified and the database schema. This means that the throughput of the IO subsystem is going to have a direct effect on the run-time of CHECKDB. so, if you have a 1TB database and the IO subsystem can only manage 100MB/sec, it's going to take almost 3 hours just to read the database (1TB / 100MB / 3600 secs) and there's nothing you can do to speed that up except upgrade the IO subsystem.

I've lost count of the number of times I've heard customers complain that CHECKDB (or index rebuilds or other IO-heavy operations) are running sloooowly only to find that the disk queue lengths are enormous and the IO subsystem it entirely unmatched to the server and workload.

6) The number of CPUs (processing cores) on the box

This also really encompasses the Edition of SQL Server that's being run. In Enterprise Edition, CHECKDB can run in parallel across all the CPUs in the box (or as many as the query processor decides to parallelize over when the CHECKDB internal queries are compiled). Running in parallel can give a significant performance boost to CHECKDB and lower run times, as long as the database is also spread over multiple files too (so the IOs can be parallelized). There's a nifty algorithm that’s used that allows CHECKDB to run in parallel which I'll explain in detail in a future post.

On the other hand, the fact that CHECKDB can run in parallel in Enterprise Edition can be bad for some scenarios, and so some DBAs chose to force CHECKDB to be single-threaded. SAP typically recommends this to help with user query predictability. The way to do this is to turn on the documented trace flag 2528.

7) The speed of the disks where tempdb is placed

Running CHECKDB against a VLDB uses lots of memory for internal state and for VLDBs the memory requirement usually exceeds the amount of memory available to SQL Server. In this case, the state is spooled out to tempdb and so the performance of tempdb can be a critical factor in CHECKDB performance. See this post for more details of this and how CHECKDB can run out of disk space if tempdb is too small.

8) The complexity of the database schema

This can have a really high impact on the run-time of CHECKDB because it impacts the amount of CPU that CHECKDB requires. For example, the most expensive checks that CHECKDB does are for non-clustered indexes. It needs to check that each row in a non-clustered index maps to exactly one row in the heap or clustered index for the table, and that every heap/clustered index row has exactly one matching row in each non-clustered index. Although there's a highly efficient algorithm for doing this, it still takes around 30% of the total CPU that CHECKDB uses!

There are a bunch of other checks that are only done if the features have been used in the database - e.g. computed column evaluation, links between off-row LOB values, Service Broker, XML indexes, indexed views - so you can see that empirical factors along aren't enough to determine the run-time.

9) Which options are specified

This is almost the same as #7 in that by specifying various options you're limiting what checks CHECKDB actually performs. For instance, using the WITH NOINDEX option will turn off the non-clustered index checks that I described in #7 and using the WITH PHYSICAL_ONLY option will turn off all logical checks, vastly decreasing the run-time of CHECKDB and making it nearly always IO-bound rather than CPU-bound (in fact this is the most common option that DBAs of VLDBs use to make the run-time of CHECKDB manageable).

One thing to be aware of - if you specify any repair options, CHECKDB always runs single-threaded, even on a multi-proc box on Enterprise Edition.

10) The number and type of corruptions that exist in the database

Again, this is similar to #7 and #8. If there are any corruptions present, there may be extra checks triggered to try to figure out more details of the corruptions. For instance, for the non-clustered index checks, the algorithm is tuned very heavily for the case when there are no corruptions present (the overwhelming majority of cases considering the millions of times CHECKDB is run every day around the world). When a non-clustered index corruption is detected, a more in-depth algorithm has to be used to figure out exactly where the corruption is, which involves re-scanning a bunch of data and so taking a bunch more time. There are a few other algorithms like this too.

Summary

So you can see that there's no simple answer.

Wednesday, November 14, 2007 8:22:36 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Tuesday, November 13, 2007

One of the sessions we did yesterday at IT Forum was an Instructor-Led Lab on database snapshots, an Enterprise Edition feature of SQL Server 2005. Database snapshots are not particularly well known in the DBA community and there are many misconceptions about them. The session generated a bunch of questions, some of which I'll answer here.

Q1) Can you change the data in table in a database snapshot?

A1) No.

Q2) Can you change the permissions in a database snapshot?

A2) Nope, sorry.

Q3) Can you backup a database snapshot?

A3) Afraid not.

A4) Can you detach a database snapshot?

A4) Err, nope.

You may have noticed a pattern here :-) Basically, the only things you can do with a database snapshot are select from it, and revert to it - i.e. rollback the entire database on which it is based to the point-in-time at which the database snapshot was created. Database snapshots are not updateable in any way, and at the time I left MS, there were no plans to change that in the future. Regardless of these limitations, database snapshots are very useful. I searched both our blogs for some examples to link to and couldn't find any so I'll put together a post on using them sometime over the next week or so.

The final question I *was* able to answer successfully for a very happy conference attendee:

Q5) Can you create a partial snapshot on a database, say a single filegroup, for reporting?

A5) No - HOWEVER, if you're on Enterprise Edition, you can make use of partial database availability. Hopefully you have your database structured so the primary filegroup has nothing but system tables in, and you have the table you're interested in isolated in a seperate filegroup - let's call it filegroup X. (He answered 'yes' to all three!). If that is that case, backup the primary filegroup and filegroup X. Then restore the primary filegroup followed by filegroup X in a separate location. As long as the primary filegroup of a database is online in SQL Server 2005 Enterprise Edition, then the database is online and any other online filegroups are available - partial database availability. This means you've effectively created a point-in-time snapshot of a single filegroup, albeit at the expense of having to have a full copy of the filegroup.

Tuesday, November 13, 2007 11:32:08 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 

Last week at SQL Connections someone said that CHECKDB’s output is ‘useless’.  Given that I wrote CHECKDB for SQL Server 2005 I was only mildly offended :-) But there’s a lot of truth in that statement – the error messages from CHECKDB do tell you exactly what’s corrupt in the database but they’re not exactly easy to read and understand unless you’re intimately familiar with the on-disk structures used to store a SQL Server database. It’s pretty easy to understand the output if there’s only one or two error messages, but once you get past 10 or so it becomes extremely difficult to figure out what the actual problem is for normal people (i.e. not CHECKDB geeks like me, Ryan or a handful of senior people in Product Support)

So, how are you supposed to figure out what’s wrong? In this post I want to run through a few tips and tricks you can use.

  • If the recommended repair level (at the very bottom of the output) is REPAIR_REBUILD then only non-clustered indexes are damaged. This meanscyou don’t need to run repair (which needs the database in single-user mode) or restore from backups. You can just manually rebuild the damaged indexes to fix the problem, possibly even online if you're running Enterprise Edition and the index doesn't have any LOB columns.
  • There are a couple of cases where even if it’s only non-clustered indexes that are damaged, REPAIR_ALLOW_DATA_LOSS may still be the recommended repair option. In this case, check the index ID in all the error messages – if all index IDs are 2 or higher, then it’s only non-clustered indexes that are damaged and you can rebuild them yourself instead of running repair/restore.
  • Figure out a list of tables that are affected by looking at the summary messages about how many corruptions were found per table. If each of the tables is isolated on a separate filegroup then these filegroups can be taken offline and restored individually, possibly without interrupting the application workload. If there's no backup, then you can run DBCC CHECKTABLE with repair, for a faster repair operation than running a full CHECKDB.
  • Figure out the list of damaged pages. There are a number of error messages that could point to a damaged page – usually 8928 or 2537 – and then you can choose to do single-page restores from your backups to minimize downtime, possibly even online in Enterprise Edition.
  • Are there any errors that CHECKDB can’t repair?  If so, you have no choice but to restore from backups or to extract as much info as possible into a new database. Things to look for that mean repair won’t be able to fix everything are:
    • CHECKDB stops early and complains about system table pre-checks failing (errors 7984 – 7988 inclusive)
    • CHECKDB reports any metadata corruption (8992, 8995 errors)
    • CHECKDB reports any errors on PFS page headers (8939, 8946 errors with a possible 8998 error as well)

While I was at Microsoft I wrote two 200-page+ documents detailing all the CHECKDB error messages for SQL Server 2000 and 2005 – the team says they should all be translated into Books Online entries by next summer – a bunch of them are there already. In the meantime, I hope the hints above will take a little of the mystery out of CHECKDB’s output.

Tuesday, November 13, 2007 11:07:50 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

(Oops - deleted this by accident - re-entering it.)

Day 1 for us here in Barcelona was pretty tiring. After flying in from Seattle on Monday, and waking up with jet-lag on Tuesday at 2am, we did 4 sessions during the day, with 8 more to go by Friday. By the time we got back to the speaker hotel we were both totally beat but we rallied for a nice dinner with our good friend (and last week's top-scoring TechEd speaker - congratulations!) Rafal Lukawiecki. All our sessions were well attended with lots of excellent questions.

In the afternoon gap between sessions 3 and 4 we sat down with the TechEd TV crew for a short interview on a very comfy couch - check it out (and others from the conference) here.

Tuesday, November 13, 2007 6:42:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Monday, November 12, 2007

I’m in the middle of a flight from Washington D.C. to Zurich on the way to Barcelona for TechEd IT Forum and I can’t sleep – Kimberly’s out like a light so what else is there to do except write another blog post? :-) OK - actually posting this from Barcelona on Tuesday before our first of 12 sessions here!

In the Database Maintenance workshop we did at SQL Connections last week I promised to blog a script I used to show how data file shrink operations cause massive fragmentation of indexes. The reason is that data file shrink starts at the end of the data file and moves a single page at a time to a free space below the shrink threshold. In the process of doing so, it perfectly reverses the physical order of the pages comprising the leaf level of an index – thus perfectly fragmenting it!!

Let’s try out my simple script that demonstrates this. First thing I’m going to do is create a new database and create a 10MB ‘filler’ table, which I’m going to then drop later to create a space that shrink can use.

USE MASTER;

GO

 

IF DATABASEPROPERTYEX ('shrinktest', 'Version') > 0

      DROP DATABASE shrinktest;

 

CREATE DATABASE shrinktest;

GO

USE shrinktest;

GO

 

SET NOCOUNT ON;

GO

 

-- Create and fill the filler table

CREATE TABLE filler (c1 INT IDENTITY, c2 VARCHAR(8000))

GO

DECLARE @a INT;

SELECT @a = 1;

WHILE (@a < 1280) -- insert 10MB

BEGIN

      INSERT INTO filler VALUES (REPLICATE ('a', 5000));

      SELECT @a = @a + 1;

END;

GO

Next I’ll create the ‘production’ table that we’d really like to keep in optimal shape for performance.

-- Create and fill the production table

CREATE TABLE production (c1 INT IDENTITY, c2 VARCHAR (8000));

CREATE CLUSTERED INDEX prod_cl ON production (c1);

GO

DECLARE @a INT;

SELECT @a = 1;

WHILE (@a < 1280) -- insert 10MB

BEGIN

      INSERT INTO production VALUES (REPLICATE ('a', 5000));

      SELECT @a = @a + 1;

END;

GO 

Now I’ll use the sys.dm_db_index_physical_stats DMV to check the fragmentation of the production table’s clustered index – it should be almost zero:

-- check the fragmentation of the production table

SELECT avg_fragmentation_in_percent, fragment_count FROM sys.dm_db_index_physical_stats (

      DB_ID ('shrinktest'), OBJECT_ID ('production'), 1, NULL, 'LIMITED');

GO

avg_fragmentation_in_percent fragment_count

---------------------------- --------------------

0.390930414386239            6

This is what I expected. Now I’m going to drop the filler table, run a shrink operation and then check the fragmentation again:

-- drop the filler table and shrink the database

DROP TABLE filler;

GO

 

-- shrink the database

DBCC SHRINKDATABASE (shrinktest);

GO

 

-- check the index fragmentation again

SELECT avg_fragmentation_in_percent, fragment_count FROM sys.dm_db_index_physical_stats (

      DB_ID ('shrinktest'), OBJECT_ID ('production'), 1, NULL, 'LIMITED');

GO

avg_fragmentation_in_percent fragment_count

---------------------------- --------------------

99.7654417513683             1277

Wow! The index went from almost 0% fragmented to almost 100% fragmented – the shrink operation totally reversed the physical ordering of the leaf level of the clustered index – nasty.

One of the common maintenance operations I see at customer sites is to run a database shrink at some interval, and I always advise against it – now you can see why. Running a regular shrink operation can cause horrible fragmentation problems. The worst problems I see are those customers with maintenance plans that rebuild all indexes and then run a shrink to remove the extra space necessary for the index rebuilds – completely undoing the effects of the index rebuild!

One other common thing I see is to have auto-shrink set on for one or databases. This is bad for several reasons:

  • Shrink causes index fragmentation, as I’ve just demonstrated above.
  • You can't control when it kicks in. Although it doesn't have any effect like long-term blocking, it does take up a lot of resources, both IO and CPU. It also moves a lot of data through the buffer pool and so can cause hot pages to be pushed out to disk, slowing things down further. If the server is already pushing the limits of the IO subsystem, running shrink may push it over, causing long disk queue lengths and possibly IO timeouts.
  • You're likely to get into a death-spiral of auto-grow then auto-shrink then auto-grow then auto-shrink... (in my experience, if someone is using auto-shrink, they're most likely using and relying on auto-grow too). An active database usually requires free space for normal operations - so if you take that free space away then the database just has to grow again. This is bad for several reasons:
    • Repeatedly shrinking and growing the data files will cause file-system level fragmentation, which can slow down performance
    • It wastes a huge amount of resources, basically running the shrink algorithm for no reason
    • Auto-grow itself can be bad, especially if you're using SQL Server 2000 (or don't have Instant File Initialization turned on - see this post from Kimberly's blog) where all allocations to the file being grown are blocked while the new portion of the file is being zero-initialized.

Bottom-line: auto-shrink should *NEVER* be turned on…

Monday, November 12, 2007 8:28:41 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Friday, November 09, 2007

This is a question I was asked multiple times over the last week: the Microsoft guidelines for database mirroring say not to mirror more than 10 databases per instance - why is that and is it true?

The answer is my favorite 'it depends!'. The number 10 is a rough guess at the sweet-spot for the majority of customers. The factors that need to be considered are:

  • How much memory do the principal and mirror instances have? (hopefully the same)
  • How much processing-power do the principal and mirror instances have? (hopefully the same)
  • How much bandwidth does the IO subsystem have on the mirror instance? (hopefully the same as on the principal)
  • How much transaction log does the workload on each database generate?
  • How much network bandwidth is available between the principal and the mirror instances?

The last two factors are the most critical. If the network bandwidth available between the two instances is not enough to handle the combined transaction log generation rate per second from all databases being mirrored then performance will drop on the principal databases. SQL Server 2008 does alleviate some of this with log stream compression - see here for details. The next most critical thing to consider is the memory and thread requirements for mirroring - each mirrored database takes one thread plus some memory, so on low-powered servers, lots of mirrored databases may be too much load on the server when combined with the regular workload.

Here are some examples that I've seen:

  • A customer with 150 databases, all of which have very small amounts of activity, and not all at the same time, has them all mirrored with no problem.
  • A customer with only 3 heavily-loaded databases, but without a great network connection, that can barely mirror one of the databases without the lack of network bandwidth causing workload degradation.

The key to success here is to do the log generation calculation and then if it seems that the available network bandwidth will support the number of databases you want to mirror, test it first before relying on it in production.

I guess the bottom-line here is that any broad guidance is only that - your mileage may (and probably will) vary. Always do your own calculations and testing.

Friday, November 09, 2007 3:58:53 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 

In one of the sessions Kimberly and I taught this week at SQL Connections, we discussed how to choose efficient data-types - I'd like to share the discussion here with an example.

The crux of the matter is in the amount of space required to store the data-types in the schema, and ensuring that the data-type choice matches the data being stored without incurring any unnecessary overhead. Here are a couple of examples:

  • A field that stores a person's age will usually be able to use a TINYINT type that can store a maximum value of 256 - unless its an archeological or historical database, for instance. Without putting a lot of thought into it, however, someone may choose to use an INTEGER type. What's the difference? A TINYINT take a single byte to store, whereas an INTEGER takes *4* bytes to store - wasting 3 bytes per record.
  • A field that stores a person's city of residence in the USA needs to be able to hold 24 characters (see my previous post on how long fields have to be) - so what data-type should you use? A CHAR (24) will be able to store all the possible values, but will *always* take up 24 bytes in the record as it's a fixed-length column. A VARCHAR (24), on the other hand will only store the number of bytes equal to the number of characters in the city name, so using the fixed-length type will waste a varying number of bytes per record.
  • In the USA, the names of the various States are commonly abbeviated to two characters - for instance, Washington = WA and California = CA. So is the best type to use CHAR (2) or VARCHAR (2)? Given that the abbreviations are always 2 characters, they'll both store 2 bytes all the time. BUT, a variable length column has a two-byte overhead (the two-byte pointer for the column that's stored in the variable-length column offset array in the record) - so in this case the best data-type to use is CHAR (2).

You can see how choosing data-types without considering whether it's the best type can lead to wasted space. Let's look at an example schema to support a population census of the USA.

CREATE TABLE CensusSchema1 (

SSN CHAR (256),

StateName CHAR (256),

Sex INTEGER,

Employed INTEGER,

SelfEmployed INTEGER,

EthnicOrigin INTEGER,

MaritalStatus INTEGER,

NumberOfDependents INTEGER,

Age INTEGER,

CountryOfBirth INTEGER,

HouseholdIncomeGroup INTEGER,

ZipCode5 INTEGER);

GO

At first glance this may look reasonable, but digging in you'll see that many of the columns are over-sized. Here's a cleaned-up schema to compare against, with notes on the per-column savings:

CREATE TABLE CensusSchema2 (

SSN CHAR (9), -- saving 244 bytes

StateName VARCHAR (256), -- saving at least 240 bytes (longest state name is 14 + 2-byte varchar overhead)

Sex BIT,

Employed BIT,

SelfEmployed BIT, -- saving 11 bytes altogether over these three fields

EthnicOrigin TINYINT, -- saving 3 bytes

MaritalStatus TINYINT, -- saving 3 bytes

NumberOfDependents TINYINT, -- saving 3 bytes

Age TINYINT, -- saving 3 bytes

CountryOfBirth TINYINT, -- saving 3 bytes

HouseholdIncomeGroup TINYINT, -- saving 3 bytes

ZipCode5 INTEGER); -- no saving

GO

The bad schema gives a per-row size of 574 bytes and the cleaned-up schema is no more than 48 bytes per-row. I designed these two schemas to support a US census. The population of the USA is approx. 300 million. This means the bad schema would take around 190GB to store all the info, and the cleaned-up schema only takes around 15GB - more than 12 times more efficient! Now we're starting to see how poor data-type choice can lead to poor performance.

Wider rows means:

  • Fewer rows can fit on an 8k page.
  • More CPU is necessary to crack open a record due to the number of CPU data cache line invalidations necessary to read the record into the CPU's various caches (every time a cache line is invalidated it takes a bunch of CPU cycles - see here for an explanation of CPUs and cache lines).

Less rows per page means:

  • More pages are needed to store the data
  • Indexes could have a smaller fan-out (if the index keys are wider than then need to be), leading to more levels in the index and less efficient index navigation from the index root page to the index leaf-level.

More pages means:

  • More IOs are necessary to read the same amount of actual data
  • More buffer pool memory is necessary to hold the same amount of actual data
  • More disk space is necessary to hold the same amount of actual data

Clearly there's a link between various aspects of workload performance and badly chosen data-types.

Friday, November 09, 2007 2:31:10 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 
Tuesday, November 06, 2007

A couple more questions from the last couple of classes.

Q1) Why doesn't performing an index rebuild alter the fragmentation?

A1) Here are the possibilities - all of which I've seen happen:

  • There isn't an index - either DBCC DBREINDEX or ALTER INDEX ... REBUILD are being run on a table that only has a heap, and so the (extent) fragmentation of the heap isn't changing because there's no way to rebuild a heap (except by the heavily NOT recommended method of creating and then dropping a clustered index).
  • The index is too small. An index with only a handful of pages may not show any changes in fragmentation because all the pages are single, mixed pages (see my previous post on extent types for more info) and so rebuilding the index does nothing.
  • The workload and schema are such that by the time the rebuild has finished and the fragmentation calculation has been done again, the index is already getting fragmented again.
  • The Extent Scan Fragmentation result from DBCC SHOWCONTIG is being used to gauge fragmentation for an index stored in a filegroup with multiple files. The Extent Scan Fragmentation in DBCC SHOWCONTIG does not cope with multiple files (as is documented in Books Online) and so the value may even go UP in some cases!

Q2) What operations take advantage of minimal-logging when the recovery mode is BULK_LOGGED?

A2) The list is very small - 4 four classes of operations:

  • Index builds, rebuilds, or drop of a clustered index (NOT index defrags with DBCC INDEXDEFRAG or ALTER INDEX ... REORGANIZE - this is a common misconception).
  • Bulk load operations (i.e. BCP, INSERT ... SELECT * FROM OPENROWSET (BULK...), and BULK INSERT).
  • Insert or appends of LOB data (either using WRITETEXT/UPDATETEXT for TEXT/NTEXT/IMAGE data types, or UPDATE with a .WRITE clause).
  • SELECT INTO operations on permanent tables.

For these operations, only the allocations are logged in the transaction log. Any extents that are allocated and changed through a minimally-logged operation are marked in the ML bitmaps (one for every 4GB of each file) and then the next transaction log backup will also read all those extents and include them in the backup.

Tuesday, November 06, 2007 3:17:02 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 

This is a question that came up yesterday in out Disaster Recovery class so I'm typing it up in between attending sessions at Microsoft Day here at SQL Connections. It's an interesting experience watching all the MS speakers walking around in the distinctive blue shirts and no longer having to wear one myself.

The question is the following - why does DBCC CHECKDB terminate with an out-of-space error in SS2005? A customer had a 500GB database spread over 17 LUNs on a SAN, with each LUN having only 5GB free. There's a heavy concurrent workload running while CHECKDB is running and very often it doesn't compelte but instead stops with an error. What's going on?

The reason for this is the way that CHECKDB gets a transactionally consistent view of the database. In a nutshell, it creates a internal database snapshot of the database and then checks the database snapshot (you can read more about this in my previous post detailing all the steps of CHECKDB). A database snapshot needs to have one snapshot file for each file in the source database. In the case of CHECKDB, the snapshot files are created as alternate-streams of the existing database files - i.e. stored on the same disk volume - and you have no control over this. This means any changes to the database while CHECKDB is running will cause these alternate streams to grow.
 
If there's a significant concurrent workload while CHECKDB is running, then these alternate streams can get very big very quickly. In the case described in the question, the workload caused one of the alternate streams to take up all available space on the LUN and then it ran out of space. When this happens the snapshot is no longer valid and so CHECKDB has to stop.
 
There are two solutions to this. The obvious first one is to run CHECKDB in a period with low concurrent workload. That's not feasible for many 24x7 shops so the better solution is to create your own database snapshot (so you can control the placement of it) and then run CHECKDB on that. This is no different than having CHECKDB create its own snapshot.
 
There's another reason that CHECKDB may stop with an out-of-space error.  CHECKDB has to validate the contents of the whole database. As such it needs to store info about things it has seen at one point during database scannig so that it can match that info against things it sees at later points during database scanning. These bits of info are stored in an in-memory worktable. In many cases though, the amount of info being stored exceeds the available memory and the worktable needs to spill over to tempdb. (For example, running CHECKDB against a 1TB database on a machine with 16GB of memory - the amount of info that CHECKDB needs to store the intermediate pieces of info will likely exceed the memory available to SQL Server). So - if tempdb isn't big enough to store the worktable, it has to grow. If it can't grow, then CHECKDB will fail because it needs the worktable to operate.
 
You can check in advance how much (estimated) space will be needed by CHECKDB by running it using the WITH ESTIMATEONLY option and then making sure that tempdb is sized accordingly to accomodate the CHECKDB requirements AND the regular tempdb space requirements. See Capacity Planning for tempdb in Books Online for more info.
Tuesday, November 06, 2007 2:23:12 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Monday, November 05, 2007

Sitting here in our Disaster Recovery class at SQL Connections and Kimberly's on till lunch so I'm banging out a quick blog post covering the database mirroring (DBM) specific questions.

Q1) Can I use IP addresses instead of server names when using the DBM Monitor?

A1) Unfortunately not.

Q2) Is there any in-built throttling mechanism in DBM to allow the mirror to ca