Monday, July 21, 2008

While we were at TechEd in June, Kimberly and I participated in an hour-long discussion panel (Leveraging SQL Server Technologies to Build a Solid High-Availability Strategy) that was video-taped by the TechEd Online folks. It's now been edited and is available for download/viewing. We cover everything from requirements analysis to technology details in SQL Server 2008. The other panel members were Satya Jayanty, Allan Hirt, Kevin Farlee, and Amit Bansal.

You can browse the various online videos at http://technet.microsoft.com/en-us/events/teched/cc561184.aspx or go straight to the panel discussion video here.

Enjoy!

Monday, July 21, 2008 10:00:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, July 10, 2008

Now that I've done all the business-related blog posts, back to the good stuff to stop people complaining!

Something that's cropped up a few times over the summer so far is people trying to repair boot page corruptions.

First off, what's a boot page? Every database has a single page that stores critical information about the database itself. It's always page 9 in file 1 (the first file in the PRIMARY filegroup). You can examine the page using DBCC PAGE and it will interpret all the fields for you, but there's another command, DBCC DBINFO, that also dumps all this info (in fact the DBCC PAGE code calls the same underlying dumping code). This command is undocumented and unsupported but widely known and 'documented' in lots of places on the web - given that it uses the same code as DBCC PAGE, it's just as safe to use IMHO.

So what's on the boot page?

DBCC DBINFO ('BootPageTest');
GO

DBINFO STRUCTURE:


DBINFO @0x5BF6EF84

dbi_dbid = 19                        dbi_status = 65536                   dbi_nextid = 2073058421
dbi_dbname = BootPageTest            dbi_maxDbTimestamp = 2000            dbi_version = 611
dbi_createVersion = 611              dbi_ESVersion = 0                   
dbi_nextseqnum = 1900-01-01 00:00:00.000                                  dbi_crdate = 2008-07-10 15:53:18.843
dbi_filegeneration = 0              
dbi_checkptLSN

m_fSeqNo = 41                        m_blockOffset = 29                   m_slotId = 55
dbi_RebuildLogs = 0                  dbi_dbccFlags = 2                   
dbi_dbccLastKnownGood = 1900-01-01 00:00:00.000                          
dbi_dbbackupLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_oldestBackupXactLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_LastLogBackupTime = 1900-01-01 00:00:00.000                          
dbi_differentialBaseLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_createIndexLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_versionChangeLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_familyGUID = a4e88c13-b4cf-4320-834e-92b237244d4b                    
dbi_recoveryForkNameStack


entry 0

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
m_guid = a4e88c13-b4cf-4320-834e-92b237244d4b                            

entry 1

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
m_guid = 00000000-0000-0000-0000-000000000000                            
dbi_differentialBaseGuid = 00000000-0000-0000-0000-000000000000           dbi_firstSysIndexes = 0001:00000014
dbi_collation = 872468488            dbi_category = 0                     dbi_maxLogSpaceUsed = 231936
dbi_localState = 0                   dbi_roleSequence = 0                
dbi_failoverLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_dbmRedoLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_dbmOldestXactLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_dbMirrorId = 00000000-0000-0000-0000-000000000000                    
dbi_pageUndoLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_disabledSequence = 0            
dbi_dvSplitPoint

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_CloneCpuCount = 0                dbi_CloneMemorySize = 0             
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

There's all kinds on interesting things in there, for instance:

  • dbi_version and dbi_createversion: the physical version number of the database (and when it was created). See question 1 in the August 2008 SQL Q&A column in TechNet Magazine for an explanation (see here).
  • dbi_RebuildLogs: a count of the number of times the transaction log has been rebuilt for the database. PSS can use this to tell whether corruption problems could have been caused by DBAs rebuilding the log
  • dbi_dbccLastKnownGood: the completion time of the last 'clean' run of DBCC CHECKDB
  • a bunch of different LSNs related to checkpoint, backups, database mirroring
  • dbi_LastLogBackupTime: self-explanatory
  • dbi_differentialBaseGuid: the GUID generated by the last full database backup. Differential backups can only be restored on top of a matching full backup - so an out-of-band full backup could screw-up your disaster recovery - see this blog post for more info.

Now, what about if this page is corrupt in some way? I corrupted the BootPageTest database to have a corrupt boot page. Let's see what happens:

USE BootPagetest;
GO

Msg 913, Level 16, State 4, Line 1
Could not find database ID 19. Database may not be activated yet or may be in transition. Reissue the query once the database is available. If you do not think this error is due to a database that is transitioning its state and this error continues to occur, contact your primary support provider. Please have available for review the Microsoft SQL Server error log and any additional information relevant to the circumstances when the error occurred.

Okay - let's try setting the database into EMERGENCY mode:

ALTER DATABASE BootPageTest SET EMERGENCY;
GO

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xcdee22fa; actual: 0xcb6ea2fa). It occurred during a read of page (1:9) in database ID 19 at offset 0x00000000012000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\BootPageTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Hmm. What about running DBCC CHECKDB?

DBCC CHECKDB ('BootPageTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 922, Level 14, State 1, Line 1
Database 'BootPageTest' is being recovered. Waiting until recovery is finished.

It's not looking good. Obviously the change to EMERGENCY mode couldn't complete properly. What's the database status?

SELECT [state_desc] FROM sys.databases WHERE [name] = 'BootPageTest';
GO

state_desc
------------------------------------------------------------
RECOVERY_PENDING

The boot page is inaccessible so in effect the database is inaccessible too, this is what the database state means in this case.

So what does this mean? If the boot page is corrupt, you can't run DBCC CHECKDB so you can't possibly run repair, and you can't put the database into EMERGENCY mode so you can't extract data into a new database. It means that there's NO WAY to recover from a corrupt boot page EXCEPT to restore from backups. One more reason to have backups...

Thursday, July 10, 2008 4:01:42 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 

Well, almost... Kimberly and I have agreed to co-author the upcoming SQL Server 2008 Internals book with our good friend Kalen Delaney. This is the 2008 evolution of Kalen's Inside SQL Server series of books and will be published by MS Press around February 2009. The other authors are (also our good friends) Adam Machanic and Conor Cunningham.

Kimberly's going to write the chapter on Index Internals and I'm going to write the chapter on DBCC Internals. This is very exciting as these are our respective favorite subjects, as you probably already know - these won't be short chapters :-)

It's going to be a busy rest of the year - phew!

PS Next year we have plans to write a book ourselves - watch this space...

Thursday, July 10, 2008 2:37:31 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 

September, October, and November are going to be a whirlwind this year - after 3 weeks in the UK and Ireland in September, teaching and the San Francisco Power Workshop in October, we have three back-to-back conference weeks in Barcelona, Las Vegas, and back to Seattle! Hey - who booked that schedule?!?!?! Well, at least it helps us keep our top frequent-flyer status on United :-)

Here's the line-up - see our Upcoming Events page for all the abstracts (including those from Bob Beauchemin and Stacia Misner too).

TechEd EMEA IT Pro, November 3-7, Barcelona, Spain

  • We're still working with the TechEd team to finalize the content we'll be delivering but it's looking like the same three sessions from TechEd US, plus a bunch of new ones and Instructor-Led-Labs
  • Sessions (at least):
    • Are Your Indexing Strategies Working?
    • Corruption Survival Techniques: From Detection to Recovery
    • Essential Database Maintenance

SQL Server Connections Fall, November 9-14, Las Vegas, USA

  • This is the second of the twice-yearly SQL Connections conferences that Kimberly and I Co-Chair
  • Workshops:
    • November 9: Pre-pre-con: Database Best Practices for the Involuntary DBA
    • November 10: Pre-con: Relational Data Warehousing: Leveraging Key Features of SQL Server 2005/2008
  • Sessions:
    • Index Internals and Usage
    • Essential Database Maintenance
    • DBCC CHECKDB: The Definitive Guide
    • Follow the Rabbit: Interactive Q&A on Database Maintenance

PASS Community Summit 2008, November 17-21, Seattle, USA

  • As unbelievable as this may be, I've *never* been to PASS before, even in the few times it was in Seattle - something always conflicted. Kimberly hasn't been since 2005 so it'll be cool for us both to be there this year.
  • Workshop:
    • November 18: Database Maintenance: From Planning to Practice to Post-Mortem
  • Spotlight Session:
    • Corruption Survival Techniques: From Detection to Recovery
Thursday, July 10, 2008 1:49:18 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

People have been complaining that I've stopped blogging so much - vacation folks, vacation! Today I've got a few class and conference posts to get through and then I'll get back to the technical posts.

We're doing a 2.5 day public class based on the SQL Server 2008 material we developed earlier this year. This will be part of a larger conference being hosted by Dev Connections in San Francisco, USA. Our workshop will run October 6th through 8th.

You can register and get more details at http://www.devconnections.com/SFWorkshops/default.asp?s=127.

Here's the abstract:

SQL Server 2008 offers an impressive array of capabilities for professional developers that build upon key innovations introduced in SQL Server 2005. The use of many of these will have manageability and infrastructure implications for a database—and hence the DBA! There are also enhancements to existing high-availability technologies, plus a variety of significant new tools to aid in managing performance, scalability, administration, and troubleshooting. This workshop helps you understand how to exploit the new toolset and how to manage a database that makes use of the new features in SQL Server 2008. The multi-day format of this event allows us to explore each feature in more detail, with more in-depth demonstrations and labs.

Topics covered include:

  • Availability Enhancements
    • Database Mirroring
    • Backup Compression
    • Peer-to-Peer Replication
  • Security Enhancements
    • Transparent Data Encryption
    • Extensible Key Management
    • All Actions Audited
  • Policy-Based Management and Multi-Server Administration
  • Troubleshooting and Throttling
    • Resource Governor
    • Extended Events
  • New Development Technologies
    • Spatial Indexes
    • Sparse Columns
    • Filtered Indexes and Statistics
    • Change Tracking and Change Data Capture
    • FILESTREAM
  • Performance Data Collection
  • Scalability Enhancements
    • Data Compression
    • Partition-Level Lock Escalation

This workshop runs Oct 6 (9am - 4pm), Oct 7 (9am - 4pm), Oct 8 (9am - 12pm).

Thursday, July 10, 2008 11:07:47 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Tuesday, July 08, 2008

Well, we're just back from vacation (photo blog post to follow) and I've heard that the feature article on Effective Database Maintenance I wrote for the August issue of TechNet Magazine is live on the web. It also includes a 5 minute long screencast I recorded where I demo the effect of database shrink on index fragmentation.

You can get to the article at http://technet.microsoft.com/en-us/magazine/cc671165.aspx. The topics covered are:

  • Managing data and transaction log files
  • Eliminating index fragmentation
  • Ensuring accurate, up-to-date statistics
  • Detected corrupted database pages
  • Establishing an effective backup strategy

It's written around 2-300 level and presents a good overview (well, at least I think so :-)) of the concepts involved.

Also, the August SQL Q&A column is available at http://technet.microsoft.com/en-us/magazine/cc671180(TechNet.10).aspx. This month's topics on the web (more in the print magazine) are:

  • Database version changes with upgrades
  • Benefits of partitioning
  • Consistency checking options for VLDBs

Enjoy!

Tuesday, July 08, 2008 9:35:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, June 26, 2008

One problem (the only one!) of going on vacation with Kimberly is that can be hard to banish SQL Server completely from conversation. Over breakfast this morning we were discussing the pros and cons of advising someone to use sp_attach_single_file_db as a way to shrink an out-of-control transaction log - with careful guidance it can be done, but there's a lot of scope for misuse and getting into trouble.

One problem with being on vacation in general is that your mind wanders away from the normal bounds of rational thought (well, at least mine does...) While discussing the merits of shrinking transaction logs I was cutting up my eggs and mused aloud on how much easier it was to divide an egg in half when it was scrambled compared to when it was raw - you can get a nice Euclidian straight edge. After that Kimberly had nothing else to say about transaction logs :-)

Then I wondered how far away we are from the mainland (we're on Maui for a week, then on a live-aboard dive boat out of Kona - the Kona Aggressor - for another week). Luckily the waitress brought the breakfast check so I spent 5 minutes doing the a2 = b2 + c2 calculation (where a was our flight length from Seattle, b is the distance south from Seattle, and c is the distance from the mainland). Figuring about 2700 miles for the flight, and 2000 miles south of Seattle (and no-doubt convincing everyone around us that I needed to use long multiplication, scientific notation, long division, and geometric figures to calculate the tip on the breakfast check), I came up with roughly 1800 miles as the distance of Hawaii from the mainland. In reality, the distance is about 1625 miles - not bad!

This is my first trip to Hawaii (and Kimberly's fourth, but first to Maui) - it's a very cool place. On Tuesday we took a long helicopter tour around the island (courtesy of Blue Hawaiian Helicopters) which gave us some stunning views of the volcanic scenery (we're doing a similar tour of the Big Island after the dive trip). Today we're going to drive to the top of the 10000 foot volcano to watch the sunset and do some bird-watching. Here are a few photos:

 

 

Ok - back to vacation...

Thursday, June 26, 2008 2:14:03 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Monday, June 23, 2008

As well as the usual round of conferences later this year, we've also organized some public classes in the UK after lots of requests. In between these two classes we'll be hopping over to Dublin to do a launch seminar for Microsoft on SQL Server 2008 - more details on that as they become available.

The UK classes are organized with our UK partners SQLKnowHow.com. We haven't taught in the UK since a one-day seminar we did with Tony Rogerson (one of the founders of SQLKnowHow) back in March last year so this is pretty exciting (and the Edinburgh class will be at my old alma-mater, The University of Edinburgh). The complete line-up is below - register now to avoid disappointment as the classes are filling up fast.

Best Practices in Performance and Availability for SQL Server 2005/2008

  • When: 1st to 3rd September, 2008
  • Where: Hatfield, Hertfordshire
  • Who: Paul and Kimberly
  • How much: See here for details, discounts, and early-bird specials
  • What:

    This class has three primary goals (for almost all topics/modules): planning, practice/implementation and post-mortem - with the largest emphasis on designing/implementing the RIGHT solution. Questions that you must ask are: How do you choose technologies to fit requirements and effectively use key features of SQL Server 2005/2008? How does your technology/choice affect workload performance?

    Only after an in-depth plan is developed should you move on to actual implementation. So what are the areas that you need to consider?

    • Architecting for Availability
    • Architecting for Performance
    • Maintaining Performance and Availability

    And just to be clear, this is not a high-level class on planning. This is an intense, in-depth class encompassing structures, internals, technologies and solutions. Planning is a critical part of performance, high-availability, database maintenance and disaster recovery - but the most-often disregarded.

    Performance tuning spans many areas within SQL Server from database creation to database design to the code you execute (ad-hoc or procedural). A single magic bullet does not exist (indexing is the closest thing to a magic bullet for some queries). However, to achieve a truly scalable and reliable database it takes a variety of best practices - from database creation (including file structure and placement) to table design and creation (using vertical and horizontal partitioning techniques) to system architecture (including disaster recovery planning and implementation) to ongoing maintenance. Whether you're trying to achieve high performance for a few users or scale to support thousands, there are numerous areas that you can tune to improve performance - proactively. But, how do you make this a reality?

    SQL Server 2005 and 2008 provide a variety of options to help keep your database more available. However, even in the event of a disaster, are you sure you know the best path for recovery - with the least amount of downtime and/or data loss? Putting a well-thought out plan into practice requires a thorough understanding of the technologies, their pitfalls and the effects of many technologies when combined. In terms of architecture, we will start by discussing the most important part of designing an available solution - requirements. Then we'll show how to use requirements to drive a technology decision - not the other way around, which happens so often and results in an inadequate implementation.

    No matter how much effort you spend on the design of your database, if you don't maintain it in production then it will suffer from performance and manageability problems - and possibly data loss and/or downtime. The key to availability and performance is well thought-out and automated database maintenance. The final part of the course will discuss maintenance strategies required to keep your carefully designed system available and performing well, plus a primer on recovering from disasters.

    If you're planning, or already manage, an enterprise system and want better performance and availability - then this is the place to be!

    Module List:

      1. Foundations - SQL Server structures and algorithms
      2. Architecting for Availability
      3. Architecting for Performance
      4. Maintaining Performance and Availability

Indexing for Performance in SQL Server 200/2005/2008

  • When: 8th to 9th September, 2008
  • Where: Edinburgh
  • Who: Paul and Kimberly
  • How much: See here for details, discounts, and early-bird specials
  • What:

    There are many areas of performance tuning in SQL Server: database design, application design, hardware/software configuration, and many more. But none are as important as indexing. Creating the "right indexes" is the most important thing you can do for performance and scalability. Is proper indexing something your application is missing? Do you realize the impact of your clustering key; forcing your base structure of your tables to be either ordered or unordered. If ordered is chosen, by what type of column(s) should the data be ordered? Is the decision solely based on query performance or are there other factors?

    Whether your system is 24x7 or a small system just trying to setup for future growth and improved performance this course is for you! We will cover the often-overlooked impacts of poorly chosen clustered indexes, where/why clustered indexes help the most and how the type of table and the type/frequency of your queries affect your decisions. Additionally, once the internals, statistics and base table structures have been defined, we will talk about indexing strategies for search arguments (including SQL Server 2008 Filtered Indexes), joins, aggregations and appropriate uses for indexed views. Finally, we'll discuss index maintenance as well as how to evaluate your indexing strategy over time to make sure it remains appropriate as your data and workload changes.

    If you want better performance and excellent insight into the wide range of indexing strategies - as well as how things work internally, this is the place to be!

    Course Modules

    1. Index Internals
    2. Statistics
    3. Indexing Strategies, Part I: SARGs and Joins
    4. Indexing Strategies, Part II: Aggregations and Indexed Views
    5. Index Maintenance
    6. Is Your Indexing Strategy Working?
Monday, June 23, 2008 5:07:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, June 19, 2008

TechEd US is done for another year! As I mentioned before, we did a lot of stuff but still found time to chill by the pool a few times in the Speaker Hotel. This was my first US TechEd since leaving Microsoft last year so it was quite interesting seeing the organizational side of things from the outside. I was particularly pleased that my new Surviving Corruption - From Detection To Resolution session clinched a prestigious top-10 rating (#6) for the whole conference - look out for it at all the other conferences I'll be at this year (next post today...)

Edit: Forgot to say - thanks to all those in the Olympia, WA User Group who came out yesterday to see us present the Surviving Corruption session!

We've already started posting scripts from our session demos (see the Past Conferences page) and I'm blogging detailed walkthroughs of my demos from the corruption session in my CHECKDB From Every Angle series. The online panel we did hasn't been released yet on the TechEd Online site - I'll blog when it is.

Now we're off for a couple of weeks of real vacation - flying, diving, bird-watching, and best of all, not working!

I'll leave you with my usual conference wrap-up... thanks to Carlos Santillana for the photos!

Thursday, June 19, 2008 2:30:20 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, June 11, 2008

Today I presented my brand new session Surviving Corruption: From Detection to Recovery at TechEd. I had a lot of fun putting together the demos, presenting the session, and talking to people afterwards. During the session, I promised to blog each of the demos so that everyone can run through them - here's the first one.

On SQL 2000, it was pretty easy to get into the system tables and manually change them - all you had to do was:

EXEC sp_configure 'allow updates', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

And then you could insert, update, and delete whatever you wanted in the all the system tables, including the critical three - sysindexes, sysobjects, and syscolumns. The problem was that sometimes people actually did this and messed things up - for instance, by manually deleting an object from sysobjects, but leaving around all the other info about the object - such as indexes and columns. DBCC CHECKCATALOG in SQL 2000 would find this, but DBCC CHECKDB would not - as it didn't run the DBCC CHECKCATALOG code - any most people do not run DBCC CHECKCATALOG at all. Many times now, I've seen databases upgraded to 2005 and suddenly DBCC CHECKDB is reporting metadata corruption errors - all because someone had manually changed the system tables on 2000, and I changed DBCC CHECKDB in 2005 to include the DBCC CHECKCATALOG checks.

This demo is all about that. I created a 2000 database, manually deleted a row in sysobjects and then upgraded the database to 2005. The corrupt database is available in a zip file - DemoCorruptMetadata.zip. If you unzip it into a folder C:\SQLskills then you can attach it using:

RESTORE DATABASE DemoCorruptMetadata FROM DISK = 'C:\SQLskills\DemoCorruptMetadata.bak'
   
WITH MOVE 'DemoCorruptMetadata' TO 'C:\SQLskills\DemoCorruptMetadata.mdf',
   
MOVE 'DemoCorruptMetadata_log' TO 'C:\SQLskills\DemoCorruptMetadata_log.ldf',
   
REPLACE;
GO

So what does the corruption look like on 2005?

DBCC CHECKDB (DemoCorruptMetadata) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=1) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=2) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
CHECKDB found 0 allocation errors and 2 consistency errors in database 'DemoCorruptMetadata'.

This is what we expect. Notice that there's no recommended repair level at the end of the output - this is because CHECKDB can't repair metadata corruptions. We can't fix this with a backup - unless we have a backup from 2000 from before the manual delete in the system tables. To fix this we'd need to go back to 2000, fix the corruption, and then upgrade again - usually not feasible.

Instead, we're going to fix it by manually altering the system tables in 2005 - something that's purportedly not possible. First let's see what tables there are that could include column information (remembering that the system catalogs were completely rewritten between 2000 and 2005):

SELECT [name] FROM DemoCorruptMetadata.sys.objects WHERE [name] LIKE '%col%';
GO

name
------------------
sysrowsetcolumns
syshobtcolumns
syscolpars
sysiscols

I know that sysrowsetcolumns and syshobtcolumns are involved at low-levels of the Storage Engine and don't contain relational metadata, so let's try syscolpars. I want to see what columns there are to see if one of the looks like an object ID, and another looks like a column ID. This query will just return the table columns, with no rows (because the condition 1=0 is always false:

SELECT * FROM DemoCorruptMetadata.sys.syscolpars WHERE 1 = 0;
GO

Msg 208, Level 16, State 1, Line 1
Invalid object name 'DemoCorruptMetadata.sys.syscolpars'.

I can't bind to internal system tables in 2005. But - I can bind to internal system tables using the Dedicated Admind Connection (or DAC for short). This is documented in Books Online at http://msdn.microsoft.com/en-us/library/ms179503.aspx. You can get to the DAC through SQLCMD using the /A switch. So - assuming I'm now connected through the DAC, I'll try that command again:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> SELECT * FROM sys.syscolpars WHERE 1=0;
2> GO
id          number colid       name

xtype utype       length prec scale collationid status      maxinrow xmlns
 dflt        chk         idtval

----------- ------ ----------- -------------------------------------------------
-------------------------------------------------------------------------------
----- ----------- ------ ---- ----- ----------- ----------- -------- -----------
 ----------- ----------- -------------------------------------------------------
-----------

(0 rows affected)
1>

This looks like the table. Now I'll query against it using the object ID from the original corruption message:

1> SELECT colid, name FROM sys.syscolpars WHERE id = 1977058079;
2> GO
colid       name
----------- --------------------------------------------------------------------
------------------------------------------------------------
          1 SalesID
         
2 CustomerID
(2 rows affected)
1>

Cool. So I'll try deleting the orphaned columns:

1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO
Msg 259, Level 16, State 1, Server ROADRUNNERPR, Line 1
Ad hoc updates to system catalogs are not allowed.
1>

Hmm. And it doesn't help if I set 'allow updates' to 1, or try putting the database into single-user mode.

There IS a way though. You can put the SERVER into single-user mode, then connect with the DAC and you can then update the system tables. This particular twist on using the DAC isn't documented anywhere except in an MSDN forum thread answered by someone from Microsoft (see here).

BEWARE (if I could put little flashing lights around this too then I would...) that this is undocumented and unsupported - misuse will lead to unrepairable corruption of your databases.

The sequence of events to follow is:

  • make a backup of the database just in case something goes wrong
  • shutdown the server
  • go to the binaries directory (e.g. C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn) and start the server in single-user mode using 'sqlservr -m'
  • connect back in using SQLCMD /A, and run the deleta again. This time it will work, but will give an error about metadata cache consistency:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO

(2 rows affected)
Warning: System table ID 41 has been updated directly in database ID 12 and cache coherence may not have been maintained. SQL Server should be restarted.
1>

  • The system table has been updated, but the in-memory cache of metadata is now out-of-sync with the system tables. So, shutdown the server again as the message suggests and restart it normally
  • run CHECKDB again and you'll see the corruption has been fixed.

Hope this helps some of you. Watch this space for the next demo from TechEd of repairing corruption when no backup is available.

Wednesday, June 11, 2008 5:42:23 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Monday, June 09, 2008

(I'm actually on-stage here at TechEd doing the  DAT track pre-con with Kimberly - she's on now until lunch so I'm catching up on forum problems...)

Here's a question that came up on of the SQLServerCentral.com corruption forums I monitor that I think is worth blogging about. To paraphrase:

I have a bunch of corruptions in a database, that look like they've been there for a while. Repair is my only option - it works but I'd like to know what data is being deleted. How can I do that? Here are some of the errors:

Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168576) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168577) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168578) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168579) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168580) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168581) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168582) could not be processed. See other errors for details.
Server: Msg 8976, Level 16, State 1, Line 2
Table error: Object ID 645577338, index ID 1. Page (1:168576) was not seen in the scan although its parent (1:165809) and previous (1:168575) refer to it. Check any previous errors.
Server: Msg 8978, Level 16, State 1, Line 2
Table error: Object ID 645577338, index ID 1. Page (1:168583) is missing a reference from previous page (1:168582). Possible chain linkage problem.

This is a clustered index that CHECKDB  will repair by deleting pages at the leaf-level - essentially deleting a bunch of records. The pages look to be trashed (there were a bunch more errors that I didn't include here that said the page headers were all corrupted - looked like the IO subsystem trashde a whole 64KB chunk of the disk) so there's nothing much else you can do. As the table has a clustered index, you can use the error messages to find the pages on either 'logical' side of the pages being deleted - and hence figure out the range of records that have been deleted.

The errors show that pages 168576 through 168582 in file 1 are corrupt. There are also errors that say the previous page of 168576 is 168575, and the next page of 168582 is 168583. If you do a DBCC PAGE of these two pages, you can find the lower and upper bound of the clustered index key values that have been lost. Think of three ranges:

  • the lower range of records that are intact, logically before the corrupt pages in the index
  • the range of records that will be deleted by repair
  • the upper range of records that are intact, logically after the corrupt pages in the index

To find the upper bound of the lower range:

DBCC TRACEON (3604); -- allows the output to come to the console
DBCC PAGE ('dbname', 1, 168575, 3);
GO

The key value in the slot at the end of output is the upper bound of the bottom range that's intact.

Then do:

DBCC PAGE ('dbname', 1, 168583, 3);
GO

The key value in the slot at the beginning of the output is the lower bound of the upper range that's intact.

Everything in the middle will be deleted. You could also try a DBCC PAGE on the corrupt pages themselves too - you might be able to see some data in them.

I'll be blogging a bunch more about repair after my corruption session this week at TechEd - watch this space!

Monday, June 09, 2008 7:54:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Sunday, June 08, 2008

Over the last few weeks I've seen (and helped correct) quite a few myths and misconceptions about index rebuild operations. There's enough now to make it worthwhile doing a blog post (and it's too hot here in Orlando for us to go sit by the pool so we're both sitting here blogging)...

Myth 1:  index rebuild pre-allocates the necessary space

This myth has two variations:

  1. The space for the new copy of the index is pre-allocated
  2. The space for the sort portion of the rebuild is pre-allocated

Neither of these are true. Index rebuild (whether online or offline, and at least as far back as 7.0) will create a new copy of the index before dropping the old copy. The pages and extents required to do this will always be allocated as needed, as with any other operation in SQL Server. The sort phase of an index rebuild, if required (in certain cases it is skipped in 2005), will adhere to the same allocation behavior.

Myth 2: indexes are rebuilt within a single file in a multi-file filegroup

This is a new one that I just heard yesterday - (paraphrasing) "In a two-file filegroup, an index in file 1 will be rebuilt into file 2. The next time it is rebuilt, it will be built in file 1. And so on".

This is untrue. Any time any allocations are done in a multi-file filegroup, the allocations are spread amongst all the files using the allocation system's proportional fill algorithm. In a nutshell, this says that space will be allocated more frequently from larger files with more free space than from smaller files with less free space. There is no concept in SQL Server of limiting allocations to a particular file in a multi-file filegroup.

Myth 3: non-clustered indexes are always rebuilt when a clustered index is rebuilt

This is untrue. The rules are a little complex here but can be summed up as follows:

  • In 2005+, rebuilding a unique or non-unique clustered index (without changing its definition) will NOT rebuild the non-clustered indexes
  • In 2000:
    • Rebuilding a non-unique clustered index WILL rebuild the non-clustered indexes
    • Rebuilding a unique clustered index will NOT rebuild the non-clustered indexes

The first few service packs of 2000 had bugs that changed the behavior of rebuilding unique clustered indexes back and forth - this is the source of much of the confusion around this myth.

For a much more detailed discussion of this, see my blog post from last Fall - Indexes From Every Angle: What happens to non-clustered indexes when the table structure is changed?.

Myth 4: BULK_LOGGED recovery mode decreases the size of the transaction log and log backups for an index rebuild

This myth is partly true.

Switching to the BULK_LOGGED recovery mode while doing an index rebuild operation WILL reduce the amount of transaction log generated, which is very useful for limiting the size of the transaction log file (note I say 'file', not 'files' - you only need one log file).

Switching to the BULK_LOGGED recovery mode while doing an index rebuild will NOT reduce the size of the transaction log BACKUP. Although the operation will be minimally-logged, the next transaction log backup will read all the transaction log since the last backup plus all the extents that were changed by the minimally-logged index rebuild. This will result in a log backup that's almost exactly the same size as for a fully-logged index rebuild. The ONLY time a log backup will contain data extents is when a minimally-logged operation has taken place since the last log backup - see here on MSDN for more info.

If you're considering using the BULK_LOGGED recovery mode, beware that you lose the ability to do point-in-time recovery to ANY point covered by a transaction log backup that contains even a single minimally-logged operation. Make sure that there's nothing else happening in the database that you may need to effectively roll-back with P.I.T. recovery. The operations you should perform if you're going to do this are:

  • In FULL recovery mode, take log backup immediately before switching to BULK_LOGGED
  • Switch to BULK_LOGGED and do the index rebuild
  • Switch back to FULL and immediately take a log backup

This limits the time period in which you can't do P.I.T. recovery.

Myth 5: online index rebuild doesn't take any locks

This myth is untrue. The 'online' in 'online index operations' is a bit of a misnomer.  Online index operations need to take two very short-term table locks. An S (Shared) table lock at the start of the operation to force all write plans that could touch the index to recompile, and a SCH-M (Schema-Modification - think of it as an Exclusive) table lock at the end of operation to force all read and write plans that could touch the index to recompile.

The most recent time this came up on the forums was someone noticing insert queries timing out after an online index rebuild operation had just started. The problem is that the  table lock that online index rebuild needs has to be entered into the grant queue in the lock manager until it can be acquired - and it will stay there until existing transactions that are holding conflicting locks either commit or roll-back. Any transaction that requires a conflicting lock AFTER the index rebuild lock has been queued but not acquired (and then released) will wait behind it in the lock grant queue. If the query timeout is reached before the transaction can get it's lock, it will timeout.

This is still much better than the table lock being held for the entire duration of the index rebuild operation. For more info, checkout this whitepaper on Online Index Operations in SQL Server 2005.

Sunday, June 08, 2008 9:12:56 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6]  | 
Thursday, June 05, 2008

That time has rolled around again and we're flying down to Orlando for TechEd US tomorrow - my first US TechEd since I left Microsoft. We're doing a lot of stuff this year - here's our schedule if you're going to be there:

Monday

  • Full day pre-con seminar: SQL Server 2008 Overview for DBAs

Tuesday

  • 13.15 - 14.30 (Room N230) DAT354 Are Your Indexing Strategies Working?
  • 15.00 - 16.00 (TechEd Online Stage) Panel: Leveraging SQL Server Technologies to Build a Solid High-Availability Strategy
  • 16.00 - 18.00 DAT track booth

Wednesday

  • 10.15 - 11.30 (Room N220D) DAT375 Corruption Survival Techniques: From Detection to Recovery
  • 11.30 - 14.45 DAT track booth
  • 15.00 - 16.00 Blogger's Lounge

Thursday

  • 10.15 - 11.30 (Room S230E) DAT363 Essential Database Maintenance
  • 11.45 - 13.00 Speader Idol judging
  • 14.30 - 18.00 DAT track booth

Hopefully a bunch of you will stop by and say hi - I'm looking forward to seeing some familiar faces and some new ones! I'll try to blog while I'm there on questions I get and I've got some cool demos for the corruption session that I'll be blogging about over the summer.

See you next week...

Thursday, June 05, 2008 1:26:54 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, May 29, 2008

Totally off-topic blog post this time. I haven't blogged in a week as we've been in Chicago and Illinois for the memorial for Kimberly's Father (see here). Everything went really well at the memorial and then the ashes scattering in Lake Michigan - perfect sailing weather! The only fly in the ointment came back to backups again. I bought a very cool video camera to make sure we captured the memorial for posterity (actually I probably went a little over-the-top but the HD picture quality is awesome - Canon XA H1). I video'd the whole memorial, and then out on the boat the next day. The only problem was that I didn't check the tapes before taping on the second day and managed to overwrite half of the memorial video. Should have taken a backup onto my laptop in the evening on the first day but too much rum was drunk in the Columbia Yacht Club in Chicago and I didn't think to check in the morning. Oops. After we've been burned so badly with Kimberly's computer mishaps (see my diatribe here), you'd think we'd have learned by now...

The last few days we've been in Galesburg, IL visiting Kimberly's Mom and Fort Madison, IA visiting her Grandmother. As a bird-watcher, this was paradise as I managed to pick-up eight new bird species for my life-list. Galesburg is the home of the largest railroad switch-yard in the world, and much as I like trains, it seems like most of the 150+ trains per day that go through Galesburg actually go through at night, making lots of noise as the do so - which doesn't make for the best sleep.

Here are some pictures from Galesburg... (click for larger versions)

Okay - so why does the title mention movie plots? And why is this blog post filed under the Bad Advice tag? Well, it would be bad advice for me to recommend you go to see Indiana Jones and the Kingdom of the Crystal Skull, which we saw this evening. We're both *huge* Indiana Jones fans, but this movie was pretty bad. Contrived plot, boring dialog, wooden characters, and a predictable ending. Without giving anything away, the refrigerator scene is totally unbelievable, the accents of the baddies are cliched and awful, and what's with the cutsie gophers at the start? Well, I enjoyed a few bits here and there but I was ready to leave after about half an hour. I can't believe Harrison Ford made this movie... Oh well - I'm sure opinions will vary but I think they should have left the series to end with The Last Crusade back in 1989.

Tomorrow we fly home and next week we're back to work for a little bit before flying out again to TechEd on Friday. And I'll be finishing up some exciting 2008 whitepapers for the SQL teaam and back to blogging about technical stuff.

Thursday, May 29, 2008 4:34:07 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, May 22, 2008

Before I start, I want to make it clear that you can only hit this bug if you ALREADY have corruption, that it's quite rare, and that there is a workaround.

I've noticed a few more people in the forums having CHECKDB fail with this particular error in the last month

Msg 8967, Level 16, State 216, Line 1
An internal error occured in DBCC which prevented further processing. Please contact Product Support.

instead of completing properly and listing the corruptions in the database.

Whenever CHECKDB is using a database snapshot, it must check that the page it read through the snapshot does not have an LSN (Log Sequence Number) higher than that when the snapshot was created. If it did, this would mean that the page was modified AFTER the snapshot was created and hence CHECKDB would be working from an inconsistent view of the database. If this case is discovered, CHECKDB stops immediately. When I rewrote CHECKDB for SQL Server 2005, I changed a bunch of code assertions into seperate states of the 8967 error, so that CHECKDB would fail gracefully if some condition occured that indicates a bug or something that should never happen. State 216 is for the bad LSN condition I've just described.

I used to think it was caused by a race condition with the NTFS code that implements sparse files, which is used by the hidden database snapshot that CHECKDB uses by default. However, I've come to learn that this is a bug in CHECKDB (not one of mine I should say :-)) that causes this behavior under certain circumstances when corruption is present. The bug is that if a corrupt page fails auditing inside CHECKDB, the LSN check is still performed. If the corruption affects the LSN stamped in the page header, the 8967 error could be triggered. I've seen this a handful of times in the last few weeks - hence the need for a blog post. I've discussed this with the dev team and hopefully the fix will make it into the next SPs for 2005 and 2008 (too late to fix such a rare problem in such a critical component at this stage of 2008 development). They're going to put a KB article together too - but in the meantime, I wanted to get this on the Internet so Google/Live Search pick it up.

Now let's repro the problem. Starting with a simple database and table, I'll find the first page so I can corrupt it.

CREATE DATABASE TestCheckdbBug;
GO
USE TestCheckdbBug;
GO
CREATE TABLE test (c1 INT, c2 CHAR (5000));
INSERT INTO test VALUES (1, 'a');
GO
EXEC sp_AllocationMetadata 'test';
GO

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
-----------  --------  -----------------  ---------------  ----------  ---------  --------------
test         0         72057594042318848  IN_ROW_DATA      (1:143)     (0:0)      (1:152)

Now I'm going to corrupt the page type on page (1:143) to be 255 (an invalid page type), which will guarantee the page fails the audit inside CHECKDB.

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Now I'm going to corrupt the LSN on that page such that it's guaranteed to be higher than the creation LSN of the database snapshot (basically by filling the first part of the page header LSN field with 0xFF).

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

Bingo! And in the error log, there's some diagnostic information so we can tell which page caused the problem:

2008-05-22 14:55:01.95 spid53   DBCC encountered a page with an LSN greater than the current end of log LSN (31:0:1) for its internal database snapshot. Could not read page (1:143), database 'TestCheckdbBug' (database ID 15), LSN = (-1:65535:18), type = 255, isInSparseFile = 0. Please re-run this DBCC command.
2008-05-22 14:55:01.95 spid53   DBCC CHECKDB (TestCheckdbBug) WITH all_errormsgs, no_infomsgs executed by ROADRUNNERPR\paul terminated abnormally due to error state 1. Elapsed time: 0 hours 0 minutes 0 seconds.

Note the page ID (in black bold above) tells us the bad page and the LSN (in blue bold above) reflects the corruption that I caused. If the page ID field of the header was corrupt, it wouldn't be possible to tell from these diagnostics which page is corrupt.

However, all is not lost. This bug means that under these circumstances the default online behavior of CHECKDB can't run. The workaround is to use the WITH TABLOCK option of CHECKDB, which does offline checking and doesn't need the snapshot - but the trade-off is that an exclusive database lock is required for a short time and then shared table locks for all tables in the database (this is why online is the default). Running this option on my corrupt database gives:

DBCC CHECKDB ('TestCheckdbBug') WITH TABLOCK, ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Which are the exact same results we had before I corrupted the LSN field (this is expected, as there is no check of a page's LSN field EXCEPT when running from a database snapshot). Now we can proceed to restore/repair as appropriate.

So - a scary little bug that has caused some people headaches, but I want to stress again - this can only happen if the database is ALREADY corrupt, and that it's quite rare. Hope this helps some of you picking this up from search engines in the future.

Thursday, May 22, 2008 2:20:58 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Monday, May 19, 2008

My first magazine article is in print! I've taken over the bi-monthly SQL Q&A column for TechNet Magazine and I just received the June magazine in the mail today with my first column in it. Topics covered are:

  • Creating corruption and using page checksums
  • The shrink-grow-shrink-grow trap
  • How many databases can be mirrored per instance
  • A tip on changing the default server port, from Jens Suessmeyer

I've also just completed a feature article for either the July or August issue dealing with database maintenance for the 'involuntary' DBA - more details when it gets published.

If you don't get the print version of TechNet Magazine, you can get to this month's SQL Q&A column at http://technet.microsoft.com/en-us/magazine/cc510328.aspx. There may not be anything new if you've been following my blog for a while, but if you've just started, it's worth a quick look.

Enjoy!

PS Let me know if you've got any good questions - I've already completed the August column but I'd like to hear of any questions you may have for later columns.

Monday, May 19, 2008 11:06:59 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 

Theme design by Jelle Druyts

Pick a theme: