Disaster recovery 101: fixing a broken system table page

This post is about a disaster-recovery scenario I described in our bi-weekly newsletter a couple of weeks ago; I wanted to make sure it’s out on the web too for people to find and use.

I was helping someone from an online forum try to recover data from a corrupt database. They did not have any up-to-date backups without the corruption in them, so fixing their backup strategy was one piece of advice they were given by a few people.

The output from DBCC CHECKDB on the database was:

Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0x0; actual signature: 0x5555300). It occurred during a read of page (1:58) in database ID 10 at offset 0x00000000074000 in file 'D:\dbname.mdf:MSSQL_DBCC10'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
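
As a quick sanity check (my arithmetic, not part of the error output), the byte offset in the 824 message maps directly to the page ID, because database pages are 8192 bytes:

SELECT CONVERT (INT, 0x74000) / 8192 AS [page ID]; -- returns 58, matching page (1:58)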

They’d tried running repair, but of course if DBCC CHECKDB says that it has to stop (i.e., error 8921), then it can’t run repair.

I explained this, and that page 1:58 is a system table page that repair can’t fix, so they’d have to script out as much of the database schema as possible, create a new database, and extract as much data as possible from the broken database.

I also explained that the page is part of the sys.syscolpars table, which is the equivalent of the old syscolumns system table, so that approach might not work if the corruption was such that it stopped the Query Processor from being able to use the table metadata.

Unfortunately my suspicions were correct, and the script/extract approach did indeed fail.

On a whim, I suggested trying something radical. A few years ago I blogged about a way to ‘fix’ a broken boot page by using a hex editor to overwrite it with the boot page from an older copy of the database (see here), and I’ve demonstrated the technique at various conferences. I’d never tried it on a system table page before, but I figured that the page ID was low enough that the page likely hadn’t changed for a while.

What do I mean by that? Well, the sys.syscolpars clustered index is ordered by object ID, so the first few pages in the clustered index (of which page 1:58 is one) hold the column metadata for the system tables themselves, which have very low object IDs. There’s never going to be a case where a new user table gets created and causes an insert onto one of these pages.
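
If you want to confirm for yourself which object a page belongs to, here’s a hedged sketch using the undocumented (but well-known) DBCC PAGE command; the database name is a placeholder, and only ever do this on a copy of the database:

DBCC TRACEON (3604); -- send DBCC PAGE output to the console instead of the error log
DBCC PAGE (N'dbname', 1, 58, 0); -- database, file ID, page ID, print option 0 = page header only
-- In the header output, the Metadata: ObjectId field identifies the owning object
-- (object ID 41 is sys.syscolpars)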

This means that an older backup of the database would have the current state of page 1:58 in it. So I suggested using the boot page hack on page 1:58 from the person’s older backup.
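
To be explicit about the mechanics, here’s a hedged outline of the transplant; all file names, paths, and logical names below are placeholders, and this should only ever be done on copies of the databases:

-- Restore the older backup as a copy, alongside the corrupt copy
RESTORE DATABASE [dbname_old]
FROM DISK = N'D:\Backups\dbname_older.bak'
WITH MOVE N'dbname' TO N'D:\dbname_old.mdf',
     MOVE N'dbname_log' TO N'D:\dbname_old.ldf';
-- Then take both databases OFFLINE and, in a hex editor, copy the 8192 bytes at
-- file offset 58 * 8192 = 475136 (0x74000) from dbname_old.mdf over the same
-- range in the corrupt copy's mdf file, and bring the patched copy ONLINE again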

And it worked!

Luckily there wasn’t any other corruption in the database, so all the person had to do was perform root-cause analysis and remediation, and fix the backup strategy so the situation wouldn’t arise in future.

Summary: In a disaster situation when backups aren’t available, don’t be afraid to try something radical. As long as you try it on a copy of the database, it’s not as if you can make the situation any worse. And if you’re lucky, you’ll be able to make the situation a lot better.

Summer 2018 classes in London open for registration

Due to popular demand, we’re coming back to London in 2018 and I’ve just released our classes for registration!

All classes have discounts for registering before the end of 2017! (details on the individual class web pages…)

Our classes in September in London will be:

  • IEPTO1: Immersion Event on Performance Tuning and Optimization – Part 1
    • September 10-14
  • IEAzure: Immersion Event on Azure SQL Database and Azure VMs
    • September 10-11
  • IECAG: Immersion Event on Clustering and Availability Groups
    • September 12-13
  • IEPTO2: Immersion Event on Performance Tuning and Optimization – Part 2
    • September 17-21

You can get all the logistical, registration, and curriculum details by drilling down from our main schedule page.

We hope to see you there!

TSQL Tuesday #96: Folks Who Have Made a Difference

It’s been almost three years since I wrote a T-SQL Tuesday post (shame on me!), but this is one I definitely want to contribute to. It’s hosted by Ewald Cress and is about “the opportunity to give a shout-out to people (well-known or otherwise) who have made a meaningful contribution to your life in the world of data.”

There are three people I want to call out, in the order that they came into my life and helped me out.

Firstly, Dave Campbell, who left Microsoft as a Technical Fellow last year after 22 years in Microsoft’s world of data. When I joined the SQL Server team from DEC in 1999, Dave had already been there 5 years and was the Development Lead of the Access Methods team in the Storage Engine. Dave has always been a brilliant engineer, a calm and insightful manager, and a willing mentor. He taught me a lot about engineering, managing crises, and being a manager. I was amazed in late 2003 to find myself becoming the Dev Lead of the Access Methods team and stepping into his shoes.

I’m sad to say that over the years I’ve lost touch with Dave, but I’m forever grateful for the influence he had on my professional career.

Secondly, my great, great friend Bob Ward. I first met Bob a few months into my tenure at Microsoft, and we continued to meet and swap emails around Product Support matters, but I didn’t start working closely with him until a few years later. Bob was the inspiration for me to want to help customers: to help them find why SQL Server was broken for them, to fix bugs, and to make sure that people in Product Support were saying and doing the right thing for customers. He inspired me because that was his passion, and his entire job. We’d spend many hours each week, on the phone and through emails, discussing things and sorting stuff out. This led me to champion adding an entire pillar to the new engineering process that came two-thirds of the way through SQL Server 2005 development: supportability, making sure all facets of the SQL Server box could be understood and debugged by Product Support. This involved driving and coordinating all the development teams to build and deliver training materials on how SQL Server worked, how to debug it, and how Product Support should approach it, and to build into each area the tools, messages, and hooks to allow such investigations to be done.

Bob and I (and Bob’s lovely wife Ginger, plus Kimberly) continue to be close friends and we get together whenever we can (which is a lot more frequently now that Bob’s in the product group and up in Redmond regularly). Of all the people I met at Microsoft, Bob made the greatest contribution to who I am today by inspiring me to help people.

Thirdly, my wonderful wife Kimberly, who helped me develop my speaking skills and made me ‘less Paul’, as she puts it (learning humility, presenting with empathy, and removing a lot of the arrogance with which I left Microsoft). I’d just started presenting when I met Kimberly at TechEd 2006 in Boston, and I had a *lot* to learn. I quickly adopted her style of presenting, which works for me. It involves going against one of the central things people are taught about presenting – few bullets with few words. We both (and all of SQLskills) have dense slides with lots of bullets. This is so that people can read the deck later and tell what we were talking about, rather than having pictures of kittens, trees, race-cars, or whatever, which tell you nothing several months later. Some of you will disagree – each to their own. The central theme, though, is making sure that people have learned and understand how and why things are the way they are, not just what the answer is.

The other thing (among so many in my life since meeting her) that I want to thank Kimberly for here is SQLskills itself. Kimberly’s been a successful business owner since the early 1990s, and she started SQLskills.com in 1995. It was incredibly cool that I could leave Microsoft in 2007 and walk straight into a thriving business with a stellar reputation and start teaching and consulting.

You’ll notice that I didn’t say ‘lastly’ above – I said ‘thirdly’. There are two more groups of people I want to give a shout out to.

Firstly, the incredibly talented group who work with us at SQLskills (Erin, Glenn, Jon, Tim, and previously Joe Sack – another great friend). I continually learn new things from them and I’m sincerely thankful that they’ve chosen to work at SQLskills for so long (Jon for 6+ years, Erin and Glenn for 5+ years, and Tim for almost 3 years). They’re all experts in their specialties and immensely capable, they keep me on my toes, and they’re wonderful people and friends.

Lastly, and most importantly, the people who’ve had the most influence in my data world are the SQL Server community: my fellow MVPs, all the MCM community, everyone who’s come to a class, attended a session, read a blog post or article, watched a Pluralsight course, posted a question, or tweeted on #sqlhelp. A huge part of my personality is helping people understand SQL Server. It’s what drives me to blog, answer random email questions, put together the waits library, teach, and more.

You’ve all helped shape me into the person I am today in the data world, and I thank you sincerely for it.

New Pluralsight course: Understanding and Performing Backups

My latest Pluralsight course has been published – SQL Server: Understanding and Performing Backups. It’s 2.5 hours long, and from the course description:

It’s really impossible to escape the need for backups: even if you have some sort of synchronous replica of your databases, disasters can still happen that require restoring from backups. This means it’s critical that you understand what backups are and how to create them. In this course, SQL Server: Understanding and Performing Backups, you’ll learn what SQL Server backups are, why they’re required, and how they work. First, you’ll explore how to formulate a backup strategy based on business requirements and what restores you may need to perform. Next, you’ll delve into commonly used backup options. Finally, you’ll discover how to use backup compression and backup encryption, and how to maintain backup integrity. When you’re finished with this course, you’ll have the skills and knowledge to confidently perform backups in your environment and create a backup strategy.
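
To give a flavor of the T-SQL involved, here’s a minimal sketch of the three basic backup types, with compression and checksums enabled; the database name and paths are placeholders:

BACKUP DATABASE [YourDB] TO DISK = N'D:\Backups\YourDB_full.bak'
WITH COMPRESSION, CHECKSUM, INIT; -- full backup, compressed, verifying page checksums

BACKUP DATABASE [YourDB] TO DISK = N'D:\Backups\YourDB_diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM, INIT; -- changes since the last full backup

BACKUP LOG [YourDB] TO DISK = N'D:\Backups\YourDB_log.trn'
WITH COMPRESSION, CHECKSUM, INIT; -- log backup (full and bulk-logged recovery models only)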

The modules are:

  • Introduction
  • Full and Differential Data Backups
  • Transaction Log Backups
  • Log Backup Options and Considerations
  • Backup Strategies
  • General Backup Options
  • Backup Compression, Encryption, and Integrity

Check it out here.

We now have more than 165 hours of SQLskills online training available (see all our 56 courses here), all for as little as $29/month through Pluralsight (including more than 6,000 other developer and IT training courses). That’s unbeatable value that you can’t afford to ignore.

Enjoy!

Calling all user group leaders! We want to present for you in 2018!

By the end of December, we at SQLskills will have presented remotely (and a few in-person) to more than 100 user groups and PASS virtual chapters around the world in 2017!

We’d love to present remotely for your user group in 2018, anywhere in the world. It’s not feasible for us to travel to user groups or SQL Saturdays unless we’re already in that particular city, but remote presentations are easy to do and are becoming more and more popular. We haven’t had any bandwidth problems doing remote presentations in 2017 to groups as far away as South Africa, Australia, and New Zealand, plus Norway, Canada, the UK, Poland, Belgium, Brazil, the Czech Republic, and Ireland. This way we can spread the community love around user groups everywhere that we wouldn’t usually get to in person.

Note: we have our own Webex accounts which we generally use, or we can use your GoToMeeting or Webex, but you must use computer audio – we won’t call in by phone as the sound quality is too poor. We also will not use Skype/Lync as we’ve had too many problems with it around user group laptops and sound.

So, calling all user group leaders! If you’d like one of us (me, Kimberly, Jon, Erin, Glenn, Tim) to present remotely for you in 2018 (or maybe even multiple times), send me an email and be sure to include:

  • Details of which user group you represent (and if sending from a shared user group account, your name)
  • The usual day of the month, meeting time, and timezone of the user group
  • Which months you have available, starting in January 2018 (a list of available dates would be ideal)
  • Whether you’d like just one presentation or multiple

And I’ll let you know who’s available with what topics so you can pick. We have around 20 topics across the team that we can present on.

What’s the catch? There is no catch. We’re just continuing our community involvement next year and we all love presenting :-)

There’s no deadline for this – send me an email at any time and we’ll see what we can do.

We’re really looking forward to engaging with you all!

Cheers

PS By all means pass the word on to any SharePoint and .Net user group leaders you know too.

New Pluralsight course: Understanding, Configuring, and Troubleshooting Log Shipping

Glenn’s latest Pluralsight course has been published – SQL Server: Understanding, Configuring, and Troubleshooting Log Shipping. It’s 90 minutes long, and from the course description:

SQL Server log shipping is a very useful technology for high availability and disaster recovery, as well as for general-purpose database movement, migration, and upgrades. In this course, SQL Server: Understanding, Configuring, and Troubleshooting Log Shipping, you’ll first learn the basics of log shipping, its capabilities, and its limitations. Next, you’ll learn how to prepare an environment for log shipping, and then how to configure log shipping using either SQL Server Management Studio or T-SQL scripts. You’ll learn how to monitor a log shipping environment and troubleshoot common problems, and finally you’ll learn how to use log shipping for data migration and for upgrading a database to a new version of SQL Server. When you’ve finished this course, you’ll have the skills and knowledge to start configuring, using, and troubleshooting log shipping.
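
Under the covers, log shipping is just an automated backup/copy/restore loop that the feature wraps in SQL Agent jobs; here’s a hedged manual sketch, with all names and paths as placeholders:

-- On the primary server:
BACKUP LOG [YourDB] TO DISK = N'\\FileServer\LogShip\YourDB_0001.trn' WITH INIT;
-- On the secondary server, after the file has been copied across:
RESTORE LOG [YourDB] FROM DISK = N'D:\LogShip\YourDB_0001.trn'
WITH NORECOVERY; -- leaves the database able to accept the next log restore
-- or use WITH STANDBY = N'D:\LogShip\YourDB_undo.dat' for read-only access between restores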

The modules are:

  • Introduction
  • Log Shipping Overview
  • Log Shipping Preparation Tasks
  • Configuring Log Shipping
  • Monitoring and Troubleshooting Log Shipping
  • Migrating Data Using Log Shipping

Check it out here.

We now have more than 165 hours of SQLskills online training available (see all our 55 courses here), all for as little as $29/month through Pluralsight (including more than 6,000 other developer and IT training courses). That’s unbeatable value that you can’t afford to ignore.

Enjoy!

New class: IEBISec on Securing Your BI Platform

We’ve added a second brand-new class to our Spring line-up in Chicago: Immersion Event on Securing Your BI Platform.

It’s a 2-day class, taught by MVP and industry expert Stacia Varga, presented on May 3-4, 2018.

By the end of this class, you’ll understand the relationship across the security settings not only in the BI tools, but also the back-end databases and the Windows operating system. As the class examines potential security issues, you’ll learn how to build a security action plan for Integration Services, Analysis Services, and Reporting Services. Is Azure in your environment? The class also covers important aspects of cloud security for a BI solution. Come learn how to preserve and protect your data effectively from the inside out.

Target audience:

  • DBAs who need to understand and mitigate the security risks of the BI components in the data platform
  • BI architects and developers who need to confidently design and build secure BI solutions

The modules are as follows:

  • Introducing Security Principles
  • Reviewing Integration Services Security
  • Reviewing Analysis Services Security
  • Reviewing Reporting Services Security
  • Azure BI Security
  • Building a Security Action Plan

You can read a more detailed curriculum here and all the class registration and logistical details are here.

We hope to see you there!

New class: IEBIStrat on Developing a BI and Analytics Strategy

We’ve added a brand-new class to our Spring line-up in Chicago: Immersion Event on Developing a BI and Analytics Strategy.

It’s a 3-day class, taught by MVP and industry expert Stacia Varga, presented on April 30 – May 2, 2018.

By the end of the class, you’ll have some new ideas and inspiration to get started on your own BI and analytics roadmap, and you’ll understand which technologies and skills are needed to build a foundation for your organization’s next-generation BI, including:

  • How the pieces fit together to create a modern analytics architecture
  • How to select the tools that best suit your organization’s analytical requirements
  • How to prepare yourself and/or your staff to implement these tools

Target audience:

  • DBAs who need to support a technical infrastructure for BI, analytics, and data science
  • Managers who need a better understanding of how these technologies fit together to manage data as a strategic asset
  • IT professionals with a data background who want to learn how the BI space is evolving and want to prepare for the future

The modules are as follows:

  • Understanding Traditional vs. Modern BI
  • Establishing a Maturity Baseline
  • Ingesting Data
  • Cataloging Data
  • Preparing Data for Analysis
  • Storing Data for Analysis
  • Analyzing Data
  • Publishing Data
  • Consuming Data
  • Preparing Your Roadmap

You can read a more detailed curriculum here and all the class registration and logistical details are here.

We hope to see you there!

SQLskills SQL101: How can corruptions disappear?

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Every so often I get an email question from a confused DBA: why is it that sometimes corruptions seem to disappear?

The situation is commonly as follows:

  • There is a regular SQL Agent job that runs DBCC CHECKDB
  • One morning the DBA finds that the job failed, reporting corruptions in one of the databases
  • The DBA runs DBCC CHECKDB on that database again, but this time there are no reported corruptions

This can lead the DBA to mistrust DBCC CHECKDB. Remember the SQL Server 2000 days, when DBCC CHECKDB occasionally reported corruptions where there weren’t any? Those days are long gone: if DBCC CHECKDB reports corruption, then at the time it ran there was definitely corruption.

Think about what DBCC CHECKDB is doing: it reads and processes all the allocated pages in the database – all the pages that are part of tables and indexes at the time that DBCC CHECKDB runs. It doesn’t check all the pages in the data files, only those that are currently in use. The pages that are not currently allocated to an object cannot be checked, as there’s no ‘page history’ maintained: there’s no way for DBCC CHECKDB to tell whether they’ve ever been used, and since they’re not currently allocated there’s no valid page structure on them and nothing to verify.

And if your database is still being accessed then the set of allocated pages can change after DBCC CHECKDB runs. A simple example of this occurring is:

  • Nonclustered index X of table Y has some corrupt pages in it, which DBCC CHECKDB (run by a SQL Agent job) reports
  • Another SQL Agent job runs and performs index maintenance where it rebuilds index X (the rebuild operation always builds a new index and then drops the old index)
  • The DBA runs DBCC CHECKDB manually and there are no corruptions reported in the new index structure

Nonclustered index corruption is the best kind of corruption to have. The rebuild operation rewrote the index to a new set of pages and deallocated the pages that had corruption. When DBCC CHECKDB is run manually, those new pages are not corrupt and the old pages are not checked, as they are no longer in use.
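
As a concrete (hypothetical) illustration of the second step in the list above:

ALTER INDEX [X] ON [dbo].[Y] REBUILD; -- builds a complete new copy of the index, then
-- deallocates the old pages, including the corrupt ones, so a subsequent
-- DBCC CHECKDB has no corrupt allocated pages left to read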

These kinds of ‘disappearing’ corruptions are a problem because it’s almost impossible to investigate them further. However, they could indicate a problem with your I/O subsystem. If you find that they’re occurring repeatedly, consider temporarily disabling the process (index maintenance, in the example above) that causes the corrupt pages to be deallocated, so you can investigate the corruption.

Another cause of disappearing corruptions can be transient I/O subsystem problems, where page reads sometimes fail outright and then succeed after that. Take a look at these blog posts on read-retry and Agent alerts for more information.

And yet one more cause could be that the database is mirrored or is part of an availability group and the page was fixed by automatic page repair before the second DBCC CHECKDB occurred. You can look in the msdb.dbo.suspect_pages table (more details here) for an entry for the broken page(s) with event_type of 4.
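
Here’s a hedged sketch of that check, filtering on event_type 4 per the above:

SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
WHERE event_type = 4; -- the event type described above for automatic page repair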

Bottom line: From SQL Server 2005 onward, if DBCC CHECKDB reports corruption, then at the time it ran there definitely was corruption. Make sure you don’t just ignore the problem; the next time the corruption occurs, you may not be so ‘lucky’ that it just seems to disappear.

SQLskills SQL101: Should you kill that long-running transaction?

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

One of the problems you may have encountered is a transaction log file growing because of a long-running query.

What do you do?

The temptation is to kill the query, but is that the correct approach? Well, as always, the answer starts with ‘it depends’.

A long-running query, no matter how much work it’s done (and hence how many log records it has generated), will prevent the log from clearing, as the log all the way back to the LOP_BEGIN_XACT log record of the long-running transaction is required in case that transaction rolls back. And the log will not be able to clear until (at least) that long-running transaction has committed or finished rolling back.
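
You can confirm this is what’s happening from sys.databases; a long-running transaction pinning the log shows up as ACTIVE_TRANSACTION (the database name is a placeholder):

SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'YourDB'; -- ACTIVE_TRANSACTION = an open transaction is preventing log clearing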

You can tell how many log records a transaction has generated using my script here, along with the total space taken up in the log by the transaction.
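
If you don’t have the script to hand, a similar (if less detailed) view is available from a documented DMV; this is a sketch, not the linked script itself:

SELECT transaction_id,
    database_transaction_log_record_count AS [log records generated],
    database_transaction_log_bytes_used AS [log bytes used],
    database_transaction_log_bytes_reserved AS [log bytes reserved for rollback]
FROM sys.dm_tran_database_transactions
WHERE database_id = DB_ID (); -- transactions active in the current database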

If the long-running query has generated hardly any log records, then killing it will mean that it rolls back quickly, and then hopefully the next log backup (in the full and bulk-logged recovery models) or checkpoint (in the simple recovery model) will allow the log to clear and stop its growth.

However, if the long-running query has generated a *lot* of log records, then it’s going to take a long time to roll back (as rolling back each log record means generating the ‘anti-operation’ for that log record, making the change, and generating *another* log record describing the change). That rollback itself won’t cause the log to grow any more, as a transaction always reserves free space in the log to allow it to roll back without requiring log growth (my script above also tells you that amount of space). However, as it will take a long time to roll back, other activity in the log from other transactions will likely cause the log to grow more until it’s finally able to clear.

So it may actually be better to allow a long-running transaction that’s generated a lot of log records to continue running until it completes. If the time to completion is going to be a lot less than the time to roll back, this could mean less overall extra log growth until you’re able to finally clear the log, and then potentially resize it back to normal, and continue running.

The trick is knowing what the query is doing and/or being able to figure out how close to completion it is. You could look at the logical_reads column in the DMV sys.dm_exec_requests and correlate that with the number of pages in the index or table being scanned, or look at the number of log records generated in the script output and correlate that with the number of records you’d expect an UPDATE statement to affect.
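
For example, here’s a hedged sketch of that correlation; the session ID, table, and index are hypothetical:

SELECT session_id, logical_reads, start_time
FROM sys.dm_exec_requests
WHERE session_id = 53; -- the long-running query's session

SELECT SUM (used_page_count) AS [pages in index]
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID (N'dbo.Y')
AND index_id = 1; -- compare logical_reads against this page count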

Bottom line: don’t always knee-jerk and decide to cancel a problematic, long-running query without thinking about what the effect of that will be.