The Accidental DBA (Day 30 of 30): Troubleshooting: Transaction Log Growth

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

For the final post in our 30-day series I want to cover one of the most common problems for Accidental DBAs to come across – uncontrolled transaction log growth.

I remember an email we were sent a few years ago, which basically said: I’ve got a 10-GB database and a 987-GB transaction log. What happened?!?!

The number 1 cause of uncontrolled transaction log growth is the following:

  1. Someone creates a database where the default recovery model is Full, or someone changes the recovery model of a database to Full (remember, I explained recovery models back on day 7 of the series).
    • Even though the recovery model is Full, the database behaves as if it’s using the Simple recovery model, where the transaction log will automatically clear (to allow reuse and help prevent growth) whenever a checkpoint occurs. This is called the pseudo-Simple recovery model. This situation continues until a database backup is performed.
  2. Someone performs a database backup, completing the transition of the database into the Full recovery model.
    • This database backup provides a basis for further log backups to be performed, and so the log will no longer clear until a log backup is performed.
  3. Nobody performs log backups.

As no-one is performing log backups, and the database is properly using the Full recovery model, the log will never clear. As the log doesn’t clear, it will eventually fill up and then require more space for more log records, and so it will automatically grow. And grow, and grow, until the disk volume runs out of space.

The trick here is simple: perform regular log backups of databases using the Full (or Bulk-Logged) recovery model.
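As a quick illustrative sketch (the database name and backup path here are hypothetical, and you'd typically schedule this in a SQL Agent job at a frequency driven by your data-loss requirements), a log backup is just:

-- Hypothetical database name and backup path.
BACKUP LOG [SalesDB]
TO DISK = N'D:\SQLBackups\SalesDB_log_20130630_1200.trn'
WITH CHECKSUM;
GO

Each log backup should go to a uniquely named file so the unbroken chain of log backups is preserved.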

This is all well and good as long as missing log backups are indeed the problem, but the log can also be prevented from clearing for other reasons – you may be taking regular log backups and the log is still growing because something else is stopping it from clearing. SQL Server knows why the log can't clear, and you can ask it using the following code:

SELECT [name], [log_reuse_wait_desc] FROM sys.databases;
GO

This will list the reasons why the log could not clear, for all databases, at the time the most recent log clearing was attempted. In the case I described above, the output for the database would be LOG_BACKUP, and there are a bunch of other reasons, including long-running transactions, unreplicated transactions, active data backups, and more. You can read the full list in SQL Server Books Online topic Factors That Can Delay Log Truncation.

Whatever the reason is that has made your database’s log grow, take corrective action to allow the log to clear, and then take preventative action to stop the log from having to grow again. And if your log has already grown really large, you’ll want to shrink it down to a more reasonable size, and you can read more details on how to do that in Kimberly’s (old but still relevant) post here.
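As a rough sketch of the shrink itself (the database name, logical log file name, and target size here are hypothetical – check sys.database_files for the real logical name, and see Kimberly's post for choosing a sensible target):

-- Hypothetical database name, logical log file name, and target size (in MB).
USE [SalesDB];
GO
DBCC SHRINKFILE (N'SalesDB_log', 1024);
GO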

Apart from the risk of the log growing until the disk runs out of space, log growth is problematic because:

  • Every time the log grows, the new portion of the log has to be zero initialized, which takes some time and causes the write operations on the database to pause.
  • Every time the log grows, more internal chunks are added to the log, and the more of these there are, the more performance impact there is on log operations (Kimberly’s post discusses this too).
  • As the log gets larger, the potential for more downtime during a disaster increases. If the entire database is destroyed, restoring from backups involves recreating the data and log files. The larger the log file is, the longer this takes because of zero initialization again.

Careful transaction log management is required as part of being responsible for a SQL Server instance, whether as an Accidental or experienced DBA. You can learn much more about the transaction log in my online training course SQL Server: Logging, Recovery, and the Transaction Log.

Finally, we really hope you’ve enjoyed reading our post-a-day series during June as much as we’ve enjoyed writing it, and we also hope you’ve learned a bunch of information that will help you increase your DBA skills. To continue your DBA education, make sure to follow our blogs, subscribe to our bi-weekly SQLskills Insider newsletter, check out our online training, and consider coming to one of our in-person training classes – for the ultimate learning experience!

Thanks – Paul & the SQLskills team

The Accidental DBA (Day 27 of 30): Troubleshooting: Tempdb Contention

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

One of the most common performance problems that exists in SQL Server instances across the world is known as tempdb contention. What does that mean?

Tempdb contention refers to contention between concurrent threads trying to access the same in-memory allocation pages in tempdb; it has nothing to do with I/O.

Consider the scenario of hundreds of concurrent queries that all create, use, and then drop small temporary tables (that by their very nature are always stored in tempdb). Each time a temp table is created, a data page must be allocated, plus an allocation metadata page to keep track of the data pages allocated to the table. This requires making a note in an allocation page (called a PFS page – see here for in-depth info) that those two pages have been allocated in the database. When the temp table is dropped, those pages are deallocated, and they must be marked as such in that PFS page again. Only one thread at a time can be changing the allocation page, making it a hotspot and slowing down the overall workload.

Back in SQL Server 2005, my dev team at Microsoft implemented a small cache of temp tables, to try to reduce this contention point, but it’s only a small cache, so it’s very common for this contention to be an issue, even today.

What’s really interesting though, is that many people don’t realize they have this problem – even seasoned DBAs. It’s really easy to figure out whether you have this kind of problem using the sys.dm_os_waiting_tasks DMV. If you run the query I have below, you’ll get an idea of where the various threads on your server are waiting, as Erin discussed earlier this month.

SELECT
    [owt].[session_id],
    [owt].[exec_context_id],
    [owt].[wait_duration_ms],
    [owt].[wait_type],
    [owt].[blocking_session_id],
    [owt].[resource_description],
    CASE [owt].[wait_type]
        WHEN N'CXPACKET' THEN
            RIGHT ([owt].[resource_description],
            CHARINDEX (N'=', REVERSE ([owt].[resource_description])) - 1)
        ELSE NULL
    END AS [Node ID],
    [es].[program_name],
    [est].[text],
    [er].[database_id],
    [eqp].[query_plan],
    [er].[cpu_time]
FROM sys.dm_os_waiting_tasks [owt]
INNER JOIN sys.dm_exec_sessions [es] ON
    [owt].[session_id] = [es].[session_id]
INNER JOIN sys.dm_exec_requests [er] ON
    [es].[session_id] = [er].[session_id]
OUTER APPLY sys.dm_exec_sql_text ([er].[sql_handle]) [est]
OUTER APPLY sys.dm_exec_query_plan ([er].[plan_handle]) [eqp]
WHERE
    [es].[is_user_process] = 1
ORDER BY
    [owt].[session_id],
    [owt].[exec_context_id];
GO


If you see a lot of lines of output where the wait_type is PAGELATCH_UP or PAGELATCH_EX, and the resource_description is 2:1:1 then that’s the PFS page (database ID 2 – tempdb, file ID 1, page ID 1), and if you see 2:1:3 then that’s another allocation page called an SGAM (more info here).
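If you just want a quick tally of tasks waiting on pages in tempdb, a cut-down sketch along the following lines works too (it's simply a narrower version of the query above, not a replacement for it):

-- Count tasks currently waiting on page latches in tempdb (database ID 2).
SELECT [wait_type], [resource_description], COUNT (*) AS [waiting_tasks]
FROM sys.dm_os_waiting_tasks
WHERE [wait_type] LIKE N'PAGELATCH%'
    AND [resource_description] LIKE N'2:%'
GROUP BY [wait_type], [resource_description];
GO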

There are three things you can do to alleviate this kind of contention and increase the throughput of the overall workload:

  1. Stop using temp tables
  2. Enable trace flag 1118 as a start-up trace flag
  3. Create multiple tempdb data files

Ok – so #1 is much easier said than done, but it does solve that problem :-) Seriously though, you might find that temp tables are a design pattern in your environment because they made a query go faster once, and then everyone started using them whether or not they’re *really* needed for performance and throughput. That’s a whole other topic though, and outside the scope of this post.

#2 prevents contention on the SGAM pages by slightly changing the allocation algorithm used. There is no downside to having this enabled, and I’d even say that all SQL Server instances across the world should have this trace flag enabled by default (and I said the same thing when I ran the dev team that owned the allocation code in the SQL Server Storage Engine).
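The permanent way to enable it is to add -T1118 to the instance's startup parameters using SQL Server Configuration Manager and then restart. As a sketch, you can also turn it on for the running instance (until the next restart) and verify it like this:

-- Enable trace flag 1118 globally until the next restart, then check its status.
DBCC TRACEON (1118, -1);
DBCC TRACESTATUS (1118);
GO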

#3 will help to remove the PFS page contention, by spreading the allocation workload over multiple files, thus reducing contention on the individual, per-file PFS pages. But how many data files should you create?

The best guidance I’ve seen is from a great friend of mine, Bob Ward, who’s the top Escalation Engineer in Microsoft SQL Product Support. Figure out the number of logical processor cores you have (e.g. two CPUs, each with 4 physical cores, plus hyperthreading enabled = 2 (CPUs) x 4 (cores) x 2 (hyperthreading) = 16 logical cores). If you have 8 or fewer logical cores, create the same number of data files as logical cores. If you have more than 8 logical cores, create 8 data files and then add more in chunks of 4 if you still see PFS contention. Make sure all the tempdb data files are the same size too. (This advice is now official Microsoft guidance in KB article 2154845.)
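As an illustrative sketch for a server with four logical cores (the file names, path, and sizes here are hypothetical – match the size and autogrowth settings of your existing tempdb data file):

-- Hypothetical file names, path, and sizes: add three more tempdb data files,
-- all the same size as the existing one.
ALTER DATABASE [tempdb] ADD FILE
    (NAME = N'tempdev2', FILENAME = N'T:\TempDB\tempdev2.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
ALTER DATABASE [tempdb] ADD FILE
    (NAME = N'tempdev3', FILENAME = N'T:\TempDB\tempdev3.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
ALTER DATABASE [tempdb] ADD FILE
    (NAME = N'tempdev4', FILENAME = N'T:\TempDB\tempdev4.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
GO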

Here’s an example screenshot of a 100-connection workload on tempdb.

[Screenshot: tempdb workload throughput with one data file vs. four data files]

The left-hand portion shows when I had a single tempdb data file and nearly all the connections were waiting for PAGELATCH_UP on 2:1:1. After I added three more tempdb data files and the workload stabilized, the throughput jumped, as you can clearly see.

This has just been an overview of this problem, but as I said, it’s very common. You can get more information from the links below:

The Accidental DBA (Day 11 of 30): Backups: Backup Storage and Retention

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

There are two very common mistakes that DBAs (accidental or otherwise) make around backups:

  1. Storing the only copy of the backups on the same I/O subsystem as the databases
  2. Only keeping the most recent full backup, and differential/log backups since then

Backup Storage

The worst place in the world to store your backups is on the same portion of the I/O subsystem as the databases themselves.

If that I/O subsystem has a problem that causes corruption in the databases, the odds are that the same problem will affect the backups. Similarly, if someone makes a mistake and accidentally re-formats the I/O subsystem, it will destroy not only the databases but the backups too.

How can you recover if your backups have also been corrupted or destroyed? You can’t. Or at least you can’t restore from your backups. I’ve heard of many cases where companies have lost a lot of money, business, respect, and jobs because of backups being co-located with databases.

At the very least, backups need to be stored on a different I/O subsystem from the databases, and on a different server. That way a localized failure will not damage the backups as well.

Even that’s not good enough though. What if the entire data center is damaged or shut down (think about catastrophic power loss without any backup power source)? If the only backups are stored onsite, recovery will not be possible. For this reason, it’s highly advisable to keep copies of your backups at an offsite location that’s easily accessible within a reasonable amount of time.

Note that I said ‘copies of your backups at an offsite location’. You shouldn’t have your backups stored only offsite, because most of the time a disaster is localized and the entire data center is not affected. You just need to cope with the case when it is. So you should have onsite and offsite copies of your backups, to allow for fast access to cope with localized problems as well as slower access when a more comprehensive disaster occurs.

A common backup storage strategy is to first back up to disk, either directly to a network share or to local storage that is then copied to a network share, and then to copy those backups to an onsite archive system as well as creating copies to ship offsite.
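As a sketch of that first step (the database name and paths are hypothetical, and note that MIRROR TO is an Enterprise Edition feature – on other editions, back up locally and copy the file to the network share with separate tooling):

-- Hypothetical database name and paths: write the backup locally and a second
-- copy to a network share in one operation.
BACKUP DATABASE [SalesDB]
TO DISK = N'D:\SQLBackups\SalesDB_Full_20130611.bak'
MIRROR TO DISK = N'\\BackupServer\SQLBackups\SalesDB_Full_20130611.bak'
WITH FORMAT, CHECKSUM;
GO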

Whatever your strategy becomes, just don’t have only a single, local copy of your backups, stored with the databases; that’s the worst possible way to do it.

Backup Retention

One of the issues you may have is that you only have limited onsite and offsite storage space for your backups – so how long do you keep your backups for?

Imagine a timeline of backups: F1 L1 L2 D1 L3 L4 D2 L5 L6 D3 L7 L8 F2 L9 L10 D4 L11 L12 D5 L13 L14 D6 L15 L16

You may wonder why you need to keep any backup prior to the most recent full backup, F2. Think about this – what if the F2 full backup is corrupt?

If you haven’t kept any backups prior to F2, then without the F2 backup to start your restore sequence, all the backups you have after F2 are completely useless.

If you can’t use the most recent full backup, the next best thing is the full backup you took before that, F1. To use that full backup, you’ll also need the most recent differential backup that was based on it, D3, plus log backups L7 and L8 to get you up to the point in time around F2, and then log backups L9 to L16 (because differential backups D4 to D6 are based on F2, not F1, and so cannot be used).
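To make that restore sequence concrete, here's a sketch using hypothetical file names (F1.bak, D3.bak, and so on correspond to the backups in the timeline above):

-- Hypothetical database and file names. Restore the older full backup, its most
-- recent differential, then every log backup from that point forward, and only
-- recover the database at the very end.
RESTORE DATABASE [SalesDB] FROM DISK = N'D:\SQLBackups\F1.bak' WITH NORECOVERY, REPLACE;
RESTORE DATABASE [SalesDB] FROM DISK = N'D:\SQLBackups\D3.bak' WITH NORECOVERY;
RESTORE LOG [SalesDB] FROM DISK = N'D:\SQLBackups\L7.trn' WITH NORECOVERY;
RESTORE LOG [SalesDB] FROM DISK = N'D:\SQLBackups\L8.trn' WITH NORECOVERY;
-- ... restore L9 through L15 in order, each WITH NORECOVERY ...
RESTORE LOG [SalesDB] FROM DISK = N'D:\SQLBackups\L16.trn' WITH RECOVERY;
GO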

As you can see, having a corrupt full backup also means having to restore a lot more backups, which will likely blow your downtime requirement, but that should be a rare case and one for which an exception can usually be made. You just need to make sure you have the older backups.

So, again, how long do you keep your backups for?

If I had a backup strategy that was weekly full backups, daily differential backups, and log backups every 30 minutes, I would try to keep at least a month’s worth of backups at all times onsite, with three months of backups stored offsite. Do you think that sounds excessive? Unless you’re testing a multitude of restore sequences every week using all the older backups, which is rarely possible, you need to have ‘defense in depth’ – multiple possible alternate restore strategies if multiple older backups prove to be damaged or invalid.

Summary

This was the last post in our Accidental DBA series about SQL Server backups. Over the last six posts I’ve explained:

  • What the downtime (RTO) and data loss (RPO) requirements are, and why they’re critical to know for your databases
  • The three recovery models that can be used, and how they affect the log and log backups
  • The three main types of backups (full, differential, and log) and what they’re used for
  • How to design a restore strategy that takes into account the RPO and RTO requirements, which then lets you decide what backups you need to perform
  • Commonly used BACKUP options and why they’re useful
  • The importance of validating backups to make sure that when you need them, they’re usable
  • The importance of storing multiple copies of backups, both onsite and offsite, so a disaster doesn’t result in complete data loss
  • Finally, the importance of retaining older backups so your disaster recovery strategy can cope with some backups being invalid

I hope this has given you a good grounding in the importance of having a good, well-tested backup strategy. Protecting the data from loss and damage is really one of the most important things that a DBA does.