Incomplete checkpoints and recovery

Back in 2009 I blogged about how checkpoints work (see How do checkpoints work and what gets logged) and on Monday I received an email question that I thought would make a good little blog post.

The question is (paraphrased): What happens if a checkpoint starts but doesn’t finish before a crash occurs? Will that checkpoint be used for crash recovery?

The answer is no, it won’t. Now if I left it at that, that really would be a little blog post, so let me explain my answer :-)

The purpose of a checkpoint is to bring the pages in the data files up to date with what’s in the transaction log. When a checkpoint ends, there’s a guarantee that as of the LSN of the LOP_BEGIN_CKPT log record, all changes from log records before that point are persisted in the data files on disk. There’s no guarantee about logged changes after that point, only before it. In other words, all the log records before the LSN of the LOP_BEGIN_CKPT log record are no longer required for crash recovery, unless there’s a long-running transaction that started before that LSN.

When the checkpoint ends, the boot page of the database (page 9 in file 1 – see here for some more info) is updated with the beginning LSN of the checkpoint (and then, if the database is in the SIMPLE recovery model, any log clearing/truncation can occur).
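
If you’re curious, you can see that checkpoint LSN for yourself by dumping the boot page with the undocumented DBCC DBINFO or DBCC PAGE commands and looking for the dbi_checkptLSN field. A minimal sketch, best run on a test system, with YourDatabase as a placeholder name:

DBCC TRACEON (3604);              -- send DBCC output to the results pane
DBCC DBINFO (N'YourDatabase');    -- look for the dbi_checkptLSN field
GO
-- Or dump the boot page (page 9 in file 1) directly:
DBCC PAGE (N'YourDatabase', 1, 9, 3);
GO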

So if a checkpoint started but didn’t end before a crash, its LSN would not be in the boot page and so crash recovery would start from the previous checkpoint. This is good, because an incomplete checkpoint means there’s no guarantee about which logged changes are persisted in the data files, and so crash recovery wouldn’t work correctly if it started from the beginning of the incomplete checkpoint.

A corollary question could be: How does SQL Server guarantee that there’s always one complete checkpoint in the active portion of the log, in case a crash occurs?

The answer is that log clearing/truncation of a VLF containing an LOP_BEGIN_CKPT log record cannot happen until another complete checkpoint has occurred. In other words, a complete checkpoint has to occur since the last log clearing/truncation before the next one can happen. If a checkpoint hasn’t occurred, the log_reuse_wait_desc for the database in sys.databases will return CHECKPOINT. It’s not common to see this, but you might see it if there’s a very long-running checkpoint (e.g. a very large update on a system with a slow I/O subsystem, so the flushing of data file pages takes a long time) and very frequent log backups, such that two log backups occur over the time taken by a single checkpoint operation. It can also happen if you’ve set the sp_configure recovery interval option higher than the default.
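
If you want to check this yourself, you can query sys.databases for the log_reuse_wait_desc and look at the recovery interval setting with sp_configure – a quick sketch, where YourDatabase is a placeholder name and recovery interval is an advanced option so it has to be made visible first:

SELECT [name], [log_reuse_wait_desc]
FROM sys.databases
WHERE [name] = N'YourDatabase';
GO

EXEC sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure N'recovery interval (min)';   -- 0 is the default
GO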

Interesting, eh?

A SQL Server DBA myth a day: (15/30) checkpoint only writes pages from committed transactions

(Look in the Misconceptions blog category for the rest of the month’s posts and check out the 60-page PDF with all the myths and misconceptions blog posts collected together when you join our Insider list, plus my online Myths and Misconceptions training course on Pluralsight.)

My DBA Mythbusters session went *really* well at SQL Connections yesterday, and now I’m halfway through my debunking month!

Myth #15: checkpoint only writes pages from committed transactions.

FALSE

This myth has persisted for *ages* and is linked to a misunderstanding of how the overall logging and recovery system works. A checkpoint always writes out all pages that have changed (known as being marked dirty) since the last checkpoint, or since the page was read in from disk. It doesn’t matter whether the transaction that changed a page has committed or not – the page is written to disk regardless. The only exception is tempdb, where data pages are not written to disk as part of a checkpoint. For deeper information, see How do checkpoints work and what gets logged.
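
If you want to see the dirty pages for yourself, you can count them per database using the sys.dm_os_buffer_descriptors DMV, before and after issuing a CHECKPOINT (a minimal sketch – the DMV returns a row for every page in the buffer pool, so it can be expensive on a server with lots of memory):

SELECT DB_NAME ([database_id]) AS [Database], COUNT (*) AS [DirtyPages]
FROM sys.dm_os_buffer_descriptors
WHERE [is_modified] = 1
GROUP BY [database_id]
ORDER BY [DirtyPages] DESC;
GO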

You can watch what a checkpoint operation is doing using a few trace flags:

  • 3502: writes to the error log when a checkpoint starts and finishes
  • 3504: writes information to the error log about what is written to disk

To use these trace flags, you must enable them for all threads using DBCC TRACEON (3502, 3504, -1), otherwise you won’t see any output.

Here’s a quick script that proves that dirty pages from uncommitted transactions are written out during a checkpoint. First the setup:

CREATE DATABASE [CheckpointTest];
GO
USE [CheckpointTest];
GO
CREATE TABLE [t1] ([c1] INT IDENTITY, [c2] CHAR (8000) DEFAULT 'a');
CREATE CLUSTERED INDEX [t1c1] ON [t1] ([c1]);
GO

SET NOCOUNT ON;
GO

CHECKPOINT;
GO

DBCC TRACEON (3502, 3504, -1);
GO

And now an uncommitted transaction that causes 10MB of pages to be dirtied, followed by a checkpoint:

BEGIN TRAN;
GO
INSERT INTO [t1] DEFAULT VALUES;
GO 1280

CHECKPOINT;
GO

And in the error log I see:

2010-04-15 13:31:25.09 spid52      DBCC TRACEON 3502, server process ID (SPID) 52. This is an informational message only; no user action is required.
2010-04-15 13:31:25.09 spid52      DBCC TRACEON 3504, server process ID (SPID) 52. This is an informational message only; no user action is required.
2010-04-15 13:31:25.09 spid52      Ckpt dbid 8 started (0)
2010-04-15 13:31:25.09 spid52      About to log Checkpoint begin.
2010-04-15 13:31:25.09 spid52      Ckpt dbid 8 phase 1 ended (0)
2010-04-15 13:31:25.71 spid52      FlushCache: cleaned up 1297 bufs with 50 writes in 625 ms (avoided 0 new dirty bufs)
2010-04-15 13:31:25.71 spid52                  average throughput:  16.21 MB/sec, I/O saturation: 70
2010-04-15 13:31:25.71 spid52                  last target outstanding: 2
2010-04-15 13:31:25.71 spid52      About to log Checkpoint end.
2010-04-15 13:31:25.71 spid52      Ckpt dbid 8 complete

Clearly all the pages were written out, even though the transaction had not committed.
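
To tidy up after the demo (assuming nothing else is using the database), roll back the open transaction, switch the trace flags off, and drop the database:

ROLLBACK TRAN;
GO
DBCC TRACEOFF (3502, 3504, -1);
GO
USE [master];
GO
DROP DATABASE [CheckpointTest];
GO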

Interesting case of watching log file growth during a perf test

I'm running some performance tests on the hardware we have (more details on the first of these tomorrow) and I was surprised to see some explosive transaction log growth while running in the SIMPLE recovery model with single-row insert transactions!

Without spoiling tomorrow's thunder too much, I've got a setup with varying numbers of connections populating a 1TB table with default values, with the single data and log files on the same 8-drive RAID-10 array (again more details on why tomorrow). I was looking at perfmon while the test was running with 128 concurrent connections, and noticed some weird-looking behavior – basically what looked like a never-ending checkpoint occurring. So I dug deeper and discovered the transaction log had grown over 7GB since the start of the test. Huh?
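
For context, each test connection is doing something conceptually like the following – a simplified, hypothetical approximation of my harness, where the PerfTest database and BigTable table names are placeholders and the real test runs for a fixed duration with 128 connections doing this concurrently:

USE [PerfTest];
GO
-- Each single-row insert is its own implicit transaction, committed immediately.
-- (The real harness runs for a set time; this loop is just illustrative.)
WHILE 1 = 1
BEGIN
    INSERT INTO [BigTable] DEFAULT VALUES;
END;
GO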

I restarted the test from scratch, with a 1TB data file and a 256MB log and watched what happened in perfmon; here's the screenshot of the first 90 seconds:

[Perfmon screenshot: the first 90 seconds of the test – log growths (black), pages allocated/sec (bright blue), disk write bytes/sec (green), and checkpoint activity (pink)]

This is *so* interesting. The black line is the number of log growths, so you can see the log grows every time the line level goes up. The bright blue line is the number of pages being allocated per second to hold all the table rows my 128 connections are inserting. You can clearly see that every time there's a log growth, the allocations take a nose-dive – because no transactions can commit while the new portion of the log is being zeroed (remember that instant file initialization does not apply to the log – see this blog post). The green line (disk write bytes per second) spikes when the log grows because of the zeroes being written out to the log file by SQL Server (remember that SQL Server does the zeroing).
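
As an aside, if you want to see the zeroing happening on your own (test) system, trace flags 3004 and 3605 make SQL Server write messages about file zeroing operations to the error log – a quick sketch:

DBCC TRACEON (3004, 3605, -1);   -- 3004: report file zeroing, 3605: send the output to the error log
GO
-- Now let the log auto-grow (or grow it manually) and then read the error log:
EXEC sp_readerrorlog;
GO
DBCC TRACEOFF (3004, 3605, -1);
GO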

But why is the log growing in the first place? I'm in the SIMPLE recovery model and doing single-row implicit transactions (yes, I'm deliberately doing this) so there shouldn't be anything stopping the log from clearing during a checkpoint, right?

Wrong. The log starts out small (256MB) so one of the thresholds for triggering a checkpoint gets hit pretty fast (70% of the log file is used). So a checkpoint occurs (you can see the checkpoints occurring when the pink line at the bottom of the perfmon screen is above zero), and starts writing out to the data file, which is on the same disk as the log (see How do checkpoints work and what gets logged for an explanation of checkpoint mechanics), but it can't write fast enough (because of disk contention) to get to the log-clearing part before the log fills up completely and has to grow (because transactions are continuing at break-neck speed). So the log grows, and the insert transactions stop while the log is zeroed. And then the log starts to fill up again very quickly and another checkpoint is triggered, and so on and so on.
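
If you want to watch the log filling and growing from inside SQL Server rather than perfmon, a couple of simple checks work (a sketch, using the same placeholder database name as above):

DBCC SQLPERF (LOGSPACE);                    -- log size and percentage used, for every database
GO
USE [PerfTest];
GO
SELECT [name], [size] / 128 AS [SizeMB]     -- size is in 8KB pages, so /128 gives MB
FROM sys.database_files
WHERE [type_desc] = N'LOG';
GO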

Eventually a steady state is reached where there's enough free log space during a checkpoint that no new log is required for the concurrent transactions to commit. You might ask why the checkpoint is so slow in the first place. Because I deliberately put the data file on the same RAID array as the log file, and both are being steadily hammered with writes – classic disk contention. Even though the RAID array is RAID-10 with 8x300GB 15k SCSI drives, the average disk queue write length is over 20 most of the time during the 128-way test because I'm simply trying to do too much.

The point of my blog post? Just because you don't have any of the classic causes of transaction log growth going on doesn't mean you're going to be immune. In this case my (deliberate) poor physical layout of the database files, plus workload growth up to 128 concurrent connections, caused the log to grow. What started out working when I was running 16 connections didn't work any more at 128 (actually, I went back and re-ran some of the earlier tests, and even with only 64 connections the log grew to over 1GB before reaching steady state).

Interesting, eh?

Tomorrow I'll be continuing the perf/benchmarking series by creating my first benchmark and then tweaking the setup to see how I can improve performance (for instance with multiple data files, separation of log and data files – all the things I preach but have never *demonstrated*), but this behavior merited a post all on its own.