In my previous posts on FILESTREAM I discussed the directory structure of the FILESTREAM data container and how to map the directories to database tables and columns. In this post I'm going to explain how and when the FILESTREAM garbage collection process works as that doesn't seem to be documented anywhere (even in the FILESTREAM whitepaper I wrote for MS - it wasn't supposed to be that low-level). There seems to be a lot of confusion about how updates of FILESTREAM data work, and when the old versions of the FILESTREAM files are removed. I'm going to explain how it all works and then show you by example.

The basic behavior that is non-intuitive is that there's no such thing as a partial update of FILESTREAM data. If you have 10MB of data stored in a FILESTREAM column (and hence have a 10MB FILESTREAM file), then updating even a single byte of it will result in a whole new 10MB FILESTREAM file. Anything that relies on having an up-to-date version of the database (e.g. log backups, log-shipping, replication) will pick up the entire new 10MB FILESTREAM file. Every time an update is made to that data, a new 10MB FILESTREAM file is created and then subsequently backed-up, replicated, etc. This can lead to unexpectedly large log backups, or network traffic between replication nodes.

Once you realize that new versions of the FILESTREAM files are going to be created, the obvious follow-on question is: when do the old versions get removed? The answer is: it depends! Smile

The old versions are removed by a process called garbage collection - in much the same way that memory garbage collection runs for managed code and deallocates object instantiations that are no longer referenced by any variables. The key point is that nothing needs the object instantiation any more; otherwise the memory garbage collection would be corrupting the run-time memory of the managed code application. The same principle applies for FILESTREAM garbage collection - the old versions of the FILESTREAM files cannot be removed until they are no longer needed.

But what does 'no longer needed' mean for FILESTREAM files? Well, it's kind of the same as for transaction log records. An old version of a FILESTREAM file is no longer needed if the transaction that created it has committed or rolled back, AND there are no other technologies that must read it, like a log backup (when running in the FULL or BULK_LOGGED recovery models), or the transactional replication log reader. In fact, the transaction log VLF containing the log record of the creation of the FILESTREAM data file must be switched to inactive before the FILESTREAM file can be garbage collected. Note that I don't mention database mirroring - in SQL 2008 database mirroring and FILESTREAM cannot be used together.

Once the old FILESTREAM file is no longer needed, it is available for garbage collection. How does the garbage collection process know which FILESTREAM files to physically delete? The answer is that when the file is no longer needed, an entry is made in a special table called a 'tombstone' table. The garbage collection process scans the tombstone tables and removes only the FILESTREAM files with an entry in the tombstone table. You can read more about the tombstone tables in this blog post from the CSS blog.

So when does the garbage collection process actually run? It can't be part of log backups, because in the SIMPLE recovery model, you can't take log backups. The answer is that it runs as part of the database checkpoint process. This is what causes some confusion - an old FILESTREAM file will not be removed until after it is no longer needed AND a checkpoint runs.

Now let's see this stuff in action. I'm going to create a database with FILESTREAM data in and then play around with transactions, log backups, and checkpoints to show you garbage collection working.

CREATE DATABASE FileStreamTestDB ON PRIMARY
    (NAME = FileStreamTestDB_data,
    FILENAME = N'C:\Metro Demos\FileStreamTestDB\FSTestDB_data.mdf'),
FILEGROUP FileStreamFileGroup CONTAINS FILESTREAM
    (NAME = FileStreamTestDBDocuments,
    FILENAME = N'C:\Metro Demos\FileStreamTestDB\Documents')
LOG ON
    (NAME = 'FileStreamTestDB_log',
    FILENAME = N'C:\Metro Demos\FileStreamTestDB\FSTestDB_log.ldf');
GO

USE FileStreamTestDB;
GO

CREATE TABLE FileStreamTest1 (
    DocId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
    DocName VARCHAR (25),
    Document VARBINARY(MAX) FILESTREAM);
GO

Now I'm going to put the database into the FULL recovery model and take a full database backup - which means I must now take log backups to manage the size of the transaction log. It also means that a FILESTREAM file cannot be removed until it has been backed up.

ALTER DATABASE FileStreamTestDB SET RECOVERY FULL;
GO
BACKUP DATABASE FileStreamTestDB TO DISK = 'C:\SQLskills\FSTDB.bak';
GO

Now I'm going to create some FILESTREAM data.

INSERT INTO FileStreamTest1 VALUES (
    NEWID (), 'Paul Randal',
    CAST ('SQLskills.com' AS VARBINARY(MAX)));
GO

Looking in the FILESTREAM data container I created, I have the following file:

 

Remember from the previous blog posts, that the FILESTREAM file filenames are the database log sequence number at the time they were created. Now I'll update the value in an implicit transaction (no BEGIN TRAN and COMMIT TRAN).

UPDATE FileStreamTest1
    SET Document = CAST (REPLICATE ('Paul', 2000) AS VARBINARY(MAX))
    WHERE DocName LIKE '%Randal%';
GO

and we now have the following files:

 

The new file is the 8KB file and the old FILESTREAM value is the 1KB file. If I try doing an explicit CHECKPOINT, nothing changes as the old file is still required as it hasn't yet been backed up. Now I'll do a log backup.

BACKUP LOG FileStreamTestDB TO DISK = 'C:\SQLskills\FSTB_log.bak';
GO

And the files are all still there. Although the first 1KB file is no longer needed, a checkpoint hasn't occurred yet, so garbage collection hasn't run. Now running an explicit CHECKPOINT, the directory still contains the two files. What happened? The transaction log VLF containing the log record for the creation of the FILESTREAM file is still active, so the file is still needed. I have to do *another* log backup and checkpoint before garbage collection kicks in (as that will cause the log to cycle, when there's nothing happening in the database and no active transactions) and the directory view changes to:

 

The alternative would have been to generate more log records, spilling into the next transaction log VLF, then do another log backup which would mark the 'creation' VLF inactive, and then the next checkpoint would run garbage collection on the file. This, of course, would be the normal course of events in a production database.

So, don't get confused if you update a FILESTREAM file, then do a log backup and checkpoint and nothing happens. Remember the transaction log has to have progressed enough for the 'creation' VLF to be inactive too. You can prove this to yourself by creating an explicit transaction at the same time as the FILESTREAM update (in another, implicit transaction). No matter how many times you backup the log and checkpoint the database, the garbage collection will not run until the explicit transaction is committed or rolled back, and then another log backup and checkpoint is run.

I'll leave it as a fun exercise for you to play around with updates in explicit transactions and various backup scenarios to see when garbage collection can and cannot remove old files, but now you know exactly how it works.

A short post to start the day (in India) and then at lunchtime I'll do how it works: FILESTREAM garbage collection.

This is a question that came up recently on an MVP discussion list: why does a 200MB log backup take 40 minutes to restore on my log-shipping secondary?

First answer I thought of was that the I/O subsystem on the log-shipping secondary was very slow, or broken in some way, so I/Os were taking ages to complete. Nope - I/O subsystem was performing well.

Next thing I suggested was the the log-shipping secondary was restoring the log backups using WITH STANDBY and there was something like an index rebuild that was being rolled back. As an aside, when you restore log backups on the log-shipping secondary, you have a choice how they recovery: either WITH NORECOVERY or WITH STANDBY. The first option doesn't allow any access to the database, as the database is still "in recovery". The second option runs the REDO part of recovery, then runs the UNDO part of recovery, but saves the details of what it had to do for UNDO into a file (who's name and location you specify). It then allows read-only access to the database, for queries/reporting/whatever. Obviously if there's a lot of transaction log that has to be undone (i.e. rolling back transactions that weren't committed at the time the log backup completed), then this could take some time. But 40 minutes? No. That wasn't it.

I was partly right on my second guess. The *previous* log backup that was restored WITH STANDBY contained a long-running index operation, and so the undo file that the restore created was *huge*. The next log backup that's restored after a RESTORE ... WITH STANDBY, must first undo everything in the undo file (i.e. put the database back to the exact state it was in as if the WITH STANDBY part of the restore never happened) before it can restore more transaction log. In this case, no-one had noticed that the previous log restore *also* took a lot of time and created the huge undo file. It just looked like the 200MB log backup was causing the problem.

My advice: in any kind of there's-suddenly-a-corruption-or-performance-problem situation, don't just look at what's happening now. Always look at what happened leading up to the problem, as the answer is often there.

Every so often (well, ok, very frequently) my internal list of 'I must blog about that' gets too big and I have an almost uncontrollable urge. Luckily, these urges end up with me blogging several times, otherwise who knows what kind of trouble I'd get into? Smile

First up today is some pretty deep internals about how the transaction log works in a specific situation. This has come up a few times in the last few internals/maintenance classes I've taught in the transaction log module so I wanted to get a blog post out there that proves my answer is correct. The question is easier to frame with a picture, borrowed from my TechNet Magazine article from February (see here).

  

The image shows the circular natue of the transaction log, and how VLFs can be overwritten once the log has wrapped around, as long as they are no longer active (red = active log/VLF, blue = inactive log/VLF). If none of this makes sense to you, go read the TechNet Magazine article and it should do.

The question is: what happens if log records continue being written to VLFs 1 and 2 in the picture, and eventually the end of VLF 2 has been reached, but VLF 3 is still active, and can't be overwritten? Does the database just stop?

Let's find out.

I'm going to create a database and then engineer the situation above.

CREATE DATABASE LogTestDB ON PRIMARY (
    NAME = LogTestDB_data,
    FILENAME = N'C:\SQLskills\LogTestDB_data.mdf')
LOG ON (
    NAME = 'LogTestDB_log',
    FILENAME = N'C:\SQLskills\LogTestDB_log.ldf',
    SIZE = 5MB);
GO

The database is in the SIMPLE recovery model. Well, to be completely honest, it's in the FULL recovery model, but there hasn't been a full database backup yet, so as far as transaction log behavior is concerned, it behaves as if it was in SIMPLE (the log can be cleared by a CHECKPOINT) - again, if this makes no sense, go read the TechNet Magazine article. 

And then see how many VLFs it has using the DBCC LOGINFO command (and this is the *only* way to see how many VLFs there are):

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ -----------
2        1245184    8192          31       2        64     0
2        1245184    1253376       0        0        0      0
2        1245184    2498560       0        0        0      0
2        1499136    3743744       0        0        0      0

We've got 4 VLFs. The Status column tells us whether the VLF is active or not. A status of 2 means the VLF is active, 0 means it's not. The sequence number (FSeqNo) is the logical order of the VLFs within the log. The FileSize is in bytes (so each VLF is about 1.25 MB). Right now there's only one active VLF.

Now I'll engineer the situation in the image above. I'm going to fill the log so that VLFs 1, 2, and 3 are active. Then I'm going to start an explicit transaction that will hold VLF 3 and onwards active. Then I'll continue filling the log so it wraps around and starts to fill up VLF 1 again.

USE LogTestDB;
GO
CREATE TABLE BigRows (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO

SET NOCOUNT ON;
INSERT INTO BigRows DEFAULT VALUES;
GO 300

I've filled up VLFs 1 and 2, and started on 3. Let's check with DBCC LOGINFO:

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ -----------
2        1245184    8192          31       2        64     0
2        1245184    1253376       32       2        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       0        0        0      0

As you can see, the first 3 VLFs now have a status of 2 (active), and they're in sequence. Now I'll create an explicit transaction that will prevent VLF 3 and onwards being cleared.

BEGIN TRAN
INSERT INTO BigRows DEFAULT VALUES;
GO

If I explicitly do a CHECKPOINT now, VLFs 1 and 2 will clear:

CHECKPOINT
GO

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ----------
2        1245184    8192          31       0        64     0
2        1245184    1253376       32       0        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       0        0        0      0

Now I'll carry on filling up the BigRows table so the log wraps around and starts filling up VLFs 1 and two again.

INSERT INTO BigRows DEFAULT VALUES;
GO 300

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ----------
2        1245184    8192          35       2        128    0
2        1245184    1253376       32       0        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       34       2        64     0

You can see that the log has wrapped around now, but VLFs 3 and 4 are still active. Look at the sequence numbers of the active VLFs... the active log is VLF 3 then 4 then 1, with sequence numbers 33 then 34 then 35. Now if I carry on filling up the table, what's going to happen when the log bumps up against VLF 3 that is still active? Will it stop or will it grow?

INSERT INTO BigRows DEFAULT VALUES;
GO 300

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ------------------
2        1245184    8192          35       2        128    0
2        1245184    1253376       36       2        128    0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       34       2        64     0
2        253952     5242880       37       2        64     36000000049300052
2        270336     5496832       38       2        64     36000000049300052
2        253952     5767168       0        0        0      36000000107500066
2        335872     6021120       0        0        0      36000000107500066
2        253952     6356992       0        0        0      36000000190700020
2        401408     6610944       0        0        0      36000000190700020
2        253952     7012352       0        0        0      37000000037300040
2        466944     7266304       0        0        0      37000000037300040

The answer is that it grew, and kind of skipped the active VLFs! Look at the sequence numbers. The new sequence of the active log is VLF 3 then 4 then 1 then 2 then 5 then 6, as you can see from the sequence numbers. Once the active transaction I created commits or rolls back, VLFs 3, 4, 1, and 2 can clear and then the 'normal' sequencing of VLFs in the log will resume.

Every VLF has a small header which contains the sequence number of the VLF within the transaction log, so the log can kind of do contortions to work around active VLFs in the middle of the log. Very cool.

Ok - that was fun - now I feel better!

PS In the last couple of DBCC LOGINFO dumps, where the log has wrapped around, you can see that the parity bits for the log blocks in the VLFs have changed, as I explained in one of my previous posts Search Engine Q&A #24: Why can't the transaction log use instant initialization?.

Theme design by Nukeation based on Jelle Druyts