Here's an interesting bug that was just recently fixed, and someone actually hit it today, so it's time to blog it.

Bottom line is that if you're doing NOLOCK scans of tables involving LOB data, the perfmance might suck. 2005 SP3 CU3 has the bug fix, but unfortunately the KB article was written by someone who didn't seem to know what they were talking about, so I suggested it was rewritten, which it has been (but not republished yet). The original KB article is at http://support.microsoft.com/kb/961049/. It's also fixed in 2008, but I don't know which build (I believe CU4 at least).

Here's my explanation of the problem.

LOB data can be stored in-row or off-row (my previous LOB post Importance of choosing the right LOB storage technique has more details). When it's stored off-row, it must be accessed by first reading the pointer to the LOB data from the data/index record, and then following the pointer (remember that 'record' is synonymous with 'row'). When a LOB data value is updated, the off-row value is updated first, and then if the off-row link changed, the data/index record is updated with the new link. There's obvioulsy a window here, where someone reading the data/index record might see the wrong/non-existent off-row pointer. This is exactly what can happen when a NOLOCK scan occurs.

To mitigate the possibility of a NOLOCK scan trying to follow a bad off-row link, the old behavior was to scan all the IAM pages (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units) for the table/index to make sure that the off-row link actually pointed to a page allocated to the table/index. If there are lots of IAM pages, this means lots of logical IOs, and poorly performing SELECT queries. And it does the scan once for *every row*. The person that hit it today had a 500 row select of ~20KB per row taking 20 seconds - 10MB of physical IOs and 30MB of logical IOs!

The fix is to make further use of an already existing in-memory cache of IAM pages to do a quick lookup of the right IAM page covering the GAM interval of the LOB page being read, without having to scan the whole IAM chain.

Hope this helps explain things, and track down perf problems for some of you.

(Thanks to David Baffaleuf for pointing this out to me)

Ever wanted to know a bit deeper how memory works in your laptop or servers? This comprehensive series of blog posts will give you a great overview. It's very clearly written and goes very deep. At times it talks about how things show up in the Linux OS, but all the hardware details and concepts (caches, NUMA, etc) translate exactly into our world.

It covers:

  • Part 1 (Introduction to memory)
  • Part 2 (CPU caches)
  • Part 3 (Virtual memory)
  • Part 4 (NUMA systems)
  • Part 5 (What programmers can do - cache optimization)
  • Part 6 (What programmers can do - multi-threaded optimizations)
  • Part 7 (Memory performance tools)
  • Part 8 (Future technologies)
  • Part 9 (Appendices and bibliography)

[Edit: And as Kalen points out in her comment below, you can get the whole thing as a PDF here.]

Enjoy!

PS This is something I'll start posting on in the rest of the year too, as far as SQL Server is concerned.

Categories:
Memory

I was in a discussion earlier today, and it's one I've had lots of times in the past - both inside and outside Microsoft and the SQL team: why doesn't SQL Server do automatic <defrag, index creation, refactoring, de/normalization, backups, CHECKDBs, etc>?

Some of these were considered while I was still on the team, and I was *very* cautious about them. Here's why, taking automatic table de/normalization as an example (this morning's discussion). Assuming the logic to do the de/normalization based on usage patterns can be worked out, here's why I don't think it will ever happen (just off the top of my head):

  • If it's wrong even once, no-one will ever use it again.
  • When should SQL Server change the table definitions? What time of day is best?
  • Where should the new table be placed? In which filegroup?
  • What if there's no space to do it - should the database autogrow because of an automatic process? And then should it shrink afterwards?
  • How far should the change plan parallelize?
  • Who manages the transaction log size if the table is really large? Automatically take log backups? What if there's no space for the log backups? What about space on the log shipping secondaries? And the time required to roll-forward the log on the log-shipping secondaries?
  • What about the impact of extra transaction log on database mirroring? Both in terms of the huge SEND queue preventing log truncation, and the huge REDO queue preventing the mirror coming online quickly?
  • What if a cluster failover occurs during the operation and the rollback prevents the database coming online within the company's RTO?
  • What if there are an replication topologies setup? How do they get quiesced for the schema change?
  • How does the app get changed to handle the different schema, or do views get created automatically? Indexed views or not?
  • What if Change Data Capture is running? Should it automatically create a new capture instance? What if both are already in use?
  • How do SPs get updated to the new schema? Build in the datadude engine as part of SQL Server?

My point: it's a lot harder than you think to just put some automatic behavior into SQL Server. 

Now, saying that, I hope that some of the others do happen at some point, but with lots of config parameters so I can control them.

Categories:
General

People are saying they can't find me - that's a Twitter issue with indexing. Here's the direct link: http://twitter.com/PaulRandal and Kimberly's at http://twitter.com/KimberlyLTripp

Categories:
General

After being mercilessly pressurized by many people (and I blame Jason Massie the most), I've joined Twitter and I'll be commenting lots on it. Why did I finally cave? Well, I've been seeing lots of little things in forums and newsgroups that don't merit a full blog post but are worth letting people know about. So - feel free to follow along. My username is 'PaulRandal' - original, huh? I'm sure I'll get just as sucked into Twitter as I have to FaceBook - luckily I can type fast :-)

Enjoy!

PS You might not find me in Twitter search until I've posted a few times - try http://search.twitter.com/

Categories:
General

In this week's survey I'm interested in what kind of backups you take, and also what recovery model you mostly use. If you have multiple database with different strategies, by all means respond multiple times. The more responses the better! I'll report on the survey results some time over the weekend of 5/2/2009.

A couple of notes on the survey:

  • If you're using BULK_LOGGED at all, just answer as FULL.
  • When I say full backup, I mean either full database, full filegroup, or full file backup. Same for differential.
  • I don't care whether you use a tool outside SQL Server to take your SQL backups (e.g. LiteSpeed) - it's the type of backup that's interesting.

Thanks!

PS No comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers.

A couple of weeks ago I kicked off the latest survey, on what the physical layout of your databases are and why you have them that way (see here for the survey). I let the survey run for a while to get a good sampling, and I wasn't disappointed, with over 1000 responses! Here are the results as of 4/27/2009.

Just like any other 'best practice' kind of topic, the question of how to design the physical layout of a database provokes a lot of (sometimes heated) discussion. There are lots of options and there are even more factors to consider - so the best answer is my perennial favorite "it depends"! In this post, I don't want to tell you how I think you should layout your database - instead I want to discuss some of the options and let you make up your own mind, with the added benefit of data on what your peers are doing with their databases. The main point of this survey was to see what people are doing, rather than as a driver for an editorial blog post.

As you can clearly see from the results above, and predictably, the distribution of layout types shifts as the database size increases - but I was very surprised by the number of single file databases over 10GB. Rather than go through each option in the survey, I'm going to talk a bit about some of the things to consider when planning a layout.

Underlying I/O subsystem

This could be the most important factor to consider. If you only have a single physical drive, for instance, there's arguably not much point creating multiple data files, as that will force the disk heads to bounce back and forth to the different file locations on the disk. On the other hand, if you have a SAN with several thousand drives grouped together into multiple LUNs, your possibilities are a lot wider (and maybe much harder to come up with the optimal layout). Several people asked if I'd go into depth around having multiple controllers, and different drive layouts in a SAN - and my answer is no. I'm not an expert at storage design, which, like indexing, is both an art and a science. There's a good whitepaper that discusses some of this: Physical Database Storage Design, which I helped review back in my MS days.

Performance, recoverability, manageability

Having multiple files with separate storage for each allows reads and writes to be parallelized for increased performance, lowering the amount of disk head contention. When a checkpoint occurs, and pages are written to disk, spreading the I/O load over multiples files can speed up the checkpoint and reduce the IOPS spikes that you may see. It can also lead to reduced contention for the various allocation bitmaps - in the same way as I've described for tempdb. In user databases with a very high rate of allocations, contention can arise on the GAM pages - but it's not common. Some people also advocate having separate filegroups for tables and indexes, and although this can sometimes be more trouble than it's worth, and often turns into a religious debate, I have heard of people getting a perf boost from this.

One of the most convincing reasons (I find) for having multiple filegroups is the ability to do much more targeted recovery. With a single file database, if it gets corrupted or lost, you have to restore the whole database, no matter how large the file is - and this can seriously affect your ability to recover within the RTO (Recovery Time Objective) agreement. By splitting the database into multiple filegroups, you can make use of partial database availability and online piecemeal restores (in Enterprise Edition) to allow the database to be online as soon as the primary filegroup is online, and then restore the remaining filegroups in priority order - bringing the application online as soon as the relevant filegroups are online. You can even use this layout to spread your backup workload - moving to filegroup-based backups instead of database backups, although this isn't very common.

As far as manageability is concerned, there are a few reasons to have multiple filegroups. Firstly, you can isolate a table that requires a lot of I/O (e.g. in terms of index maintenance) on separate storage from other tables, so that maintenance operations (and the I/O overhead of doing them) doesn't interfere with the I/O of the other tables. Also, you can provision different kinds of storage for different tables - in terms of disk speed and RAID level (redundancy), for instance. If you want to be able to move data around, you can do it much more easily if the database is split up, than if it's a single file.

Summary 

Ok - so I lied. I *am* going to offer advice - against one of the options: single filegroup, single file. For smaller databases, this is fine - but as the database size gets larger, say, over tens of GB, then having a single file can become a serious liability. With a single file database (or even a single filegroup database), you lose most of the benefits mentioned above.

Bottom line - as your databases get larger, you're going to need to think more carefully about their layout, otherwise you could run into big problems as your workload increases or when disaster strikes. As the survey results show, this is what your peers are doing.

Next post - this week's survey!

Finally! All those customers out there who are frustrated by not being able to turn on the 'lock pages in memory' option to protect the SQL working set on Standard Edition - the wait is nearly over. Bob Ward (Principal Escalation Engineer in CSS, and a good friend of mine) announced at PASS Europe that the option is soon to be supported - in the May CU for SQL 2008 and the June CU for SQL 2005. You can read a little more in his blog entry here.

If you don't know what this is, checkout this entry in Books Online - it basically prevents Windows paging out SQL memory to disk.

[Edit: the bits containing this fix have been released for 2008. See here for details.]

There's been a recent flurry of confusion and misconceptions about trace flag 1118 in SQL Server 2008. This trace flag switches allocations in tempdb from single-page at a time for the first 8 pages, to immediately allocate an extent (8 pages). It's used to help alleviate allocation bitmap contention in tempdb under a heavy load of small temp table creation and deletion.

There are multiple points of confusion, which I'll address in turn. Then I'll prove that the trace flag still works in SQL Server 2008.

1) Why is the trace flag usually required in 2000? In SQL 2000, whenever a temp table is created in tempdb and a row inserted, an IAM page must be allocated and a single data page must be allocated. These two pages are both 'single-page' allocations, from a mixed extent (see Inside The Storage Engine: Anatomy of an extent for more info). This means that an SGAM allocation bitmap page must be accessed, and a PFS page must be accessed (see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps for more info).

With lots of very small temp tables being created, this means the very first SGAM page and the very first PFS page in the data file are accessed/changed by all the threads, leading to latch contention problems on these two pages. When the temp tables are deleted again, the various pages are deallocated, which again needs to access and change the PFS page, and potentially the SGAM page.

There are two ways to alleviate this problem. Firstly, create multiple data files in tempdb - which splits the latch contention over multiple allocation bitmaps (from having allocations come from multiple files) and thus reduces the contention. The general rule of thumb was one tempdb data file for each processor core. Secondly, turn on TF1118, which makes the first 8 data pages in the temp table come from a dedicated extent. This means one extent is allocated from the GAM page, rather than 8 single pages (and potentially 8 accesses to the SGAM page). The pages within the extent are reserved and allocated singly from this extent, as needed. This also cuts down on contention and is documented in KB 328551.

2) What does reserved vs. allocated mean? When an extent is allocated to a table, the 8 pages in the extent are not immediately allocated as well. Allocating an extent means those 8 pages are reserved exclusively for subsequent allocation to that table. The pages are allocated individually as needed, but no other table can allocate them. This is why such extents are called 'dedicated' extents (see my blog post link above for more details). You can see the counters of reserved pages vs. allocated pages in the output from sp_spaceused.

3) Why is the trace flag not required so much in 2005 and 2008? In SQL Server 2005, my team changed the allocation system for tempdb to reduce the possibility of contention. There is now a cache of temp tables. When a new temp table is created on a cold system (just after startup) it uses the same mechanism as for SQL 2000. When it is dropped though, instead of all the pages being deallocated completely, one IAM page and one data page are left allocated, and the temp table is put into a special cache. Subsequent temp table creations will look in the cache to see if they can just grab a pre-created temp table 'off the shelf'. If so, this avoids accessing the allocation bitmaps completely. The temp table cache isn't huge (I think it's 32 tables), but this can still lead to a *big* drop in latch contention in tempdb.

4) Does the trace flag still exist in 2005 and 2008? Yes it does - KB 328551 clearly states:

Note Trace flag -T1118 is also available and supported in Microsoft SQL Server 2005 and SQL Server 2008. However, if you are running SQL Server 2005 or SQL Server 2008, you do not have to apply any hotfix.

Just to make extra-sure (as I'm always paranoid about saying absolutes), I checked with my good friend Ryan Stonecipher, who's the dev lead for the team that owns allocation (and a bunch of other stuff, including DBCC). He confirmed the code is exactly the same in 2008 as it was in 2005. And I prove it to you below too.

5) And why is it still there in 2005 and 2008? It does the same thing in 2005/2008 as it did in 2000. If the temp table creation/deletion workload is high enough, you can still see latch contention, as the temp table cache won't be enough to completely alleviate the need for creating actual new temp tables, rather than just being able to grab one 'off the shelf'. In that case, using the trace flag to change to extent-based allocation (in *exactly* the same way as for 2000) can help, as can creating more tempdb data files.

As far as data files go though, the number has changed. Instead of a 1-1 mapping between processor cores and tempdb data files (*IF* there's latch contention), now you don't need so many - so the recommendation from the SQL team is the number of data files should be 1/4 to 1/2 the number of processor cores (again, only *IF* you have latch contention). The SQL CAT team has also found that in 2005 and 2008, there's usually no gain from having more than 8 tempdb data files, even for systems with larger numbers of processor cores. Warning: generalization - your mileage may vary - don't post a comment saying this is wrong because your system benefits from 12 data files. It's a generalization, to which there are always exceptions.

6) Why does DBCC IND still show two pages, even with the trace flag on? I've heard of some people being confused by the output of DBCC IND in SQL 2008 when the trace flag is turned on. Creating a single row temp table will only show two pages allocated in the DBCC output - one IAM page and one data page. Yes, that's completely correct - as only two pages are allocated, but the data page comes from a dedicated extent, not a mixed extent. (IAM pages are *always* single-page allocations from mixed-extents).

And now the proof, on SQL 2008.

SELECT @@VERSION;
GO

Microsoft SQL Server 2008 (RTM) - 10.0.1600.22 (Intel X86)

First off, I'll create a temp table without the trace flag enabled, and see what pages the table has allocated, by looking at the first IAM. I'll use a temp table with an 8000+ byte row size, and insert two rows - so we have two data pages for clarity.

DBCC TRACEOFF (1118, -1);
GO

USE tempdb;
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

Now I'll figure out what is the first IAM page, using my sp_AllocationMetadata script (see here for the script and details), and dump it with DBCC PAGE to see the single-page allocations it's tracking, and which dedicated extents are allocated to the table:

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1152921505223016448  IN_ROW_DATA      (1:158)     (0:0)      (1:199)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 199, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A35C08E

Slot 0 = (1:158)                     Slot 1 = (1:200)                     Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A35C0C2

(1:0)        - (1:1016)     = NOT ALLOCATED   

As you can clearly see from the partial output of the dump of the IAM page, there are two single-page allocations and no extents allocated to the temp table. This is what should happen when the trace flag is not enabled.

Now I'll do the same thing with the trace flag 1118 enabled.

USE tempdb;
GO

DROP TABLE #temp;
GO

DBCC TRACEON (1118, -1);
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1224979099301904384  IN_ROW_DATA      (1:208)     (0:0)      (1:158)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 158, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A8FC08E

Slot 0 = (0:0)                       Slot 1 = (0:0)                       Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A8FC0C2

(1:0)        - (1:200)      = NOT ALLOCATED                              
(1:208)      -              =     ALLOCATED                              
(1:216)      - (1:1016)     = NOT ALLOCATED
                              

Now as you can clearly see, there are no single-page allocations, and there's a single extent allocated to the table. Proof that trace flag 1118 still does exactly what it should in SQL Server 2008.

Now for a DBCC IND on the table: 

DBCC IND ('tempdb', '#temp', -1);
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID    
------- ----------- ------ ----------- ----------- -----------
1       158         NULL   NULL        293576084   0
1       208         1      158         293576084   0
1       209         1      158         293576084   0

(I've removed some of the trailing columns for clarity.) We see that it still only lists the two data pages (1:208, 1:209) and the IAM page (1:158) - although an entire extent was allocated to the temp table, only two pages from the extent were actually allocated and used - the rest are reserved for use by that table, but remain unallocated. 

Hopefully this post has cleared up a lot of the confusion around this trace flag and what it does.

In this trip to India (teaching for MS again) we're in and out, with no time for sight-seeing, but there's always time for bird-watching! I got up early this morning and got some photos of some excellent birds. I've included a couple here, and the rest you can check out in the album on Facebook (anyone can access it) at http://www.facebook.com/album.php?aid=2023153&id=1293146061&l=d34686d782.

(Edit: here's another album from the next day, with a *pair* of owls! http://www.facebook.com/album.php?aid=2023240&id=1293146061&l=07cb46ce3e)

The bird on the left is a Spotted Owlet, and on the right is an Indian Pond Heron. Click the images for larger versions. Enjoy!

In my previous posts on FILESTREAM I discussed the directory structure of the FILESTREAM data container and how to map the directories to database tables and columns. In this post I'm going to explain how and when the FILESTREAM garbage collection process works as that doesn't seem to be documented anywhere (even in the FILESTREAM whitepaper I wrote for MS - it wasn't supposed to be that low-level). There seems to be a lot of confusion about how updates of FILESTREAM data work, and when the old versions of the FILESTREAM files are removed. I'm going to explain how it all works and then show you by example.

The basic behavior that is non-intuitive is that there's no such thing as a partial update of FILESTREAM data. If you have 10MB of data stored in a FILESTREAM column (and hence have a 10MB FILESTREAM file), then updating even a single byte of it will result in a whole new 10MB FILESTREAM file. Anything that relies on having an up-to-date version of the database (e.g. log backups, log-shipping, replication) will pick up the entire new 10MB FILESTREAM file. Every time an update is made to that data, a new 10MB FILESTREAM file is created and then subsequently backed-up, replicated, etc. This can lead to unexpectedly large log backups, or network traffic between replication nodes.

Once you realize that new versions of the FILESTREAM files are going to be created, the obvious follow-on question is: when do the old versions get removed? The answer is: it depends! Smile

The old versions are removed by a process called garbage collection - in much the same way that memory garbage collection runs for managed code and deallocates object instantiations that are no longer referenced by any variables. The key point is that nothing needs the object instantiation any more; otherwise the memory garbage collection would be corrupting the run-time memory of the managed code application. The same principle applies for FILESTREAM garbage collection - the old versions of the FILESTREAM files cannot be removed until they are no longer needed.

But what does 'no longer needed' mean for FILESTREAM files? Well, it's kind of the same as for transaction log records. An old version of a FILESTREAM file is no longer needed if the transaction that created it has committed or rolled back, AND there are no other technologies that must read it, like a log backup (when running in the FULL or BULK_LOGGED recovery models), or the transactional replication log reader. In fact, the transaction log VLF containing the log record of the creation of the FILESTREAM data file must be switched to inactive before the FILESTREAM file can be garbage collected. Note that I don't mention database mirroring - in SQL 2008 database mirroring and FILESTREAM cannot be used together.

Once the old FILESTREAM file is no longer needed, it is available for garbage collection. How does the garbage collection process know which FILESTREAM files to physically delete? The answer is that when the file is no longer needed, an entry is made in a special table called a 'tombstone' table. The garbage collection process scans the tombstone tables and removes only the FILESTREAM files with an entry in the tombstone table. You can read more about the tombstone tables in this blog post from the CSS blog.

So when does the garbage collection process actually run? It can't be part of log backups, because in the SIMPLE recovery model, you can't take log backups. The answer is that it runs as part of the database checkpoint process. This is what causes some confusion - an old FILESTREAM file will not be removed until after it is no longer needed AND a checkpoint runs.

Now let's see this stuff in action. I'm going to create a database with FILESTREAM data in and then play around with transactions, log backups, and checkpoints to show you garbage collection working.

CREATE DATABASE FileStreamTestDB ON PRIMARY
    (NAME = FileStreamTestDB_data,
    FILENAME = N'C:\Metro Demos\FileStreamTestDB\FSTestDB_data.mdf'),
FILEGROUP FileStreamFileGroup CONTAINS FILESTREAM
    (NAME = FileStreamTestDBDocuments,
    FILENAME = N'C:\Metro Demos\FileStreamTestDB\Documents')
LOG ON
    (NAME = 'FileStreamTestDB_log',
    FILENAME = N'C:\Metro Demos\FileStreamTestDB\FSTestDB_log.ldf');
GO

USE FileStreamTestDB;
GO

CREATE TABLE FileStreamTest1 (
    DocId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
    DocName VARCHAR (25),
    Document VARBINARY(MAX) FILESTREAM);
GO

Now I'm going to put the database into the FULL recovery model and take a full database backup - which means I must now take log backups to manage the size of the transaction log. It also means that a FILESTREAM file cannot be removed until it has been backed up.

ALTER DATABASE FileStreamTestDB SET RECOVERY FULL;
GO
BACKUP DATABASE FileStreamTestDB TO DISK = 'C:\SQLskills\FSTDB.bak';
GO

Now I'm going to create some FILESTREAM data.

INSERT INTO FileStreamTest1 VALUES (
    NEWID (), 'Paul Randal',
    CAST ('SQLskills.com' AS VARBINARY(MAX)));
GO

Looking in the FILESTREAM data container I created, I have the following file:

 

Remember from the previous blog posts, that the FILESTREAM file filenames are the database log sequence number at the time they were created. Now I'll update the value in an implicit transaction (no BEGIN TRAN and COMMIT TRAN).

UPDATE FileStreamTest1
    SET Document = CAST (REPLICATE ('Paul', 2000) AS VARBINARY(MAX))
    WHERE DocName LIKE '%Randal%';
GO

and we now have the following files:

 

The new file is the 8KB file and the old FILESTREAM value is the 1KB file. If I try doing an explicit CHECKPOINT, nothing changes as the old file is still required as it hasn't yet been backed up. Now I'll do a log backup.

BACKUP LOG FileStreamTestDB TO DISK = 'C:\SQLskills\FSTB_log.bak';
GO

And the files are all still there. Although the first 1KB file is no longer needed, a checkpoint hasn't occurred yet, so garbage collection hasn't run. Now running an explicit CHECKPOINT, the directory still contains the two files. What happened? The transaction log VLF containing the log record for the creation of the FILESTREAM file is still active, so the file is still needed. I have to do *another* log backup and checkpoint before garbage collection kicks in (as that will cause the log to cycle, when there's nothing happening in the database and no active transactions) and the directory view changes to:

 

The alternative would have been to generate more log records, spilling into the next transaction log VLF, then do another log backup which would mark the 'creation' VLF inactive, and then the next checkpoint would run garbage collection on the file. This, of course, would be the normal course of events in a production database.

So, don't get confused if you update a FILESTREAM file, then do a log backup and checkpoint and nothing happens. Remember the transaction log has to have progressed enough for the 'creation' VLF to be inactive too. You can prove this to yourself by creating an explicit transaction at the same time as the FILESTREAM update (in another, implicit transaction). No matter how many times you backup the log and checkpoint the database, the garbage collection will not run until the explicit transaction is committed or rolled back, and then another log backup and checkpoint is run.

I'll leave it as a fun exercise for you to play around with updates in explicit transactions and various backup scenarios to see when garbage collection can and cannot remove old files, but now you know exactly how it works.

A short post to start the day (in India) and then at lunchtime I'll do how it works: FILESTREAM garbage collection.

This is a question that came up recently on an MVP discussion list: why does a 200MB log backup take 40 minutes to restore on my log-shipping secondary?

First answer I thought of was that the I/O subsystem on the log-shipping secondary was very slow, or broken in some way, so I/Os were taking ages to complete. Nope - I/O subsystem was performing well.

Next thing I suggested was the the log-shipping secondary was restoring the log backups using WITH STANDBY and there was something like an index rebuild that was being rolled back. As an aside, when you restore log backups on the log-shipping secondary, you have a choice how they recovery: either WITH NORECOVERY or WITH STANDBY. The first option doesn't allow any access to the database, as the database is still "in recovery". The second option runs the REDO part of recovery, then runs the UNDO part of recovery, but saves the details of what it had to do for UNDO into a file (who's name and location you specify). It then allows read-only access to the database, for queries/reporting/whatever. Obviously if there's a lot of transaction log that has to be undone (i.e. rolling back transactions that weren't committed at the time the log backup completed), then this could take some time. But 40 minutes? No. That wasn't it.

I was partly right on my second guess. The *previous* log backup that was restored WITH STANDBY contained a long-running index operation, and so the undo file that the restore created was *huge*. The next log backup that's restored after a RESTORE ... WITH STANDBY, must first undo everything in the undo file (i.e. put the database back to the exact state it was in as if the WITH STANDBY part of the restore never happened) before it can restore more transaction log. In this case, no-one had noticed that the previous log restore *also* took a lot of time and created the huge undo file. It just looked like the 200MB log backup was causing the problem.

My advice: in any kind of there's-suddenly-a-corruption-or-performance-problem situation, don't just look at what's happening now. Always look at what happened leading up to the problem, as the answer is often there.

Every so often (well, ok, very frequently) my internal list of 'I must blog about that' gets too big and I have an almost uncontrollable urge. Luckily, these urges end up with me blogging several times, otherwise who knows what kind of trouble I'd get into? Smile

First up today is some pretty deep internals about how the transaction log works in a specific situation. This has come up a few times in the last few internals/maintenance classes I've taught in the transaction log module so I wanted to get a blog post out there that proves my answer is correct. The question is easier to frame with a picture, borrowed from my TechNet Magazine article from February (see here).

  

The image shows the circular natue of the transaction log, and how VLFs can be overwritten once the log has wrapped around, as long as they are no longer active (red = active log/VLF, blue = inactive log/VLF). If none of this makes sense to you, go read the TechNet Magazine article and it should do.

The question is: what happens if log records continue being written to VLFs 1 and 2 in the picture, and eventually the end of VLF 2 has been reached, but VLF 3 is still active, and can't be overwritten? Does the database just stop?

Let's find out.

I'm going to create a database and then engineer the situation above.

CREATE DATABASE LogTestDB ON PRIMARY (
    NAME = LogTestDB_data,
    FILENAME = N'C:\SQLskills\LogTestDB_data.mdf')
LOG ON (
    NAME = 'LogTestDB_log',
    FILENAME = N'C:\SQLskills\LogTestDB_log.ldf',
    SIZE = 5MB);
GO

The database is in the SIMPLE recovery model. Well, to be completely honest, it's in the FULL recovery model, but there hasn't been a full database backup yet, so as far as transaction log behavior is concerned, it behaves as if it was in SIMPLE (the log can be cleared by a CHECKPOINT) - again, if this makes no sense, go read the TechNet Magazine article. 

And then see how many VLFs it has using the DBCC LOGINFO command (and this is the *only* way to see how many VLFs there are):

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ -----------
2        1245184    8192          31       2        64     0
2        1245184    1253376       0        0        0      0
2        1245184    2498560       0        0        0      0
2        1499136    3743744       0        0        0      0

We've got 4 VLFs. The Status column tells us whether the VLF is active or not. A status of 2 means the VLF is active, 0 means it's not. The sequence number (FSeqNo) is the logical order of the VLFs within the log. The FileSize is in bytes (so each VLF is about 1.25 MB). Right now there's only one active VLF.

Now I'll engineer the situation in the image above. I'm going to fill the log so that VLFs 1, 2, and 3 are active. Then I'm going to start an explicit transaction that will hold VLF 3 and onwards active. Then I'll continue filling the log so it wraps around and starts to fill up VLF 1 again.

USE LogTestDB;
GO
CREATE TABLE BigRows (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO

SET NOCOUNT ON;
INSERT INTO BigRows DEFAULT VALUES;
GO 300

I've filled up VLFs 1 and 2, and started on 3. Let's check with DBCC LOGINFO:

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ -----------
2        1245184    8192          31       2        64     0
2        1245184    1253376       32       2        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       0        0        0      0

As you can see, the first 3 VLFs now have a status of 2 (active), and they're in sequence. Now I'll create an explicit transaction that will prevent VLF 3 and onwards being cleared.

BEGIN TRAN
INSERT INTO BigRows DEFAULT VALUES;
GO

If I explicitly do a CHECKPOINT now, VLFs 1 and 2 will clear:

CHECKPOINT
GO

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ----------
2        1245184    8192          31       0        64     0
2        1245184    1253376       32       0        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       0        0        0      0

Now I'll carry on filling up the BigRows table so the log wraps around and starts filling up VLFs 1 and two again.

INSERT INTO BigRows DEFAULT VALUES;
GO 300

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ----------
2        1245184    8192          35       2        128    0
2        1245184    1253376       32       0        64     0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       34       2        64     0

You can see that the log has wrapped around now, but VLFs 3 and 4 are still active. Look at the sequence numbers of the active VLFs... the active log is VLF 3 then 4 then 1, with sequence numbers 33 then 34 then 35. Now if I carry on filling up the table, what's going to happen when the log bumps up against VLF 3 that is still active? Will it stop or will it grow?

INSERT INTO BigRows DEFAULT VALUES;
GO 300

DBCC LOGINFO ('LogTestDB');
GO

FileId   FileSize   StartOffset   FSeqNo   Status   Parity CreateLSN
-------- ---------- ------------- -------- -------- ------ ------------------
2        1245184    8192          35       2        128    0
2        1245184    1253376       36       2        128    0
2        1245184    2498560       33       2        64     0
2        1499136    3743744       34       2        64     0
2        253952     5242880       37       2        64     36000000049300052
2        270336     5496832       38       2        64     36000000049300052
2        253952     5767168       0        0        0      36000000107500066
2        335872     6021120       0        0        0      36000000107500066
2        253952     6356992       0        0        0      36000000190700020
2        401408     6610944       0        0        0      36000000190700020
2        253952     7012352       0        0        0      37000000037300040
2        466944     7266304       0        0        0      37000000037300040

The answer is that it grew, and kind of skipped the active VLFs! Look at the sequence numbers. The new sequence of the active log is VLF 3 then 4 then 1 then 2 then 5 then 6, as you can see from the sequence numbers. Once the active transaction I created commits or rolls back, VLFs 3, 4, 1, and 2 can clear and then the 'normal' sequencing of VLFs in the log will resume.

Every VLF has a small header which contains the sequence number of the VLF within the transaction log, so the log can kind of do contortions to work around active VLFs in the middle of the log. Very cool.

Ok - that was fun - now I feel better!

PS In the last couple of DBCC LOGINFO dumps, where the log has wrapped around, you can see that the parity bits for the log blocks in the VLFs have changed, as I explained in one of my previous posts Search Engine Q&A #24: Why can't the transaction log use instant initialization?.

Last week I kicked off a survey about the physical layout of your databases, based on database size. I'm running it for two weeks and will report on the results next Friday.

One thing I missed was 'multiple filegroups over multiple drives/arrays/LUNs for any reason' in the > 1TB category - not enough options on the free surveys I use. If you fall into this category, use the final option on the >1TB survey and I'll lump them all together.

As of this morning there were 860 responses to the survey so far - very cool. Check it out at http://www.sqlskills.com/BLOGS/PAUL/post/Weekly-survey-does-size-really-matter-or-is-it-what-you-do-with-it.aspx.

Thanks!

Categories:
Surveys

This feels like a bit of a strange post for me to make, but I want to make it anyway, and don't take it the wrong way...

I've recently started following SQLBatman's blog and I was astounded that he had to post a follow-up to his list of SQL blogger rankings (disclaimer: including ours - thanks!) because he'd been getting emails from people who obviously weren't as high in his list as they thought they should be, even though they were *his* rankings of their blogs. It struck me that with the amazingly connected society we live in now, and the ubiquitousness of things like blogs, Facebook (to which we're unfortunately addicted), and Twitter (which we're sternly resisting so far) - maybe we're becoming a little too focused on how popular our blogs are or how many people are following our Twitter feeds?

I used to be like that. When I first started blogging about 3 years ago on the Storage Engine blog at Microsoft, and then when I left and switched to this one a year later, I was pretty obsessed with how many people were reading the blog posts. Was I posting interesting info? Could I get away with jokes? Would my opinions offend someone? Was anyone linking to me? Why aren't I in XYZ's blog-roll?

Then I realized that, to be honest, the obsessing was taking the fun out of blogging and now I don't pay that much attention to the stats and just post stuff I think is interesting. I wanted to do this blog post to give a piece of advice to new bloggers and those who thought they should be higher up in, or just in, SQLBatman's (or anyone else's) rankings: don't get wrapped up in the numbers and rankings, and don't complain when someone doesn't rank your blog as highly as you think they should, or list it in their blog-roll. It's very easy to get bitter when a post doesn't get the numbers you think, or no-one links to it, and then start thinking along the lines of "why am I spending all this time doing this when no-one's watching?"

People are watching - there are thousands of 'lurkers' our there who you'll never see and never hear from, but they're watching and learning from what you post. Just focus on the content and if it's good you'll be discovered and/or ranked higher. Don't think about what to post that will make your ranking higher, think about what to post that's unique, interesting. and useful to people in the community. You can't really control what people think of what you post. I'm sure some people won't like this post, but I'm going to post it anyway.

From Field of Dreams 2: "If you build it, they will come".

Categories:
General

Not much to do here in Tokyo while waiting for the Bangkok flight except read and blog, and I've already done enough reading for one day on the flight here - so that leaves blogging!

Here's an interesting case that I got involved in on SQLServerCentral (here's the original thread, linked here with permission of the original poster). The problem was that every night the maintenance job would run and end up producing a stack dump, from a 211 error (which always indicates metadata corruption causing the Engine to trip over). I guessed system table corruption but the consistency checking portion of the maintenance job was apparently running fine. So I asked for one of the stack dumps, the output from the maintenance plan, and the latest SQL error log to have a look at (I have a weakness for these things, as you well know).

After having a look, I saw some strange behavior in the error log - every night there 10 or so DBCC CHECKDBs of master within about 2 minutes - corresponding to the consistency checking part of the maintenance job that was clearing running DBCC CHECKDB against all the system and user databases. Then I remembered that vanilla 2005 SP2 has two nasty maintenance plan bugs - one of which caused the consistency checking part of the maintenance plan to malfunction when run, and run DBCC CHECKDB against master every time instead of the database specified. Fixing that would then lead to the being able to properly run consistency checks and find the corrupt database.

So - moral of the story is to make sure you're not running vanilla 2005 SP2 (or one of the builds close to it, 3150 to 3158) otherwise you could get into problems with maintenance plans - worse still, you may think you've been running consistency checks all this time on your user databases, but if no-one's checking the SQL error logs, you could just be repeatedly consistency checking master.

The KB that describes the bug is 934459 (available at http://support.microsoft.com/kb/934459), or you can install SP3 (available at http://support.microsoft.com/?kbid=955706).

Microsoft's popular performance troubleshooting whitepaper has been updated for SQL Server 2008. You can download Troubleshooting Performance Problems in SQL Server 2008 at http://download.microsoft.com/download/D/B/D/DBDE7972-1EB9-470A-BA18-58849DB3EB3B/TShootPerfProbs2008.docx.

I've also added it to our whitepaper-links collection page at http://www.sqlskills.com/whitepapers.asp.

At the last few conferences I've presented at, there have been questions about using SSDs (Solid-State Drives) for enterprise storage and whether that will change some of the database maintenance practices. My answer to that is "I don't know" (ha - bet you thought I was going to say "It depends!") because adoption of SSDs is very low. I haven't been able to find much info about using them, but the Microsoft Research group in Cambridge just published a research paper Migrating Sever Storage to SSDs: Analysis of Tradeoffs, which does a nice job of walking through the issues involved and concludes that for the majority of workloads, it makes more economic sense to host them on HDDs. The exception is for top-end OLTP databases. I'll warn you that this isn't a whitepaper - it's a research paper, and gets a bit deep into algorithms and mathematical analyses, but if you're up to the challenge it's a great read.

You can download the paper from http://research.microsoft.com/en-us/um/people/antr/ms/ssd.pdf. Enjoy!

PS I found it on James Hamilton's blog.

This week's survey is a little more complicated. I'm interested in the physical layout of your databases. I've got four surveys, for a variety of database sizes. Please vote multiple times in each survey, as you see fit - and by all means forward this link to your friends/clients/etc or re-blog/tweet it. I'm going to report on this survey in two weeks, to give a bit more time for people to respond (and because we're travelling next week). I think we're going to see some interesting statistics come out of this - the more people that respond the better. I'll report on it 4/24/09.

One thing to note - this is just for user databases, not for tempdb. In the surveys, "multiple filegroups" implies multiple files too, and if you don't have them spread exactly one per drive/etc, just choose that option - I only have 10 options to choose from in the free surveys. 

(If you're in the over 1TB range and have multiple files/filesgroups spread over multiple drives/arrays/LUNs, vote using the last option on the >1TB survey and I'll lump them together.) 

Phew - thanks!

Last week's survey was on how *you* manage the size of your transaction log (see here for the survey). Here are the results as of 4/10/2009 - the most popular survey yet:

In my opinion (guarantee that I'm going to say stuff that some of you won't agree with!), there are only two valid answers in this survey: #1 - run in the FULL recovery model and take regular log backups, and #4 - run in the SIMPLE recovery model all the time. The last answer is applicable if you run out of log space even though you're in either of these situations but isn't a general strategy like #1 or #4. IMHO, you should be in one of these two situations and in the rest of this editorial I'll explain why. I'm not going to touch on *why* your transaction log might start to fill up, instead here are some links:

Now for the survey options:

  1. Take regular log backups. I'm very pleased to see the vast majority of respondents doing this, as it's the only valid log size management strategy when in the FULL recovery model (same thing applies to BULK_LOGGED of course, but with a few twists). Once you take that first database backup in the FULL recovery model, SQL Server assumes you're going to manage the transaction log size from that point on by taking log backups. Unfortunately that isn't documented in big, red flashing letters in Books Online - so people can get themselves into troubel inadvertently. Also, the FULL recovery model is the default, and is required for database mirroring - which further adds to the potential for people (such as involuntary DBAs) to accidentally switch into I-will-take-log-backups mode and then not take them. If you don't want to take log backups for recovery purposes, or you don't want to use database mirroring, don't use the FULL recovery model - it's as SIMPLE as that (ha ha). You might argue and say that you're only using FULL because of database mirroring, and don't want to take log backups. I'd argue back and say that if you care enough to have hot standby of your database, you must also take backups - as you can't rely solely on a redundant copy of your database on a different server.
  2. Use BACKUP LOG WITH NO_LOG/TRUNCATE_ONLY regularly. These two commands do basically the same thing - allow the log to be cleared without taking a log backup. What's the point if you're not taking log backups? - just switch to SIMPLE and let the checkpoints clear the log. In fact, in 2008 these two commands have been removed. See my blog post BACKUP LOG WITH NO_LOG - use, abuse, and undocumented trace flags to stop it.
  3. Use BACKUP LOG WITH NO_LOG/TRUNCATE_ONLY when it fills up. Same as above. You might argue that you're only keeping the log around in case there's a disaster, so that you can take a log backup at that point and use it to recover up to the point of the disaster. I'd argue that's broken on two counts: 1) what if the log file is damaged and you can't back it up? 2) that's *all* the transaction log since the last full database backup you took (if you break the log backup chain and then take a full database backup, that backup becomes the base of subsequent log backups) so that may take a long time to restore and replay...
  4. Run in the SIMPLE recovery model all the time. If you don't need to use FULL, don't. Running in SIMPLE is perfectly acceptable, as long as you don't mind losing work in the event of a disaster.
  5. Switch to SIMPLE when it fills up, then back to FULL. This is like #s 2 and 3 - what's the point?
  6. Switch to SIMPLE when it fills up, shrink the log, then switch back to FULL. This is worse than 2, 3, or 5. If you shrink the log, then it's going to grow again - possibly causing VLF fragmentation (see Transaction Log VLFs - too many or too few?), and definitely causing your workload to pause while the log grows, as the log can't use instant initialization (see Search Engine Q&A #24: Why can't the transaction log use instant initialization?).
  7. Shutdown SQL Server and delete the transaction log file(s). Just don't get me started on this one - I'm glad no-one 'fessed up to doing it. There are many reasons why this is daft, including: 1) you have to shutdown to do it, so your workload is off-line 2) if the database wasn't cleanly shut down, it won't be able to be started again without using EMERGENCY mode repair, and your data will be transactionally inconsistent 3) as the log can't be instant initialized, the database won't come online until the log has been created and zero'd. Just don't do this. Ever.
  8. Add more transaction log files. As a general strategy for managing the size of the log, this is not good. Extra log files have zero effect on performance (the myth that SQL Server writes in parallel to the log files is just that - a myth) and make management more tricky. However, if your log fills up for some out-of-the-ordinary reason, then adding another log file may be the only way to allow the database to keep running. If you have to do this though, make sure that you remove the extra log file(s) when you can to keep log management uncomplicated.

To summarize, if you want to be able to take log backups to aid in point-in-time or up-to-the-second recovery, use the FULL recovery model. If not, use SIMPLE and you won't need to mess around with the log when it fills up because you're not taking log backups.

Next post - this week's survey! (And thanks to all those who are responding to them!)

PS For those of you who sent me details about your databases from the survey back at the start of March (see here) - I haven't forgotten. I was waiting to get a decent sample size and now I'm going to go through the data. If you want to send me any more data, you've got until Sunday.

Theme design by Nukeation based on Jelle Druyts