New VLF status value

At least since I started working on the SQL Server team (just after 7.0 shipped), there have only been two VLF status codes:

  • 0 = the VLF is not active (i.e. it can be (re)activated and overwritten)
  • (1 = not used and no-one seems to remember what it used to mean)
  • 2 = the VLF is active because at least one log record in it is ‘required’ by SQL Server for some reason (e.g. hasn’t been backed up by a log backup or scanned by replication)

A few weeks ago I learned about a new VLF status code that was added back in SQL Server 2012 but hasn’t come to light until recently (at least I’ve never encountered it in the wild). I went back-and-forth with a friend from Microsoft (Sean Gallardy, a PFE and MCM down in Tampa) who was able to dig around in the code to figure out when it’s used.

It can show up only on an Availability Group secondary replica (not on the primary replica, and it isn't used in Database Mirroring):

  • 4 = the VLF exists on the primary replica but doesn’t really exist yet on this secondary replica

The main case where this status can happen is when a log file growth (or creation of an extra log file) has occurred on the primary replica but it hasn’t yet been replayed on the secondary replica. More log records are generated on the primary, written to a newly-created VLF, and sent across to the secondary replica to be hardened (written to the replica’s log file). If the secondary hasn’t yet replayed the log growth, the VLFs that should contain the just-received log records don’t exist yet, so they’re kind of temporarily created with a status of 4 so that the log records from the primary can be hardened successfully on the secondary. Eventually the log growth is replayed on the secondary and the temporary VLFs are fixed up correctly and change to a status of 2.

It makes perfect sense that a scheme like this exists, but I’d never really thought about this case or experimented to see what happens.
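If you want to look at VLF statuses yourself, something like the following will show them (DBCC LOGINFO is undocumented but has been around for a long time and shows the status in its Status column; sys.dm_db_log_info is the documented alternative, but I believe it was only added in SQL Server 2016 SP2, so it isn't available on 2012/2014):

-- Run in the context of the database you're interested in
DBCC LOGINFO;
GO

-- Documented DMV equivalent (SQL Server 2016 SP2 and later)
SELECT [vlf_sequence_number], [vlf_size_mb], [vlf_status]
FROM sys.dm_db_log_info (DB_ID ());
GO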

Anyway, now you know what status 4 means if you ever see it (and thanks, Sean!).

SQLskills SQL101: Why does my heap have a bunch of empty pages?

SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Here’s a question I’m often asked (paraphrasing):

I’ve got a large heap table where the space isn’t being given up when I delete a large number of records, but then when I shrink the database the heap is reduced in size. Can you explain?

This behavior is how SQL Server works, but it’s pretty non-intuitive. When a page in an index becomes empty, it’s always deallocated, as an empty page isn’t allowed in a SQL Server index structure from SQL Server 2005 onward. However, the structure of a heap is different and as a result, the behavior is too.

Whenever a row is deleted in a heap, it's usually the case that the page containing the row does not become empty. However, if the heap page the row is stored on does become empty as a result of the delete, the page cannot be deallocated from the table unless an exclusive (X) table lock is held (to remove the page from the internal free-space "tracking" mechanism that is in place for heaps). A table-level X lock isn't commonly held unless lock escalation has occurred (because you're deleting enough rows to trigger escalation from individual row X locks to a single table X lock) or you specifically use the TABLOCK hint on the delete statement. Because neither of these circumstances is common, the empty heap page usually cannot be deallocated.

There is a Knowledge Base article that describes this phenomenon: KB 913399. However, the KB article only references versions up to and including SQL Server 2005; the behavior exists in all versions and is very easy to reproduce if you want to prove it to yourself. Also, if you're using one of the flavors of snapshot isolation, then even using TABLOCK won't allow the pages to be deallocated.

The script below will create a database, create a table with one row per heap page, and show you how many pages and rows there are and how full each page is. Next it'll delete all the rows and show you that the pages are still there, with no rows in them. Give it a try!

CREATE DATABASE [EmptyHeapTest];
GO
USE [EmptyHeapTest];
GO

CREATE TABLE [test] ([c1] INT IDENTITY, [c2] CHAR (8000) DEFAULT 'a');
GO

SET NOCOUNT ON;
GO
-- Insert 1000 rows, which will allocate 1000 pages
INSERT INTO [test] DEFAULT VALUES;
GO 1000

-- How many pages and how full?
SELECT [page_count], [record_count], [avg_page_space_used_in_percent]
FROM sys.dm_db_index_physical_stats (DB_ID (N'EmptyHeapTest'), OBJECT_ID (N'test'), 0, DEFAULT, 'DETAILED');
GO
page_count           record_count         avg_page_space_used_in_percent
-------------------- -------------------- ------------------------------
1000                 1000                 98.974549048678

-- Delete all the rows
DELETE FROM [test];
GO

-- Did all the pages get deleted?
SELECT [page_count], [record_count], [avg_page_space_used_in_percent]
FROM sys.dm_db_index_physical_stats (DB_ID (N'EmptyHeapTest'), OBJECT_ID (N'test'), 0, DEFAULT, 'DETAILED');
GO
page_count           record_count         avg_page_space_used_in_percent
-------------------- -------------------- ------------------------------
1000                 0                    0

The empty pages will be reused by subsequent inserts (into the same table, of course), but if the space isn’t going to be reused following a large delete in a heap, you might consider using the TABLOCK hint to allow the empty pages to be deallocated and the space made available for other objects in the database to use.
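To see the difference for yourself, you can repopulate the test heap and delete with the TABLOCK hint; as long as snapshot isolation isn't in play (see the KB note above), the empty pages should be deallocated this time. A minimal sketch, continuing the script above:

-- Repopulate the heap
INSERT INTO [test] DEFAULT VALUES;
GO 1000

-- Delete again, this time holding a table-level X lock
DELETE FROM [test] WITH (TABLOCK);
GO

-- The page count should now drop to (almost) nothing
SELECT [page_count], [record_count], [avg_page_space_used_in_percent]
FROM sys.dm_db_index_physical_stats (DB_ID (N'EmptyHeapTest'), OBJECT_ID (N'test'), 0, DEFAULT, 'DETAILED');
GO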

An alternative is to just use a clustered index instead, or, if a heap is necessary, to rebuild the heap using ALTER TABLE … REBUILD (which was added in SQL Server 2008 to support enabling compression on a heap), with the caveat that this will cause all the table's nonclustered indexes to be rebuilt as well.
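For our test table the rebuild is just the following (remember that on a real table this rebuilds any nonclustered indexes too):

ALTER TABLE [test] REBUILD;
GO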

On the extreme end (in my opinion), you could reclaim the empty heap space using a shrink operation. Shrink won’t free up space inside pages as it moves them (with the exception of compacting LOB pages as it goes – somewhat unsuccessfully depending on which version and build you’re on – see KB 2967240), but it will remove empty pages rather than moving them. This will effectively shrink the heap after a large delete, but with the usual caveats about shrink causing index fragmentation and generally being an expensive, slow operation to perform.
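And if you do decide to go the shrink route (again, be aware of the caveats above), a minimal example against the test database would be something like:

-- Shrink the whole database, or target just the data file with DBCC SHRINKFILE
DBCC SHRINKDATABASE (N'EmptyHeapTest');
GO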

So now you know why you may have a bunch of empty pages in your heaps!

SQLskills SQL101: Running out of ints and bigints

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

We’re teaching a class this week, and one topic that always comes up is using an int identity as a clustering key and the possibility of running out of integers. Depending on your insert volume, this might be quite likely, as an int can only store 2^32 (^ = ‘to the power’) or about 4 billion values, between -2^31 and 2^31-1.

Imagine that you have a theoretical system that can create a thousand data rows per second. Using an int identity value increasing by 1 and starting at 1, you’ll run out of values when the value hits 2^31-1 and tries to insert the next value. Let’s simplify the math by just saying that 2^31 is the limit. With a thousand values per second, that would mean 2^31 / 1,000 = 2.15 million seconds or just under 25 days. While many of you don’t sustain 1,000 rows per second, this is still a very problematic limitation.
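If you want to see what running out actually looks like, it's easy to demo by seeding an int identity at the maximum value (the table name here is just for illustration):

-- Seed the identity at the largest int value
CREATE TABLE [dbo].[IntLimit] ([c1] INT IDENTITY (2147483647, 1));
GO

-- This insert succeeds, using 2,147,483,647
INSERT INTO [dbo].[IntLimit] DEFAULT VALUES;
GO

-- This insert fails with an arithmetic overflow error
INSERT INTO [dbo].[IntLimit] DEFAULT VALUES;
GO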

One solution is to use a bigint identity as the key. This can store 2^64 or about 18.4 quintillion (18.4 billion billion) values, between -2^63 and 2^63-1.

Every so often someone asks whether it’s possible to run out of bigint values. My answer is no. Well, technically yes, there is a limit, but in practical terms the answer is no.

Now imagine that you have a theoretical system that can create a million data rows per second, with a bigint identity value increasing by 1 and starting at 1. You’ll run out of values when the value hits 2^63-1 and tries to insert the next value. With a million values per second, that would mean 2^63 / 10^6 = 9.2 trillion seconds or approximately 292.5 thousand years. And by then it’s someone else’s problem… :-) And that’s only for half the possible range of bigint values.
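If you want to check the arithmetic yourself, a quick back-of-the-envelope calculation (nothing SQL Server specific here, it's just the math from above):

-- Years to exhaust the positive bigint range at a million rows per second
SELECT POWER (CAST (2 AS FLOAT), 63) / 1000000 / 60 / 60 / 24 / 365.25 AS [years];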

Now what about the storage for those values? Doing a quick test of a heap with a single bigint identity column shows me that I can get 453 rows per 8KB data file page (don’t forget the record overhead, slot array overhead, and that the heap pages won’t be filled completely because of the way the free space caching and searching works). A terabyte of data would store roughly 61 billion rows.
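The kind of quick test I mean is nothing more than the following sketch (the table name is hypothetical, and the exact rows-per-page number you get may vary a little by version and build):

-- A heap with a single bigint identity column
CREATE TABLE [dbo].[BigintTest] ([c1] BIGINT IDENTITY);
GO
INSERT INTO [dbo].[BigintTest] DEFAULT VALUES;
GO 10000

-- Divide record_count by page_count for the approximate rows per page
SELECT [page_count], [record_count], [avg_page_space_used_in_percent]
FROM sys.dm_db_index_physical_stats (DB_ID (), OBJECT_ID (N'dbo.BigintTest'), 0, DEFAULT, 'DETAILED');
GO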

So with 1 million rows per second, you’ll be generating 1 million x 3,600 (seconds in an hour) x 24 (hours in a day) = 86.4 billion rows per day, so you’ll need about 1.4 terabytes of new storage per day. If you’re using the bigint identity as a cluster key, each row needs new space, so you’ll need almost exactly 0.5 petabytes of new storage every year.
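Again, the arithmetic is easy to verify (using the 453 rows per 8KB page from my test, and 128 pages per MB):

SELECT
    1000000.0 * 3600 * 24 AS [rows_per_day],
    1000000.0 * 3600 * 24 / 453 / 128 / 1024 / 1024 AS [TB_per_day],
    1000000.0 * 3600 * 24 / 453 / 128 / 1024 / 1024 * 365 AS [TB_per_year];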

At that rate, actually running out of bigint values AND storing them would take roughly 150 thousand petabytes. This is clearly impractical – especially when you consider that storing *just* a bigint is pretty pointless – you’d be storing a bigint and some other data too – probably doubling the storage necessary, at least.

Why is this interesting? We've had a number of clients over the years that didn't consider their data volume and designed a schema using int keys instead of bigint keys. When the inevitable happened and they ran out of int values, the process of changing to a bigint key was quite painful, as there's no really easy, space- and log-efficient way to do it once you have 2 billion rows, especially if constraints are involved, and application changes need to be made to allow 8-byte values instead of 4-byte values in result sets.

A common stop-gap solution people use when they run out of int values is to reseed the identity to -2^31 (keeping the increment at 1). As a short-term solution this does work, especially if the int key is a surrogate key and doesn't have a business meaning, but it's not ideal as a long-term solution, as you'll only run out again once the values climb back up to the positive values that are already in use. Ultimately, you'll need to make the int-to-bigint change.
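The reseed itself is a one-liner (hypothetical table name; note that because the table already contains rows, the next value handed out will be the reseed value plus the increment):

-- Reseed the identity to the minimum int value; the next insert gets -2,147,483,647
DBCC CHECKIDENT (N'dbo.MyBusyTable', RESEED, -2147483648);
GO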

Summary: make sure that when you’re designing a new schema, you think through the maximum values required and pick appropriate data types then and there. Changing data types can be very painful once the system has been in production for a while and there’s a lot of data in the schema.

PS If you honestly believe you'll run out of bigint values, you can use a decimal or numeric value, both of which can hold -10^38+1 to 10^38-1. Those are really big numbers. 10^38 is about 2^126, or 100 undecillion, or roughly 2^62 times more values than a bigint can hold. Using our million-row-per-second server, inserting 10^38 values would take 10^38 / 10^6 = 10^32 seconds = roughly 3 trillion trillion years. Now if you're concerned about *that*, the sun will have become a red giant and incinerated the Earth in about 5 billion years…