Read committed doesn’t guarantee much…

A while back I was involved in an email thread where people were wondering about some ‘weird’ behavior from SQL Server. The problem was occurring on SQL Server 2016 and used the default isolation level of read committed. The scenario was as follows:

  • Create a table with a few columns
  • Batch 1: In one SSMS window, do the following (which takes 10 seconds to run):
    • Start a transaction
    • Insert 1,000 rows into the table, with a 0.01 second WAITFOR DELAY between each insert
    • Commit the transaction
  • Batch 2: In a second SSMS window:
    • Select * from the table

The ‘weird’ behavior is that when the “Batch 2” select completes, after having been blocked by the “Batch 1” transaction, it doesn’t return all 1,000 rows (even though “Batch 1” has completed). Furthermore, depending on when the “Batch 2” select is started, during the 10-seconds that “Batch 1” executes, “Batch 2” returns different numbers of rows. This behavior had also been reported on earlier versions of SQL Server as well. It’s easy to reproduce on SQL Server 2016/2017 and can be reproduced in all earlier versions with a single configuration change (more details in a moment).

Additionally, if the table has a clustered index created, 1,000 rows are returned every time, on all versions of SQL Server.

So why is this weird? Many people expect that all 1,000 rows will be returned every time AND that the structure of the table or the version of SQL Server should not make any difference.

Unfortunately, that assumption is not correct when using read committed. The read committed isolation level guarantees that only committed data will be read; it does NOT guarantee how much of the committed data will be returned and is prone to inconsistent analysis problems. If you want to avoid inconsistent analysis, you’d need to increase your isolation level (if using locking) or change to versioning for read committed (or even snapshot isolation).

However, I do agree that the assumption is reasonable, even though it’s not correct.

The definition of the requirement is straightforward however. SQL Server guarantees that only committed data will be read, not how much of it will be read, and so the amount of committed data returned in this case is variable. Furthermore, depending on the table structure, SQL Server version, and configuration option (which I’ve mentioned twice now, but haven’t explained as I don’t want to spoil my story…), the number of rows returned will vary wildly.

So, what’s going on? Why the discrepancy in the number of rows?

It’s because the table is a heap.

The ‘weird’ behavior manifests itself when the heap has an extent allocated to it immediately, from which the first data pages are allocated. When the allocation-order scanner for the select starts, it looks at the PFS bytes for the 8 pages in the extent to see which ones are allocated. These will be scanned. Depending on when the select starts within the 10 seconds that “batch 1” executes, there will be more rows or fewer rows read by the scanner (and returned in the “batch 2” results) because more or fewer of the PFS byte 0x40 bits will have been set indicating the page in the extent is allocated.

For a table to have an extent allocated to it immediately, mixed pages have to be disabled, which is the default for SQL Server 2016 onward, which is why the ‘weird’ behavior manifests with a small number of rows on SQL Server 2016. However, if you’ve enabled trace flag 1118, you’ll see this behavior in all versions. This is why some people report seeing the ‘weird’ behavior on versions earlier than SQL Server 2016.

When mixed extents are NOT disabled, i.e. in earlier versions than SQL Server 2016 when 1118 is not enabled, the first 8 pages allocated are mixed pages, and so the allocation order scanner has to pick them up individually and sees them all (without going into details of the synchronization around the single-page slot array on the first IAM page in the IAM chain/allocation unit for the table), and so as long as no more than 8 data pages are used by the insert, all the rows on them will be returned by the select in this example.

So there you have it, the behavior is “by design” and hasn’t changed at all, but is understandably confusing unless you know exactly how the Engine is performing the select under the covers.

And if you create a clustered index, the ‘weird’ behavior doesn’t occur for this example in any version. This is because the select gets hung up on the S(hare) row lock for the first row in the clustered index, and by the time the select is granted the S lock, the insert has completed and the index leaf scan picks up all 1,000 rows.

Don’t think that using a clustered index stops ‘weird’ behavior using read committed; it doesn’t. However, it can reduce some forms of inconsistencies. In this old blog post, I demonstrate a scenario where a scan of clustered index returns four rows when only three rows exist, using read committed. Kimberly also describes and shows “inconsistent analysis” in these blog posts:

So to summarize: the read committed isolation guarantees that only committed data is read; however, there are many forms of potential concerns (known as “inconsistent analysis”). It really pays to understand what these are and what the various isolation levels that SQL Server provides do and do not allow. In addition to the blog posts above, check out the MSDN lesson titled: Understanding the Available Transaction Isolation Levels here.

SQLskills SQL101: Running out of ints and bigints

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out

We’re teaching a class this week, and one topic that always comes up is using an int identity as a clustering key and the possibility of running out of integers. Depending on your insert volume, this might be quite likely, as an int can only store 2^32 (^ = ‘to the power’) or about 4 billion values, between -2^31 and 2^31-1.

Imagine that you have a theoretical system that can create a thousand data rows per second. Using an int identity value increasing by 1 and starting at 1, you’ll run out of values when the value hits 2^31-1 and tries to insert the next value. Let’s simplify the math by just saying that 2^31 is the limit. With a thousand values per second, that would mean 2^31 / 1,000 = 2.15 million seconds or just under 25 days. While many of you don’t sustain 1,000 rows per second, this is still a very problematic limitation.

One solution is to use a bigint identity as the key. This can store 2^64 or about 18.5 quintillion (18.5 billion billion) values, between -2^63 and 2^63-1.

Every so often someone asks whether it’s possible to run out of bigint values. My answer is no. Well, technically yes, there is a limit, but in practical terms the answer is no.

Now imagine that you have a theoretical system that can create a million data rows per second, with a bigint identity value increasing by 1 and starting at 1. You’ll run out of values when the value hits 2^63-1 and tries to insert the next value. With a million values per second, that would mean 2^63 / 10^6 = 9.2 trillion seconds or approximately 292.5 thousand years. And by then it’s someone else’s problem… :-) And that’s only for half the possible range of bigint values.

Now what about the storage for those values? Doing a quick test of a heap with a single bigint identity column shows me that I can get 453 rows per 8KB data file page (don’t forget the record overhead, slot array overhead, and that the heap pages won’t be filled completely because of the way the free space caching and searching works). A terabyte of data would store roughly 61 billion rows.

So with 1 million rows per second, you’ll be generating 1 million x 3,600 (seconds in an hour) x 24 (hours in a day) = 86.4 billion rows per day, so you’ll need about 1.4 terabytes of new storage per day. If you’re using the bigint identity as a cluster key, each row needs new space, so you’ll need almost exactly 0.5 petabytes of new storage every year.

At that rate, actually running out of bigint values AND storing them would take roughly 150 thousand petabytes. This is clearly impractical – especially when you consider that storing *just* a bigint is pretty pointless – you’d be storing a bigint and some other data too – probably doubling the storage necessary, at least.

Why is this interesting? We’ve had a number of clients over the years that didn’t consider their data volume and designed a schema using int keys instead of bigint keys. When the inevitable happened and they ran out of int values, the process of changing to a bigint key was quite painful – as there’s no really easy, space and log efficient way to do it once you have the 2 billion rows, and especially if constraints are involved, and application changes need to be made to allow 8-byte values instead of 4-byte values in result sets.

A common stop-gap solution people use when they run out of int values is to just reset the identity seed to -2^31 and then set the increment to 1. As a short-term solution this does work, especially if the int key is a surrogate key and doesn’t have a business meaning, but it’s not ideal as a long term solution as you’ll only run out again once the int key kits -2^31. Ultimately, you’ll need to make the int to bigint change.

Summary: make sure that when you’re designing a new schema, you think through the maximum values required and pick appropriate data types then and there. Changing data types can be very painful once the system has been in production for a while and there’s a lot of data in the schema.

PS If you honestly believe you’ll run out of bigint values, you can use a decimal or numeric value, both of which can hold -10^38 to 10^38+1. Those are really big numbers. 10^ 38 is about 2^129, or 100 undecillion, or 2^64 times more values than a bigint can hold. Using our million-row-per-second server, inserting 10^38 values would take 10^38 / 10^6 = 10^32 seconds = roughly 3,170 billion billion years. Now if you’re concerned about *that*, the sun will have become a red giant and incinerated the Earth in about 5 billion years…

SQLskills SQL101: How can a SELECT cause a database to change?

As Kimberly blogged about recently, SQLskills is embarking on a new initiative to blog about basic topics, which we’re calling SQL101. We’ll all be blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out

This is an interesting misconception that I was asked about last week: (paraphrasing) Surely a SELECT operation can’t cause a database to change, because it’s just reading data, not altering it in any way, right?

Well, no. There are actually quite a few side effects of queries that only read data and never perform data changes (not counting a SELECT … INTO, of course). Here are four that spring to mind…

Statistics Creation

If the database property Auto Create Statistics is set to True, when a query is being compiled and the query optimizer determines that a statistic could be created that would aid the optimization process, it will create that statistic before optimization continues, thus changing the database. Your SELECT statement could cause this to happen.

Statistics Update

If the database property Auto Update Statistics is set to True, when a query is being compiled and a necessary statistic is determined to be out-of-date, it will be automatically updated before optimization continues, thus changing the database. Your SELECT statement could cause this to happen. Additionally, if the Auto Update Statistics Asynchronously property is enabled, the statistic will be automatically updated, but after the optimization process (so the compiling query doesn’t have to wait).

Ghost Cleanup

Ghost cleanup is the funky process for removing deleted records. For all indexes, and for heaps when some form of snapshot isolation is involved, deleting a record just marks it as deleted. After the deleting transaction commits, the deleted record is later removed by a background process called the ghost cleanup task. The interesting thing though is that a deleted record is not immediately entered in the task’s list of things to do. It’s usually not until the *next* use of the data file page that the Storage Engine sees that there’s a deleted record and enters it in the task’s to-do list. So, your SELECT statement could be that ‘next’ use of a data file page with a recently deleted record on that causes the record to be cleaned up by the ghost cleanup task.

Query Store

From SQL Server 2016 onward, if you have the Query Store enabled, by default every query execution will cause some metrics to be captured and stored in the Query Store’s system tables (which you can query with various DMVs). You can stop it capturing information for all queries by setting the QUERY_CAPTURE_MODE to AUTO, which causes it to not capture “insignificant” queries, but you can’t specify what “insignificant” means. Anyway, your SELECT statement could cause this to happen.


So as you can see, just because you’re not deliberately making a change in a database, that doesn’t mean that you won’t trigger something else in SQL Server to make a change. And then there’s the Auto Shrink option for a database, which of course should *never* be enabled! (see here for explanation…)