SQLskills SQL101: Running out of ints and bigints

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

We’re teaching a class this week, and one topic that always comes up is using an int identity as a clustering key and the possibility of running out of integers. Depending on your insert volume, this might be quite likely, as an int can only store 2^32 (^ = ‘to the power’) or about 4 billion values, between -2^31 and 2^31-1.

Imagine that you have a theoretical system that can create a thousand data rows per second. Using an int identity value increasing by 1 and starting at 1, you’ll run out of values when the value hits 2^31-1 and tries to insert the next value. Let’s simplify the math by just saying that 2^31 is the limit. With a thousand values per second, that would mean 2^31 / 1,000 = 2.15 million seconds or just under 25 days. While many of you don’t sustain 1,000 rows per second, this is still a very problematic limitation.
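
If you want to sanity-check the arithmetic yourself, here’s a quick T-SQL sketch (purely for illustration, using the simplified 2^31 upper bound):

  -- Rough math: seconds and days until 2^31 values are used up at a
  -- sustained 1,000 inserts per second
  DECLARE @IntRange BIGINT = 2147483648;   -- 2^31, simplified upper limit
  SELECT
      @IntRange / 1000           AS [SecondsToExhaustion],   -- ~2.15 million
      @IntRange / 1000 / 86400.0 AS [DaysToExhaustion];      -- just under 25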

One solution is to use a bigint identity as the key. This can store 2^64 or about 18.5 quintillion (18.5 billion billion) values, between -2^63 and 2^63-1.

Every so often someone asks whether it’s possible to run out of bigint values. My answer is no. Well, technically yes, there is a limit, but in practical terms the answer is no.

Now imagine that you have a theoretical system that can create a million data rows per second, with a bigint identity value increasing by 1 and starting at 1. You’ll run out of values when the value hits 2^63-1 and tries to insert the next value. With a million values per second, that would mean 2^63 / 10^6 = 9.2 trillion seconds or approximately 292.5 thousand years. And by then it’s someone else’s problem… :-) And that’s only for half the possible range of bigint values.

Now what about the storage for those values? Doing a quick test of a heap with a single bigint identity column shows me that I can get 453 rows per 8KB data file page (don’t forget the record overhead, slot array overhead, and that the heap pages won’t be filled completely because of the way the free space caching and searching works). A terabyte of data would store roughly 61 billion rows.
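
Here’s roughly how to reproduce that test (the table name is just an example, and your exact rows-per-page figure may vary a little):

  -- A heap with a single bigint identity column, filled with enough rows to
  -- span plenty of pages (GO 100000 repeats the batch in SSMS/sqlcmd)
  CREATE TABLE dbo.BigintRowTest (c1 BIGINT IDENTITY);
  GO
  INSERT INTO dbo.BigintRowTest DEFAULT VALUES;
  GO 100000

  -- Average number of rows per page in the heap
  SELECT
      record_count,
      page_count,
      1.0 * record_count / page_count AS [AvgRowsPerPage]
  FROM sys.dm_db_index_physical_stats (
      DB_ID (), OBJECT_ID (N'dbo.BigintRowTest'), 0, NULL, 'DETAILED');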

So with 1 million rows per second, you’ll be generating 1 million x 3,600 (seconds in an hour) x 24 (hours in a day) = 86.4 billion rows per day, so you’ll need about 1.4 terabytes of new storage per day. If you’re using the bigint identity as a cluster key, each row needs new space, so you’ll need almost exactly 0.5 petabytes of new storage every year.

At that rate, actually running out of bigint values AND storing them would take roughly 150 thousand petabytes. This is clearly impractical – especially when you consider that storing *just* a bigint is pretty pointless – you’d be storing a bigint and some other data too – probably doubling the storage necessary, at least.

Why is this interesting? Over the years we’ve had a number of clients that didn’t consider their data volume and designed a schema using int keys instead of bigint keys. When the inevitable happened and they ran out of int values, the process of changing to a bigint key was quite painful: there’s no easy, space- and log-efficient way to do it once you have two billion rows, it gets even worse if constraints are involved, and application changes are needed to handle 8-byte values instead of 4-byte values in result sets.
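
To give a sense of why it hurts, here’s the naive change (the table and column names are hypothetical): on a table with two billion rows this rewrites every row, is fully logged, and holds a schema-modification lock for the duration, and if the column is part of a primary key, foreign keys, or indexes, those have to be dropped and recreated around it.

  -- Naive int-to-bigint change (hypothetical names): a size-of-data operation
  ALTER TABLE dbo.MyBigTable ALTER COLUMN MyKey BIGINT NOT NULL;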

A common stop-gap solution people use when they run out of int values is to reset the identity seed to -2^31, so that new values climb up through the negative range towards zero. As a short-term solution this does work, especially if the int key is a surrogate key and doesn’t have a business meaning, but it’s not a long-term solution, as you’ll run out again once the negative range is exhausted and the values reach the existing positive keys. Ultimately, you’ll need to make the int to bigint change.
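
For reference, the reseed itself is a one-liner (the table name is hypothetical):

  -- Restart the identity at the bottom of the int range (stop-gap only)
  DBCC CHECKIDENT (N'dbo.MyBigTable', RESEED, -2147483648);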

Summary: make sure that when you’re designing a new schema, you think through the maximum values required and pick appropriate data types then and there. Changing data types can be very painful once the system has been in production for a while and there’s a lot of data in the schema.

PS If you honestly believe you’ll run out of bigint values, you can use a decimal or numeric value instead, both of which can hold -10^38+1 to 10^38-1. Those are really big numbers. 10^38 is about 2^126, or 100 undecillion, and the range holds roughly 2^63 times more values than a bigint can. Using our million-row-per-second server, inserting 10^38 values would take 10^38 / 10^6 = 10^32 seconds, or roughly 3 x 10^24 (three million billion billion) years. Now if you’re concerned about *that*, the sun will have become a red giant and incinerated the Earth in about 5 billion years…
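
If you do go that route, the identity property works on decimal and numeric columns with a scale of zero. A minimal sketch (hypothetical table name) is below; bear in mind that a decimal (38, 0) value takes 17 bytes of storage versus 8 bytes for a bigint.

  -- Identity on a decimal (38, 0) column, for a truly enormous key range
  CREATE TABLE dbo.TrulyHugeKeys (
      ReallyBigKey DECIMAL (38, 0) IDENTITY (1, 1) NOT NULL
  );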

SQLskills SQL101: Readable secondary performance problems

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Yesterday I blogged about log shipping performance issues and mentioned a performance problem that can be caused by using availability group readable secondaries, and then realized I hadn’t blogged about the problem, only described it in our Insider newsletter. So here’s a post about it!

Availability groups (AGs) are pretty cool, and one of the most useful features of them is the ability to read directly from one of the secondary replicas. Before, with database mirroring, the only way to access the mirror database was through the creation of a database snapshot, which only gave a single, static view of the data. Readable secondaries are constantly updated from the primary so are far more versatile as a reporting or non-production querying platform.

But I bet you didn’t know that using this feature can cause performance problems on your primary replica.

As with most things in life, you don’t get anything for free. Readable secondaries are really useful, but there is a performance trade-off you need to be aware of. All queries that are executed against a readable secondary are automatically run using snapshot isolation, regardless of the isolation level or locking hints specified. This means they do not require share locks and so will not block any database changes being replayed from the primary replica (i.e. the constant redo of log records on the secondary replica that have been sent from the primary replica).
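
As a quick illustration (the table name is made up), even an explicit isolation level or locking hint on the secondary doesn’t change this behavior:

  -- Run against the readable secondary: the session is mapped to snapshot
  -- isolation and the locking hint is ignored, so no share locks are taken
  SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
  SELECT COUNT (*) FROM dbo.SalesOrders WITH (HOLDLOCK);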

To do this requires the use of the versioning system, where (simplistically) pre-change versions of records are copied into the version store in tempdb and queries work out which version of the record is the correct one for them to process, based on the query’s starting time. All records that change get a 14-byte tag added on the end of the record that allows a query to see if this is the correct record, and if not to follow a pointer to the previous version of the record in the version store. This has been the mechanism since snapshot isolation and read-committed snapshot isolation were introduced in SQL Server 2005.
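
If you’re curious how much version store space this generates on the secondary, a quick check (run in the context of tempdb) is:

  -- Approximate size of the version store in tempdb, in MB
  SELECT SUM (version_store_reserved_page_count) * 8 / 1024 AS [VersionStoreMB]
  FROM sys.dm_db_file_space_usage;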

Now consider this: all AG replicas are exact copies of the primary replica. So how can versioning work on the readable secondary, adding 14-byte tags to some records? That must break the ‘exact copy’ rule, right?

Well, yes, it would… if the primary replica didn’t also change.

When a readable secondary is configured in an AG environment, all changing records on the primary replica start getting empty 14-byte versioning tags added to them. This is so that the extra 14 bytes of space on the record are noted in the transaction log and replayed on the secondary replicas, allowing the readable secondary to make use of the empty 14-byte space to store the versioning tag it needs.

This doesn’t break the ‘exact copy’ rule because the 14 bytes aren’t used for anything to do with recovery; they just have to be there.

So versioning tags start getting added to changing records on the primary (to be clear, this doesn’t turn versioning on for the primary), and table and index records start to get 14 bytes longer. And what happens when records get longer on pages where there isn’t enough space? Page splits in your indexes (and forwarded records in heaps, but I’ll concentrate on indexes here), leading to low page densities (wasted disk space and buffer pool memory), logical fragmentation (poor scan performance), and a bunch of extra, expensive log record generation from the page splits themselves.

To counteract this, you’ll need to implement (and/or possibly lower existing) fill factors on your indexes and even potentially start doing index maintenance on indexes that may not have required it previously. Quite an insidious problem that can be hard to figure out unless you know what’s going on under the covers!
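
To make that concrete, here’s a sketch with hypothetical table and index names: rebuild with a slightly lower fill factor to leave room for the 14-byte tags, then keep an eye on page density and fragmentation over time.

  -- Leave 10% free space per leaf page for the versioning tags
  ALTER INDEX [IX_MyTable_SomeKey] ON dbo.MyTable
      REBUILD WITH (FILLFACTOR = 90);

  -- Watch fragmentation and page density afterwards
  SELECT index_id, avg_fragmentation_in_percent, avg_page_space_used_in_percent
  FROM sys.dm_db_index_physical_stats (
      DB_ID (), OBJECT_ID (N'dbo.MyTable'), NULL, NULL, 'SAMPLED');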

For more info, see this MSDN page for general information, and this whitepaper from Microsoft, which explains in more depth the various performance impacts of using readable secondaries: AlwaysOn Solution Guide: Offloading Read-Only Workloads to Secondary Replicas.

If you’re implementing readable secondaries in your AG configuration, make sure that you also investigate and implement index fill factors in the database so that the versioning tags that are added under the covers don’t start causing page splits and fragmentation.

SQLskills SQL101: Log shipping performance problems

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

One question I’m asked regularly is this: “When our log shipping secondary is applying log backups, sometimes it takes a lot longer than usual. Any idea why this might be the case?”

Log shipping has been around forever, and it’s still a hugely applicable and useful feature for very simply maintaining one or more secondary copies of a database. You can also use a secondary copy for reporting, where the restore of the log backup uses the WITH STANDBY option, leaving the secondary database in an accessible, but read-only state (when the logs aren’t being applied).
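
For example, a standby restore looks something like this (the database and file names are made up), with the undo file specified as part of the WITH STANDBY option:

  -- Restore a log backup in standby mode, keeping the undone log records
  -- in the specified undo file so the database stays readable
  RESTORE LOG [SalesDB]
  FROM DISK = N'D:\LogBackups\SalesDB_log_1.trn'
  WITH STANDBY = N'D:\LogBackups\SalesDB_undo.dat';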

Restoring a log backup using WITH STANDBY works as follows:

  1. Make sure all users are disconnected from the secondary database
  2. Write all the log records from the log backup into the secondary database’s log file
  3. Perform the REDO part of recovery (ensuring that all operations from committed transactions are present in the secondary database)
  4. Perform the UNDO part of recovery (ensuring that all operations from uncommitted transactions are not present in the secondary database)

Step 4 writes all the log records generated by the UNDO operations into a special file called the undo file, leaving the secondary database transactionally consistent and in a read-only state that users can access. The reason the log records are written into the undo file is so that the transaction log of the secondary database is not altered in any way, allowing subsequent log backups to be restored. If this weren’t the case, the UNDO log records would advance the secondary database’s LSN (Log Sequence Number), meaning that subsequent log backup restore operations would fail.

When the restore process begins on the secondary database, if an undo file exists, there is another step that is performed before steps 2-4 above. This additional step needs to take all the log records in the undo file and undo the effects of them – essentially putting the secondary database back into the state as of the end of step 3 from the previous restore. This database state is the same as if the previous log backup had been restored using WITH NORECOVERY instead of WITH STANDBY.

The occasional long-running restore problem happens when a log backup is restored that contains a long-running transaction that does not commit before the end of the log backup. This means that it must be completely undone as part of restoring the log backup (step 4), resulting in a very large undo file. This in itself can make restoring a log backup take a lot longer than usual. When the next log backup is restored, the additional step that undoes all the log records in the undo file has a very large undo file to process and takes much, much longer than usual. And if the log backup being restored also has an uncommitted, long-running transaction then it’s the perfect storm, as step 4 will also take a long time. These steps are all made even longer still if the log file has too many VLFs (called VLF fragmentation).

The situation where I’ve seen this most often is when the primary database is undergoing index maintenance and a log backup finishes near the end of a very long-running index rebuild operation of a large clustered index. The initial restore of that log backup on the secondary database takes much longer than usual to complete because of step 4 in the restore process. The next log backup on the primary also completes just before an index rebuild completes. When it is restored on the secondary, the whole of the large undo file has to be undone again, then the log restore occurs, and then another large undo file is generated to undo the second uncommitted index rebuild.

This is a possibility you have to be aware of if the secondary database must be available 24×7 for reporting, with only minimal downtime when each log backup is restored. In that case I would carefully augment the index maintenance operations on the primary with log backups to ensure that only complete, committed index rebuilds are present in the log backups being restored on the secondary database. Similar precautions should be taken if you have other, occasional, long-running operations.
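
One way to do that, sketched below with made-up names, is to bracket each large rebuild with its own log backups so that no shipped log backup ends while the rebuild transaction is still open (this assumes the regular log backup job won’t fire in the middle of the rebuild):

  BACKUP LOG [SalesDB] TO DISK = N'D:\LogBackups\SalesDB_log_pre_rebuild.trn';
  GO
  ALTER INDEX [CIX_BigTable] ON dbo.BigTable REBUILD;
  GO
  BACKUP LOG [SalesDB] TO DISK = N'D:\LogBackups\SalesDB_log_post_rebuild.trn';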

An alternative would be to move from log shipping to database mirroring or availability groups, where the log records are continually being sent from the principal to the mirror database (or primary to secondary replica databases, in availability group terms) and there are no extra steps involving undoing log operations multiple times. With database mirroring, the drawback of this is that reporting would have to use database snapshots, so there’s a complexity trade-off involved. With availability groups, the reporting would have to use a readable secondary, which can lead to index fragmentation on the primary replica, but that can be compensated for with index fill factors (see here for more details).

So there you have it. Another example where understanding how SQL Server performs common operations can make it much easier to diagnose performance problems.