As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.
We’re teaching a class this week, and one topic that always comes up is using an int identity as a clustering key and the possibility of running out of integers. Depending on your insert volume, this might be quite likely, as an int can only store 2^32 (^ = ‘to the power’) or about 4 billion values, between -2^31 and 2^31-1.
Imagine that you have a theoretical system that can create a thousand data rows per second. Using an int identity value increasing by 1 and starting at 1, you’ll run out of values when the value hits 2^31-1 and tries to insert the next value. Let’s simplify the math by just saying that 2^31 is the limit. With a thousand values per second, that would mean 2^31 / 1,000 = 2.15 million seconds or just under 25 days. While many of you don’t sustain 1,000 rows per second, this is still a very problematic limitation.
One solution is to use a bigint identity as the key. This can store 2^64 or about 18.5 quintillion (18.5 billion billion) values, between -2^63 and 2^63-1.
Every so often someone asks whether it’s possible to run out of bigint values. My answer is no. Well, technically yes, there is a limit, but in practical terms the answer is no.
Now imagine that you have a theoretical system that can create a million data rows per second, with a bigint identity value increasing by 1 and starting at 1. You’ll run out of values when the value hits 2^63-1 and tries to insert the next value. With a million values per second, that would mean 2^63 / 10^6 = 9.2 trillion seconds or approximately 292.5 thousand years. And by then it’s someone else’s problem… :-) And that’s only for half the possible range of bigint values.
Now what about the storage for those values? Doing a quick test of a heap with a single bigint identity column shows me that I can get 453 rows per 8KB data file page (don’t forget the record overhead, slot array overhead, and that the heap pages won’t be filled completely because of the way the free space caching and searching works). A terabyte of data would store roughly 61 billion rows.
So with 1 million rows per second, you’ll be generating 1 million x 3,600 (seconds in an hour) x 24 (hours in a day) = 86.4 billion rows per day, so you’ll need about 1.4 terabytes of new storage per day. If you’re using the bigint identity as a cluster key, each row needs new space, so you’ll need almost exactly 0.5 petabytes of new storage every year.
At that rate, actually running out of bigint values AND storing them would take roughly 150 thousand petabytes. This is clearly impractical – especially when you consider that storing *just* a bigint is pretty pointless – you’d be storing a bigint and some other data too – probably doubling the storage necessary, at least.
Why is this interesting? We’ve had a number of clients over the years that didn’t consider their data volume and designed a schema using int keys instead of bigint keys. When the inevitable happened and they ran out of int values, the process of changing to a bigint key was quite painful – as there’s no really easy, space and log efficient way to do it once you have the 2 billion rows, and especially if constraints are involved, and application changes need to be made to allow 8-byte values instead of 4-byte values in result sets.
A common stop-gap solution people use when they run out of int values is to just reset the identity seed to -2^31 and then set the increment to 1. As a short-term solution this does work, especially if the int key is a surrogate key and doesn’t have a business meaning, but it’s not ideal as a long term solution as you’ll only run out again once the int key kits -2^31. Ultimately, you’ll need to make the int to bigint change.
Summary: make sure that when you’re designing a new schema, you think through the maximum values required and pick appropriate data types then and there. Changing data types can be very painful once the system has been in production for a while and there’s a lot of data in the schema.
PS If you honestly believe you’ll run out of bigint values, you can use a decimal or numeric value, both of which can hold -10^38 to 10^38+1. Those are really big numbers. 10^ 38 is about 2^129, or 100 undecillion, or 2^64 times more values than a bigint can hold. Using our million-row-per-second server, inserting 10^38 values would take 10^38 / 10^6 = 10^32 seconds = roughly 3,170 billion billion years. Now if you’re concerned about *that*, the sun will have become a red giant and incinerated the Earth in about 5 billion years…