Handling integer sequence overflow without downtime in Postgres
In episode 58 of 5mins of Postgres we're going to talk about handling integer sequence overflow in Postgres without downtime. We show how you can avoid integer overflow and how you can fix it once you've run into it.
Transcript
Let's jump in!
What happens when a sequence reaches max value in Postgres?
In this blog post, Jesse Soyland from the Crunchy Data team describes how they've helped their customers navigate problems around integer overflow.
What we mean by integer overflow is that we have a data type like a four byte integer, which has a maximum of roughly 2.1 billion values (2,147,483,647). Once you have a sequence that keeps counting up and it reaches that 2.1 billion mark, it will start erroring out.
You might see one of two errors:
- The "nextval: reached maximum value of sequence" error. This is when the sequence itself cannot assign more values because the sequence maximum has been reached.
- The "integer out of range" error. This will happen when a sequence could technically give you more values, but the data type in the physical table is too small. For example, again, you've reached that 2.1 billion mark, and the next value that would be assigned would go outside of the four byte range for the integer.
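As a quick illustration of the first error, you can reproduce it in a scratch database with a deliberately tiny sequence (the sequence name is made up for the demo):

```sql
CREATE SEQUENCE demo_seq MAXVALUE 3;
SELECT nextval('demo_seq');  -- 1
SELECT nextval('demo_seq');  -- 2
SELECT nextval('demo_seq');  -- 3
SELECT nextval('demo_seq');
-- ERROR:  nextval: reached maximum value of sequence "demo_seq" (3)
```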
Before we jump into how to fix the situation, let's take a step back and see how we can avoid running into this in the first place.
How to avoid integer sequence overflow in Postgres
In his blog post, Jesse shows a useful query. It checks the database for this type of problem: it looks at the last sequence values using the pg_sequence_last_value function, and then asks: "is the last value close to the maximum of that particular data type?" Jesse's query produces two outputs:
- A sequence percent, which shows how close the last value is to the maximum of the sequence itself.
- A column percent, which shows how close the last value is to the maximum of the physical data type of the column in the table.
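As a rough illustration, here is a simplified check along those lines. It is not Jesse's exact query: it assumes Postgres 10 or newer (for the pg_sequences view) and that the columns in question are four byte integers.

```sql
-- Simplified sketch: how close is each sequence to its own maximum, and to
-- the int4 maximum? For a real column percent you would additionally look
-- up the owning column's actual data type.
SELECT
  schemaname,
  sequencename,
  last_value,
  round(100.0 * last_value / max_value, 2)  AS sequence_percent,
  round(100.0 * last_value / 2147483647, 2) AS int4_column_percent
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY sequence_percent DESC;
```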
How to fix Postgres integer overflow
Let's say this happens to you and you're either at 93%, or maybe you're at 99%. I've had issues in the past myself where I was in a car, suddenly got paged, and had to take my laptop out - I wasn't driving - and fix something, because production was down: new integer values could no longer be assigned. So you see, this is very much a problem that can happen in real life and in real world situations!
Using negative numbers
When this happens to you, when you're in a pinch, the best fast fix that I've seen is using negative numbers. Admittedly, this is a bit of a hack. The sequence usually counts up. What you're doing instead is telling the sequence to count down: the next value becomes minus one, and from there it keeps counting down towards negative 2.1 billion. In most cases you can just run this one command and the problem will be at least - in that very moment - fixed, and your application will be online again.
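To make this concrete, here is a minimal sketch of what that one command could look like. The sequence name is hypothetical, and you should double-check the bounds against your own schema:

```sql
-- Emergency hack (hypothetical sequence name): let the sequence hand out
-- -1, -2, ... down to the int4 minimum, which buys roughly another
-- 2.1 billion values without touching the table.
ALTER SEQUENCE orders_id_seq
  MINVALUE -2147483648
  INCREMENT BY -1
  RESTART WITH -1;
```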
Of course, the downside is that you suddenly have negative values as your IDs. So if, downstream in your application, you're parsing integers and you're not expecting the sign in front of the number, that could be a problem. It might also not look aesthetically pleasing in your URL structure, for example. So, it's not great. But it's the fastest way to fix this right away.
Using ALTER COLUMN
How do we actually fix this, for real? If you did ALTER COLUMN to just change the data type, that would rewrite the whole table. There's no way to do that online: you would hold an exclusive lock on the table whilst that rewrite is happening.
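For reference, this is what the straightforward but locking fix looks like (table and column names are hypothetical):

```sql
-- Correct, but rewrites the whole table under an ACCESS EXCLUSIVE lock,
-- so it is not an online change on a large table.
ALTER TABLE orders
  ALTER COLUMN id TYPE bigint;
```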
Now, the one thing I want to point out that's different from what Jesse's article describes: if your problem is the sequence itself - not the data type on the table, but just the sequence maximum value - you can change this on the sequence directly. You could raise the range of the sequence and then fix the data type of the table separately. Essentially, the sequence is a generator of new numbers, and the integer or bigint on the table is separate. You don't necessarily have to make a new sequence.
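For example, if the sequence was created with an integer data type or an explicit cap, something along these lines can raise its range in place (hypothetical sequence name, Postgres 10+ syntax for the AS clause):

```sql
-- Raise the sequence's own range; the table column's data type still has
-- to be handled separately if it is a four byte integer.
ALTER SEQUENCE orders_id_seq
  AS bigint
  MAXVALUE 9223372036854775807;
```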
Using a new column
In terms of the data type: Jesse describes an approach using a new column, backfilling that column, and then applying the change. Once you've created and backfilled your new column, you can rename the existing column to something like "id_old" and rename the new "id_new" column over it, so that to the application it looks the same. If you're not down yet, this doesn't cause downtime, because it avoids the long exclusive lock of a full table rewrite.
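Here is a rough sketch of how that could look. All object names are hypothetical, and this glosses over details like keeping new rows in sync while backfilling (often done with a trigger or an application change) and rebuilding the primary key:

```sql
-- 1. Add the new bigint column.
ALTER TABLE orders ADD COLUMN id_new bigint;

-- 2. Backfill existing rows in batches (repeat for each id range until done).
UPDATE orders SET id_new = id WHERE id BETWEEN 1 AND 100000 AND id_new IS NULL;

-- 3. Switch over in one short transaction.
BEGIN;
ALTER TABLE orders RENAME COLUMN id TO id_old;
ALTER TABLE orders RENAME COLUMN id_new TO id;
ALTER TABLE orders ALTER COLUMN id SET DEFAULT nextval('orders_id_seq');
COMMIT;
```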
Is integer sequence overflow actually a realistic problem?
There are a few other things I want to mention - two quick notes on the question: "Is this actually going to happen in real life?" There are two articles I want to highlight.
First, this article by Jake Swanson from 2017. In it, he describes a particular situation they ran into that was essentially "the night the Postgres IDs ran out". Alarms went off at 10:50 PM, and then they're like, "huh, we haven't created a new row since 10:30" - for about 20 minutes. What's going on? It turns out they ran into this exact issue back then, and he describes how to fix it, with either ALTER COLUMN, or having a separate table and writing into that.
The other case where you may see this problem is when you have an old Rails application. Before version 5.1, Rails defaulted to integer primary keys, not bigint. This was a big issue for people: you started out with Rails, which gave you the default of a regular four byte integer primary key, and then your application got popular. The problem was that you would run out of sequence values - you'd run out of the integer range. This has been fixed as of Rails 5.1, so modern Rails applications shouldn't suffer from that. But if you have an old schema, you may see that problem because of it.
Lastly, this is another post that talks about the exact same problem. It is by Radan from the Silverfin team. He mentions one important aspect that I haven't seen in the other posts: you also have to think about foreign keys. If you have other tables that reference your primary table through a foreign key relationship, you have to update those data types as well. What Radan describes is an approach using triggers, which I think is a great way to solve this: while you're writing the foreign key id, you're also writing to the new column, and then you switch it over.
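As a sketch of that trigger idea (all names hypothetical, Postgres 11+ syntax for EXECUTE FUNCTION): while the new bigint foreign key column on a referencing table is being backfilled, a trigger keeps it in sync for newly written rows.

```sql
-- New bigint column on the referencing table.
ALTER TABLE order_items ADD COLUMN order_id_new bigint;

-- Mirror the old foreign key value into the new column on every write.
CREATE OR REPLACE FUNCTION order_items_sync_order_id()
RETURNS trigger AS $$
BEGIN
  NEW.order_id_new := NEW.order_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_items_sync_order_id
  BEFORE INSERT OR UPDATE ON order_items
  FOR EACH ROW
  EXECUTE FUNCTION order_items_sync_order_id();
```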
This is a strategy that, again, only works if you're not yet in a pinch. The negative hack is the only way to get out of it at that moment. But if you know you're getting close to it, then applying the approach with a second column is the best way to solve this.
Thanks for checking out episode 58 of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!