Avoid Postgres performance cliffs with MultiXact IDs and foreign keys
In episode 55 of 5mins of Postgres we talk about performance cliffs encountered with MultiXact transaction IDs. We’ll also look at foreign key relationships and the Postgres VACUUM and autovacuum process.
Let's have a look!
We'll start with this blog post by Christophe Pettus who writes about a foreign key pathology to avoid, and this involves foreign keys causing MultiXact problems in Postgres.
He starts by describing a system that's kind of like Twitch, where we have a table called stream, a table called user, and a table called stream_viewer. The stream_viewer table has a stream_id and a user_id. The idea is: when somebody starts viewing a stream, we keep track of that in the stream_viewer table.
We are using foreign keys here, which is often good practice in Postgres for consistency reasons. But here they cause a surprising performance problem. Let's say a very popular streamer onboards to the platform and starts streaming. Suddenly, 1.2 million people want to watch that stream (which is probably a small stream on Twitch these days). When these viewers log on and start following the stream, the way this data model is set up, we'll do inserts on the stream_viewer table. Inserting 1.2 million records at the exact same time will, in itself, cause some contention in Postgres.
However, that's not really the issue. The big issue is that all of these records have a foreign key relationship to a single record in the stream table. When you have a foreign key like this, Postgres holds a FOR KEY SHARE lock on the referenced row, in this case that single stream row.
Postgres holds this lock so that the stream row doesn't just disappear whilst we're inserting the stream viewers. That's important to guarantee the consistency of the foreign key relationship. However, the way that Postgres keeps track of these locks is what causes the issue here.
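Roughly speaking, each insert into the referencing table behaves as if the foreign key check also ran a query like the one below against the parent table. This is a simplification of what Postgres's internal referential integrity trigger does, shown here only to illustrate where the FOR KEY SHARE lock comes from:

```sql
-- Approximation of the foreign key check fired by an insert into
-- stream_viewer: take a FOR KEY SHARE lock on the referenced row,
-- so it can't be deleted (or have its key updated) concurrently
SELECT 1 FROM stream WHERE id = $1 FOR KEY SHARE;
```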
When more than one transaction holds a lock on a single record, a single row, the MultiXact system in Postgres handles that. A MultiXact assigns a special transaction ID to the row being locked, and an external data structure keeps track of all the member transaction IDs that are actually referencing that row. The row header is a fixed-size structure, so you can't just keep adding these IDs to the local row information. That's why there's a separate data structure that keeps track of this.
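You can observe this yourself. When two sessions hold a row-level lock on the same row at the same time, the row's xmax field becomes a MultiXact ID rather than a plain transaction ID. A sketch, assuming a stream row with id = 1 exists:

```sql
-- Session 1: BEGIN; SELECT * FROM stream WHERE id = 1 FOR KEY SHARE;
-- Session 2: BEGIN; SELECT * FROM stream WHERE id = 1 FOR KEY SHARE;

-- In a third session, inspect the row's xmax system column:
SELECT xmax FROM stream WHERE id = 1;
-- With both locks held, xmax contains a MultiXact ID whose member
-- transaction IDs live in the separate MultiXact data structure
```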
This generally works quite well, but once a high number of transactions are locking the same rows like this, it can become a big problem: the in-memory cache for the MultiXact data fills up and Postgres has to spill it to secondary storage on disk.
That will be slow, and what you will see is that suddenly a lot of connections wait on these MultiXact lightweight locks, which is what will show up in your wait events, for example if you look at pg_stat_activity.
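A quick way to check for this is to group active sessions by what they're waiting on:

```sql
-- Count active backends by wait event; during a MultiXact pile-up
-- you'd expect many rows with wait_event_type = 'LWLock' and
-- MultiXact-related wait_event values
SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE state = 'active'
 GROUP BY 1, 2
 ORDER BY count(*) DESC;
```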
The way Christophe recommends you deal with this type of situation is to reconsider whether you should have foreign keys at all. You may opt not to enforce these consistency requirements, given this workload, because foreign keys in a situation like this are a problem.
The other thing that Christophe recommends, which I think makes a lot of sense, is that you think about when transaction IDs are assigned. If you are able to batch things together, maybe you can make it one big transaction or several big transactions instead of 1.2 million transactions. Because that's what's causing the problem here – that you're doing single inserts instead of batch inserts.
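For example, instead of one transaction per viewer, an application could buffer new viewers and write them in a single statement. A multi-row VALUES list is one way to sketch this (values are illustrative):

```sql
-- One transaction, one statement, many rows: only a single
-- transaction ID takes the FOR KEY SHARE lock on the parent
-- stream row, instead of 1.2 million separate ones
INSERT INTO stream_viewer (stream_id, user_id)
VALUES (1, 1001),
       (1, 1002),
       (1, 1003);
```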
Now, you may think that this is an obscure problem, but there is a popular post from Nelson Elhage, also referenced in a GitLab post-mortem quite a while ago, where Nelson described how he ran into a problem on a production database where he suddenly saw latency spike and performance behave quite oddly.
First of all, he saw query latency on the database spike while throughput plummeted to essentially nothing. The database essentially seemed to lock up. He looked at pg_stat_activity and other tools and saw LWLock as the wait event type, with MultiXact showing in the wait event. This meant he had a MultiXact problem; he just didn't really know what it was.
After more investigation he discovered the MultiXact behavior described above. An important thing I would add is that VACUUM also plays a role here.
What Nelson notes is that the MultiXact ID that's being assigned is an actual number. Postgres actually has to clean up those numbers that are written to the rows in the table at some point. That occurs with the Postgres VACUUM and autovacuum process.
You can see more about this in the Postgres documentation where it talks about MultiXacts and how the same way that Transaction ID wraparound is a problem, MultiXact ID wraparound is also a problem. You may be surprised to see an anti-wraparound vacuum on your database, and if you're using MultiXacts, it could actually be a MultiXact anti-wraparound, not a regular transaction ID anti-wraparound. It's something to keep a close eye on.
If you know you're using a lot of MultiXact IDs, you can track that the same way you would track a transaction ID wraparound problem. There is also the autovacuum_multixact_freeze_max_age setting, described in the Postgres documentation, which controls when the freezing that happens for MultiXact reasons actually occurs. This is something to watch out for if you're using a lot of row-level locks. In that case, tune some of these settings or change your workload if you can.
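To monitor this, you can compare each database's MultiXact age against autovacuum_multixact_freeze_max_age, which defaults to 400 million:

```sql
-- MultiXact age per database; as multixact_age approaches
-- autovacuum_multixact_freeze_max_age, autovacuum will launch
-- anti-wraparound vacuums to freeze MultiXact IDs
SELECT datname, mxid_age(datminmxid) AS multixact_age
  FROM pg_database
 ORDER BY multixact_age DESC;
```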