5mins of Postgres E7: Autovacuum, dead tuples not yet removable, and the Postgres xmin horizon
Today, we'll talk about autovacuum, dead tuples not yet removable, and what the xmin horizon in Postgres does.
Share this episode: Click here to post to twitter, or sign up for our newsletter and check out the newsletter archive and subscribe to our YouTube channel.
Let's get started!
We'll start with this blog post by Mathieu Tamer from the Precogs team back from 2018. The reason I looked at this blog post again is because it was talking about a production issue that a friend ran into last week.
What that friend had mentioned is that they encountered this
dead tuples not yet removable problem, and it quite puzzled them. We'll start looking at the Precogs blog post first, which gives a bit of a background of what that problem means. In their situation, the team at Precogs encountered a problem with their Amazon RDS database. They saw that really simple queries kept taking a really long time. Queries that were expected to finish fast took five minutes. The whole system was showing slowdowns. They suspected that there was a problem with vacuuming and so they looked at
pg_stat_activity and they saw that one of their tables, the opportunities table, was vacuuming a lot. Every 10 minutes that table was being vacuumed!
When they turned on more logging they also learned that that vacuum itself took 10 minutes. The vacuum ran for 10 minutes and then finished, but then started again a new vacuum. It kept continuously vacuuming the exact same opportunities table. What autovacuum is supposed to do is that it avoids tables growing unnecessarily. If a table does not have autovacuum running, it will have to keep growing to store new data somewhere because if data is being deleted then it can not be reused unless autovacuum runs.
A second good best practice that they referenced in their article is that if your autovacuum takes a long time it's oftentimes because it doesn't run often enough.
So, running autovacuum more often is generally a good thing if the problem is that your vacuum runs too long. They did tune various autovacuum settings, but what they realized is it didn't work at all. It did not help. Changing all these settings did not resolve their issue. The database was still slow and they still had this continuously running autovacuum on that one table.
Now, they pulled in some additional help and they looked at the autovacuum log output.
When you have autovacuum problems, look at the logs, they can tell you a lot.
In this particular case, they did notice that autovacuum couldn't remove dead tuples while it was working. There is a specific section where it says how many tuples were removed and how many tuples are remaining in the table. Tuples are kind of like rows, you could say. Row versions. In their case, 26,000 tuples were dead but not yet removable. Zero removed tuples meant autovacuum didn't do anything really.
Now, what does that mean, "not yet removable"?
Not yet removable means that autovacuum was not able to clean it up because the data that the tuple contained was still visible to an open transaction. There was still something open in the system that could have looked at that tuple. Even though it was deleted there was still something that had a right to look at the tuple.
In this particular situation, what they found was that the ORM, ActiveRecord, kept some really long operations running in a transaction. Their application code contained long blocks in Ruby, essentially. They removed some of them or reduced the length of these transactions. They also reduced the number of updates. Both of these changes in this situation resolved the problem. Now, what they did in this blog post with solving long running transactions, it doesn't always help. And when I was talking with that friend of mine about their production issue they actually said that they followed all these guidelines, but they still didn't have vacuum running successfully.
Now, there's this other blog post by Laurenz Albe from Cybertec that goes into more details of why autovacuum cannot remove dead rows from the table. Funnily enough, it's also from 2018! Apparently that's a good year for Postgres blog posts.
In his blog post, he describes that row versions cannot be removed yet. In his article you can see that in
vacuum verbose it is actually more visible with the oldest
xmin showing. You can go look for which object in the system is holding back the vacuum. The reason that the vacuum cannot remove those rows, as mentioned earlier, is because it could still be that that process that has that old
xmin still has the rights to see those tuples, until that operation finishes. This is what you call the
xmin horizon, which is essentially the oldest transaction ID that's still running and has rights to see these tuples so they can't be removed. There are three things that can hold back transactions.
The first thing is long running transactions. A long running transaction keeps open the xmin horizon, from the transaction start. If you run this query, that helps you find these long running transactions.
The second thing and I've personally had forgotten that this was a thing is that replication slots also hold back the
xmin horizon. You wouldn't see it if you look at
pg_stat_activity. But you would have to check
pg_replication_slots, and that's going to show you the
xmin for that specific replication slot. This replication slot is the reason why the
xmin horizon cannot advance. And this was actually the problem that that friend of mine had: they had a leftover replication slot from their HA tooling and that's why it wasn't successfully autovacuuming their tables.
There's also a situation with prepared transactions. If you use two phase commit, it’s really important to clean up prepared transactions, either commit them or roll them back because if they're left open, then they too can hold back the
xmin horizon and stop autovacuum.
Remember those three things. If you have problems with autovacuum and "dead tuples not yet removable", make sure to
- check your long running transactions
- your replication slots and
- your prepared transactions.
Thank you so much for listening. This was five minutes of Postgres. Subscribe to our YouTube channel if you want to hear about new episodes. I'll talk to you next week!