5mins of Postgres E12: The basics of tuning VACUUM and autovacuum
Today, we're going to talk about tidying up with VACUUM. Now, as the Postgres Weekly newsletter put it, we can't go a month without covering VACUUM in some way or another. If you recall, a few weeks ago we last talked about "dead tuples not yet removable", a very specific VACUUM problem that causes VACUUM to stall and not make progress. You can find the episode about this here: 5mins of Postgres E7: Autovacuum, dead tuples not yet removable, and the Postgres xmin horizon
Let's have a look at today's episode!
This post by Shaun Thomas is a high-level view of all the settings that are relevant for tuning autovacuum. I would highly recommend it if you are not yet familiar with the most important knobs you can set in Postgres. Shaun covers a lot of details here. Let's dive in!
He first talks about releasing the throttle. The main aspect to consider here is that autovacuum operates at a particular speed. Because autovacuum is a background process, the idea is that it shouldn't slow down your database whilst it's running.
Postgres historically has been too conservative about this. Postgres is quite old by now, and it was initially designed for much slower hardware. Postgres 12 actually made a change in this regard: it lowered the default of
autovacuum_vacuum_cost_delay from 20 milliseconds to 2 milliseconds, which lets autovacuum do roughly 10 times as much work. The old default effectively limited autovacuum to less than 10 megabytes of disk reads per second.
If you're on Postgres 12 or newer, you probably don't need to tweak this, but if you're on an older Postgres version, then I would highly recommend lowering this to the Postgres 12 default, which is 2 milliseconds instead of 20 milliseconds, to have your VACUUMs complete faster.
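On an older version, one way to apply the newer default server-wide is a sketch like this (the value shown is the Postgres 12+ default):

```sql
-- Shorten the per-round sleep so autovacuum can do roughly 10x more work.
-- 2ms is the Postgres 12+ default; picked up on a configuration reload.
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
SELECT pg_reload_conf();
```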
Additionally, Shaun recommends looking at the
autovacuum_vacuum_cost_limit setting, which controls how much work autovacuum is allowed to do before each cost delay kicks in. I've personally not tuned this as often, but it's certainly something that you can consider.
Overall, as you make these changes, watch your IO performance. If you get this wrong, what you will see is that VACUUM takes too much IO and then your queries might be stalled because there's not enough capacity available!
Autovacuum has a certain number of workers that it uses to perform its job, controlled by the autovacuum_max_workers setting (the default is 3). Each worker works on one table at a time. If you have a very small number of tables, you probably never need to change this, because there's no parallelization that could happen here. The reason you might want to increase it is if you have a lot of tables, and we're talking thousands or tens of thousands of tables.
This is a setting that applies to your whole database server. If you have a lot of databases, it also makes sense to increase this setting. The way to monitor this is to look at either
pg_stat_progress_vacuum or pg_stat_activity; those views show you how many vacuums are running right now.
If you often see the worker count fully exhausted, then it makes sense to increase this setting.
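A quick way to check this is to compare the vacuums currently in progress against the configured maximum, for example:

```sql
-- One row per VACUUM (including autovacuum workers) currently running
SELECT p.pid, p.relid::regclass AS table_name, p.phase
FROM pg_stat_progress_vacuum p;

-- The configured worker limit to compare against
SHOW autovacuum_max_workers;
```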
Thresholds define when autovacuum runs. The main reason why autovacuum kicks off on a table is the
autovacuum_vacuum_scale_factor. This scale factor says that at least this percentage of the table has to have changed for autovacuum to run.
The default here is 20%, or 0.2, which means that until 20% of the table has changed and there are 20% dead rows, autovacuum doesn't kick off. If you have a very large table, that can sometimes be a problem because it takes a long time to reach this number. There's an additional setting called the
autovacuum_vacuum_threshold, which is additive to the scale factor. Oftentimes you wouldn't necessarily tweak this if you are using the scale factor, but there's a neat trick here: you can set the scale factor to zero and then set the threshold to a particular number.
Instead of using a percentage-based triggering mechanism, autovacuum then triggers based on an absolute number of changed rows. In a large table, if most of the table never changes but there's a small portion of the table that's being updated, this can help you make sure that VACUUM runs.
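Putting the two settings together, autovacuum considers a table for vacuuming roughly when the dead row count crosses the combined threshold (the base threshold defaults to 50):

```
n_dead_tup > autovacuum_vacuum_threshold
           + autovacuum_vacuum_scale_factor * reltuples
```

With the default scale factor of 0.2 this is dominated by table size; with a scale factor of zero, only the absolute threshold matters.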
Important to note: all of these settings are tunable on a per-table basis. Especially in this example of a large table, I wouldn't want to modify the server-wide setting; I would want to modify the setting on that specific table.
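As a sketch, the per-table version of this trick might look like the following (the table name and threshold value are illustrative, not from the post):

```sql
-- Hypothetical example: trigger autovacuum on an absolute dead-row count
-- instead of a percentage, for one large table only.
ALTER TABLE my_large_table SET (
  autovacuum_vacuum_scale_factor = 0,
  autovacuum_vacuum_threshold    = 100000  -- vacuum after ~100k dead rows
);
```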
Last but not least, there's Transaction ID wraparound to consider. Postgres periodically runs a special kind of vacuum to freeze old rows so that transaction IDs become reusable, and so that you never reach the point of Transaction ID wraparound, which is ultimately where your database would shut down because no new transaction IDs can be assigned.
The setting that may be worth tuning here is autovacuum_freeze_max_age, and the reason you would want to tune it is that the default is a rather low 200 million. Once a table's oldest unfrozen transaction ID reaches that age, Postgres kicks off a more aggressive, anti-wraparound way of running VACUUM, and it will essentially VACUUM all your tables, even if they haven't changed enough to cross the normal thresholds. That can cause a lot of churn in VACUUM activity which you might not want.
Shaun has personally managed a database that did 30,000 transactions per second, a billion transactions during a busy eight-hour day! At the default of 200 million, they would have ended up vacuuming all the tables five times throughout the day, which is pretty inefficient.
That's why it makes sense to raise this on a busy database. If you do raise it, don't raise it above a billion, and be careful not to get too close to the point where wraparound protection would shut the database down, which sits at around two billion transactions.
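One way to keep an eye on this is to monitor how close each database is to the freeze limit, for example:

```sql
-- Age of the oldest unfrozen transaction ID per database.
-- Compare against autovacuum_freeze_max_age (default 200 million)
-- and the ~2 billion wraparound limit.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```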