Never run VACUUM FULL: How to run pg_repack on Amazon RDS and Aurora

By Lukas Fittl

February 09, 2023

In today’s E54 of 5mins of Postgres we talk about why you shouldn't use VACUUM FULL and how to use the alternative, pg_repack, on Amazon RDS and Aurora. We walk through pg_repack’s benefits, how to install it, and how it helped making one of our tables over 10x smaller.

Share this episode: Click here to share this episode on twitter, or sign up for our newsletter and subscribe to our YouTube channel.

Why you should use pg_repack over VACUUM FULL in Postgres
Problems with VACUUM FULL in Postgres
Scenarios where you think you need VACUUM FULL, but you don’t
The benefits of using pg_repack
Installing pg_repack on Amazon RDS and Aurora
Running pg_repack on AWS RDS and Aurora
What we have discussed in this episode of 5mins of Postgres

Transcript

Let's take a look!

Why you should use pg_repack over VACUUM FULL in Postgres

In this blog post by depesz he talks about when to use VACUUM FULL. He says: the short answer is never. You should not use VACUUM FULL. At all.

For those who don't know, VACUUM FULL is essentially a more aggressive form of a regular VACUUM. It's something that Postgres itself would not run on its own but rather something that you would have to run explicitly.

Problems with VACUUM FULL in Postgres

People do actually run VACUUM FULL, and, as depesz notes here, people put VACUUM FULL calls in things like cron jobs, or they run it manually when they feel it is the right moment; and there really isn't a good moment to run VACUUM FULL. The reason for this is that it takes an Access Exclusive Lock for the table.

There really isn't a good moment to run VACUUM FULL. The reason for this is that it takes an Access Exclusive Lock for the table.

It doesn't just take that lock for a moment, it takes that for the full time of that VACUUM FULL! Which, for a large table, could be multiple hours. That is a big problem because during that time nothing else can access the table:

No SELECTs
no INSERTs
no UPDATEs
no DELETEs

This is different from a regular VACUUM, regular VACUUM you can run and the workload can still operate in parallel.

This is due to the fact that VACUUM FULL rewrites the whole table and the indexes. And, again, that just takes time. The question is: what should I do if I think I have a use case for VACUUM FULL?

VACUUM FULL rewrites the whole table and the indexes. And, again, that just takes time.

Scenarios where you think you need VACUUM FULL, but you don’t

depesz describes that there are generally two situations where people think that they need VACUUM FULL:

They think that autovacuum doesn't handle everything as it should
They have very bloated tables and they want to remove the bloat from the table

Regarding the first, people think that they have to do better than what autovacuum is doing and they have to fix it because autovacuum missed something – and so they think VACUUM FULL is going to make things better.

Regarding the second, people want to run something that actually makes the table smaller, which most of the time a regular VACUUM does not.

As depesz notes: if you have an autovacuum issue, really, you should go and try to debug your autovacuum settings. If autovacuum does not work for you, what you should do is run a normal VACUUM.

Some people I know run daily VACUUMs, but they don't run the full VACUUM, they just run a regular VACUUM at night. That's not going to lock anything, it's not going to block anything, it's just going to make sure that cleanup work is done consistently when autovacuum can't trigger things correctly.

If you have bloat on a table, which usually would be caused by having an autovacuum configuration that's not optimal, there isn't really as easy a way to fix it. VACUUM FULL is the only way in core Postgres to really clean things up.

The benefits of using pg_repack

Now, pg_repack is a utility you can use that is not part of standard Postgres, you actually have to install it as an extension.

This is something that has been pretty well tested over the years. I have run it myself on a production database without problems, and I've heard many people have good success with it. The main benefit of pg_repack is that it rewrites the table without taking long locks. It doesn't impact production workloads in the same way. It's pretty amazing.

The way it works, in short is: You create the extension and then you run pg_repack. The challenging part about running pg_repack in the cloud, when you have a managed database, is that you actually have to install not just the extension, but you also have to install the pg_repack utility.

Installing pg_repack on Amazon RDS and Aurora

The team at AWS actually wrote about how to do this a little bit over a year ago. The referenced post is a somewhat long article which describes the benefits and also says: "well, you shouldn't run VACUUM FULL".

Quick note here: if you're wondering, do I need this on Aurora? Yes, you do! Bloat can still happen on Amazon Aurora. Aurora still has VACUUM. There is a reason that AWS made this available for both, regular AWS RDS and Aurora.

Let’s continue: first of all, you create the extension pg_repack, which is fully supported on Amazon RDS and Aurora. Later, you need to have a client machine. If you're used to a managed service, you actually have to run pg_repack as a tool.

The way you would do this is to fire up an EC2 instance and then get the pg_repack binary installed, either building it from source or the Postgres Yum repository.

Be mindful: It has to be the exact same version of pg_repack. This is challenging because, for example, Amazon RDS doesn't always update to the latest patch release of all the extensions.

I was running pg_repack a couple of weeks ago and I had to go in and use an older version of the pg_repack binary to match what the Amazon RDS database was using. This is something to watch out for, it's going to take some time to get this installed correctly.

Running pg_repack on AWS RDS and Aurora

But once you do, you can run pg_repack against a particular table. It's pretty straightforward. You can also read about the other options on the pg_repack page. Really, the main option is that you give it a table and then it starts doing its work. You want to make sure that you run this from a long-lived connection, so in a "screen" or "tmux" session on an EC2 instance, and have it running for an hour or two, or however long it takes depending on the size of the table.

To conclude why you would want to do this: Personally, I've experienced, we had a table that was 490 gigabytes in size, but after we ran pg_repack, it was reduced to just 10 gigabytes. We had over 400 gigabytes of bloat on that table, which ultimately was caused by a combination of some bad UPDATE statements throughout multiple weeks, and autovacuum settings not being aggressive enough.

pg_repack helped us improve the performance significantly, making that table over 10x smaller!

Thanks for checking out episode 54 of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!