5mins of Postgres E34: Making the most of Parallel VACUUM and why SQL/JSON was pushed out to Postgres 16

By Lukas Fittl

September 08, 2022

In this episode, we're walking through parallel VACUUMing in Postgres 13 and newer. We're also going to talk about the unfortunate fact that SQL/JSON won't be in Postgres 15.

Share this episode: Click here to share this episode on twitter, or sign up for our newsletter and check out the newsletter archive and subscribe to our YouTube channel.

VACUUM and Parallel VACUUM in Postgres
- Postgres VACUUM: The 4 phases
Parallel VACUUM in Postgres
Understanding VACUUM parameters in Postgres
Comparing Postgres VACUUM and parallel VACUUM performance
SQL/JSON is not in Postgres 15
- SQL/JSON slowing down specific workloads in Postgres 15
What we have discussed in this episode of 5mins of Postgres

Transcript

Let's dive right in!

VACUUM and Parallel VACUUM in Postgres

Starting with parallel VACUUM, this blog post by the database team from AWS goes into more details as to how that works and which configuration settings you have to tune to make the most of it.

Postgres VACUUM: The 4 phases

Generally, when you look at VACUUMing, there are four different phases that matter.

Initialize
Scan
VACUUM
Cleanup

There is a really short initialize phase at the beginning of VACUUM. Then, there is a pretty lengthy scan phase where VACUUM scans your table and finds the rows that were deleted or updated, and where the old version of the row can now be removed.

Once it knows this information, it can then VACUUM each index on the table and identify, for example, which index entries are no longer needed because the underlying table row was removed. And last, VACUUM goes through a cleanup phase where it actually cleans up all the indexes, removes dead entries and updates the heap. In some cases it also reduces the size of the table by truncating it when there are empty pages at the end of the table.

VACUUM is important, but as you can see, it's quite complex. One of the challenges of VACUUM is that on busy tables and busy databases it's oftentimes too slow. If you have a really large table it just takes a long time.

Parallel VACUUM in Postgres

Parallel VACUUM tries to optimize the index vacuum phase. Basically, when VACUUM processes these different indexes on the table, with the parallel setting, it can now process that in parallel. Note that this doesn't apply to autovacuum, it only applies to regular VACUUM.

Parallel VACUUM only applies to regular VACUUM, not autovacuum!

Understanding VACUUM parameters in Postgres

This is something that you can control through a few parameters, which ultimately define how much resources you make available to this new parallel VACUUM. It uses the built-in Postgres background worker processes for that.

The max_worker_processes Postgres VACUUM parameter

The first important setting to note here is the max_worker_processes setting. This setting determines on the whole system, how many processes can be launched to support parallel or other types of background operations. This is something that you would also use for parallel query, and parallel query and parallel VACUUM share within the same pool of concurrent worker processes.

The min_parallel_index_scan_size Postgres VACUUM parameter

Next, the min_parallel_index_scan_size setting. This setting determines in which case the VACUUM process will start a parallel background worker instead of just doing it in the regular process that it's running in. There's always a cost to launching a new process. If you have very small tables, very small indexes, it's not really worth it.

The max_parallel_maintenance_workers Postgres VACUUM parameter

And last but not least, you also have the max_parallel_maintenance_workers setting. This setting determines the maximum number of parallel workers that can be started by a single utility command. For example, a parallel VACUUM or also previously a parallel CREATE INDEX.

Comparing Postgres VACUUM and parallel VACUUM performance

Now, to test this, in the AWS blog post they create some test indexes (you need to have more than one index for this to work) and then, with the "\timing" option, they run two VACUUMs. One with just the regular VACUUM command. This does not use the parallel settings. Even if you enable the parallel settings on the server side, it's not enough, you actually explicitly need to tell VACUUM that you want it to run in parallel.

You can specify VACUUM, and then you say PARALLEL, for example, "4", and this means that it will run up to four parallel processes during the index vacuum phase. On this particular VACUUM, the regular VACUUM takes a hundred milliseconds. The parallel vacuum takes 83 milliseconds.

That's a decent improvement. I would say this should be more dramatic if you have a table with a lot of indexes. Certainly, if you create a lot of different partial indexes this can quickly add up and become very useful.

Certainly, if you create a lot of different partial indexes this can quickly add up and become very useful.

SQL/JSON is not in Postgres 15

Next, I wanted to talk about the SQL/JSON feature. We've previously talked about SQL/JSON here on 5mins of Postgres episode 13 and depesz, who also wrote about this previously on his blog, unfortunately noted last week that SQL/JSON is postponed.

These things can happen when Postgres releases are developed. Currently, Postgres 15 is in a QA phase as it goes through different beta versions. Unfortunately, the release management team, together with the community has come to the conclusion that this feature should be reverted in Postgres 15, and we should try again in Postgres 16.

There's a longer conversation here on what exactly happened, but my understanding and quick summary would be that some members of the community were doing performance testing and they noted that Postgres 15 was slower than older releases for particular workloads that they were testing with.

SQL/JSON slowing down specific workloads in Postgres 15

What they found was that the way that SQL/JSON made modifications in the executor component of Postgres caused a lot of overhead for certain workloads. This would not just make the new feature slow, but also have a potential impact to other workloads that were using those same structs in the executor. There wasn't really a good way forward to keep this feature, but maybe add a little caveat to it, because it would have impacted other types of workloads and the release management team and the community briefly discussed whether there was maybe a smaller set of features that they could get committed, but ultimately the time to the release is too short.

This actually gave me confidence. Oftentimes, there's a discussion whether you should adopt the first release when a new major version of Postgres comes out, or whether you should wait for the first patch release. As you can see, the community is doing a lot of testing ahead of the release throughout the different beta versions. Overall, I think it's a good choice that this was removed and hopefully we can see it in Postgres 16.

Thanks so much for listening. This was today’s episode of 5mins of Postgres. Subscribe to our YouTube channel and follow us on Twitter if you want to hear about next week's episode.