5mins of Postgres E34: Making the most of Parallel VACUUM and why SQL/JSON was pushed out to Postgres 16
In this episode, we're walking through parallel VACUUMing in Postgres 13 and newer. We're also going to talk about the unfortunate fact that SQL/JSON won't be in Postgres 15.
Let's dive right in!
Starting with parallel VACUUM: this blog post by the database team at AWS goes into more detail on how it works and which configuration settings you have to tune to make the most of it.
Generally, when you look at VACUUMing, there are four different phases that matter. There is a really short initialize phase at the beginning of VACUUM. Then, there is a pretty lengthy scan phase where VACUUM scans your table and finds the rows that were deleted or updated, and where the old version of the row can now be removed. Once it knows this information, it can then VACUUM each index on the table and identify, for example, which index entries are no longer needed because the underlying table row was removed. And last, VACUUM goes through a cleanup phase where it actually cleans up all the indexes, removes dead entries and updates the heap. In some cases it also reduces the size of the table by truncating it when there are empty pages at the end of the table.
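As a side note, you can watch these phases on a running VACUUM through the pg_stat_progress_vacuum view, which reports the current phase and scan progress for each active VACUUM:

```sql
-- Show the current phase and heap scan progress of any running VACUUM.
-- "phase" cycles through values like 'scanning heap', 'vacuuming indexes',
-- 'vacuuming heap' and 'cleaning up indexes'.
SELECT p.pid,
       c.relname,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total
  FROM pg_stat_progress_vacuum p
  JOIN pg_class c ON c.oid = p.relid;
```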
VACUUM is important, but as you can see, it's quite complex. One of the challenges of VACUUM is that on busy tables and busy databases it's oftentimes too slow. If you have a really large table it just takes a long time.
Parallel VACUUM tries to optimize the index vacuum phase. Basically, when VACUUM processes the different indexes on the table, with the parallel setting it can now process them in parallel. Note that parallel VACUUM only applies to regular VACUUM, not autovacuum!
This is something that you can control through a few parameters, which ultimately define how many resources you make available to parallel VACUUM. It uses the built-in Postgres background worker processes for that.
The first important setting to note here is max_worker_processes. This setting determines, for the whole system, how many processes can be launched to support parallel or other types of background operations. It is also used for parallel query, and parallel query and parallel VACUUM share the same pool of worker processes.
Next, there is the min_parallel_index_scan_size setting. This setting determines whether VACUUM will start a parallel background worker for an index, instead of just handling it in the process it's already running in. There's always a cost to launching a new process, so for very small tables and very small indexes it's not really worth it.
And last but not least, you also have the max_parallel_maintenance_workers setting. This setting determines the maximum number of parallel workers that can be started by a single utility command, for example a parallel VACUUM or a parallel CREATE INDEX.
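To pull these together, here is a minimal configuration sketch. The values are purely illustrative, not recommendations; what works for you depends on your hardware and workload:

```ini
# postgresql.conf (illustrative values, not recommendations)
max_worker_processes = 8                # system-wide worker pool, shared with parallel query
min_parallel_index_scan_size = 512kB    # indexes smaller than this stay in the leader process
max_parallel_maintenance_workers = 4    # per-command cap, e.g. for a single VACUUM
```

Note that a single VACUUM can never use more workers than max_parallel_maintenance_workers allows, and all parallel operations together are bounded by max_worker_processes.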
Now, to test this, in the AWS blog post they create some test indexes (you need to have more than one index for this to work) and then, with the "\timing" option enabled, they run two VACUUMs. The first is just the regular VACUUM command, which does not use the parallel settings. Even if you enable the parallel settings on the server side, that's not enough: you actually need to explicitly tell VACUUM that you want it to run in parallel.
You can write VACUUM, and then PARALLEL followed by, for example, "4", which means it will run up to four parallel workers during the index vacuum phase. On this particular table, the regular VACUUM takes a hundred milliseconds, while the parallel VACUUM takes 83 milliseconds.
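A sketch of that comparison (the table name here is made up; any table with at least two sufficiently large indexes will do):

```sql
\timing on

-- Regular, single-process VACUUM; does not use the parallel settings:
VACUUM test_table;

-- Up to four parallel workers during the index vacuum phase
-- (only indexes above min_parallel_index_scan_size are handed to workers):
VACUUM (PARALLEL 4) test_table;
```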
That's a decent improvement. I would say this should be more dramatic if you have a table with a lot of indexes. Certainly, if you create a lot of different partial indexes this can quickly add up and become very useful.
Next, I wanted to talk about the SQL/JSON feature. We've previously talked about SQL/JSON here on 5mins of Postgres episode 13 and depesz, who also wrote about this previously on his blog, unfortunately noted last week that SQL/JSON is postponed.
These things can happen when Postgres releases are developed. Currently, Postgres 15 is in a QA phase as it goes through different beta versions. Unfortunately, the release management team, together with the community, has come to the conclusion that this feature should be reverted in Postgres 15, and we should try again in Postgres 16.
There's a longer conversation here on what exactly happened, but my understanding and quick summary would be that some members of the community were doing performance testing and they noted that Postgres 15 was slower than older releases for particular workloads that they were testing with.
What they found was that the way SQL/JSON made modifications in the executor component of Postgres caused a lot of overhead for certain workloads. This would not just make the new feature slow, but could also impact other workloads that use those same structs in the executor. There wasn't really a good way to keep the feature with just a small caveat attached, because it would have affected other types of workloads. The release management team and the community briefly discussed whether there was a smaller set of features they could get committed instead, but ultimately the time until the release was too short.
This actually gave me confidence. Oftentimes, there's a discussion whether you should adopt the first release when a new major version of Postgres comes out, or whether you should wait for the first patch release. As you can see, the community is doing a lot of testing ahead of the release throughout the different beta versions. Overall, I think it's a good choice that this was removed and hopefully we can see it in Postgres 16.