How partition-wise joins and aggregates improve query performance
In E81 of “5mins of Postgres” we're talking about how partition-wise joins can improve your Postgres query performance. We also look at partitionwise-aggregates and their impact on query planning time and query execution time.
Let's jump in!
This is a blog post by Ashutosh from the EDB team where he ran a benchmark that's essentially inspired by TPC-H. It compares how the queries behave if you have
- no partitioning enabled,
- have partitioning enabled with no partition-wise join,
- have partition-wise join enabled.
There are two queries that Ashutosh is testing, which are joining two tables. Notably, they're not doing any partition pruning. If you recall, when we previously discussed partitioning performance in 5mins of Postgres, one of the concerns was that if there's no partition pruning, then oftentimes partitioning ends up being more expensive.
Let's see if that holds true in Ashutosh’s benchmark. In his article, he is testing different amounts of partitions. What's important to note is that it's the same number of rows for each partition, but different number of partitions in each test. For example, he's testing with five partitions, with 10,000 rows per partition, then he's testing 10, and so essentially in the first case, it's 50,000 rows, 100,000 rows and such, up to 1000 partitions, which means we have 10 million rows in that last test.
Looking at his blog post, you can see here, on the left side is the smallest test case and the right side is the biggest. You can also see that the yellow line, the lowest line that's best performing, is with partition-wise join enabled. You can also see that the unpartitioned and the partitioned case, otherwise, when you don't have partition-wise join enabled, are very close. You don't really see much of a performance difference when you're querying all of them most of the time.
Although, as you can see on the right in one of his graphs, for the case where we have the highest amount of data, the partitioning will actually slightly win out. What that shows me is that because there's no partition pruning going on, you're not restricting the search space because you're searching for a particular ID or anything. You're really, in the worst case, looking at all the data, especially when you don't have partition-wise join enabled.
Ashutosh provides a few takeaways at the end. To me, the most interesting one is: when tables are partitioned, queries perform better when partition-wise operations are used, irrespective of the data size. And this benchmark does confirm that.
Now, you might be surprised - if you haven't heard about these partition-wise operations before - that they're actually turned off by default. So in Postgres, there's a couple of relevant settings.
First of all, there's
enable_partition_pruning. This one makes a huge difference if you're passing a particular ID or a particular timestamp, and the Postgres planner is able to determine that certain partitions do not match those expressions, and so it can either during planning time or right at the start of execution time, eliminate particular partitions that are just not matched. That makes a lot of sense, and that's turned on by default.
What's turned off by default is
enable_partitionwise_aggregate. Those two essentially have to do with work that the Postgres planner does to do joins between individual partitions or do aggregates on individual partitions before they're grouped together in an Append node.
Because this is expensive and uses significantly more CPU and memory during planning, the default is off. This applies to both of them. What that means is, most likely you could be benefiting from a performance improvement here by having these on, but you just don't know because the planner doesn't even try it!
Ashutosh talked a little bit more about partition-wise join back in 2017, when it was first introduced in Postgres 11. partition-wise join, you can imagine it when you look at a particular query plan: what it does is that instead of the Append nodes coming first and then the join sitting on top of them, you will see that the join essentially moves underneath the Append nodes and runs multiple times. Further, Postgres may actually choose to use different join techniques between different partitions, depending on their size and their data distribution.
It's important to note that in order to be able to use partition-wise join, the join condition needs to include all the partition keys. So this doesn't always work. It only works in very few select conditions. Postgres 13 allowed partition-wise join to work in more cases. Essentially, it came down to a slight difference here in terms of how it checked that two partition tables are the same and can be joined in this way. It moved from being exactly the same set of child partitions to being able to match one to one sets of child partitions. But you still need to include all the partition keys in your join condition for this to work.
On the note of partition-wise operations, there's one more blog post I want to mention, which is a post by Adrien on his personal blog back in 2021. This blog post has a few interesting aspects, but I want to jump just to the partition-wise operation here. So what Adrien shows here is the comparison between partition-wise aggregate being turned off, the default, and it being turned on.
What you can see here in the first case, you have a planning time of 0.5 milliseconds, an execution time of 5,200 milliseconds. Now you turn on partition-wise aggregate. What happens is that the query cost actually goes down, and the planning time goes up. But most importantly, the execution time goes down. So if you're adding planning time and execution time, clearly we're now in better shape than before. We're saving about 600 milliseconds, compared to the previous case where we didn't have partition-wise aggregate turned on.
And really the problem here is that when Postgres starts its planning process, it just doesn't know. For some frequently running queries, adding a millisecond of planning time could be disastrous if you're calling it a lot, and especially if you have a lot of partitions, this may not be just a millisecond, it might be multiple milliseconds, and so that's the reason why this is turned off by default.
But, I would encourage you, if you have cases where you're doing a lot with partitioned tables, it's worth benchmarking both, partition-wise joins and partition-wise aggregates, to see if they can give you performance benefit.
You can enable these settings on a per connection basis, no need to change them for the whole server. Just call a
SET statement as shown in Adrien’s blog post and then the query will be slightly slower in planning, but hopefully a lot faster in execution.