5mins of Postgres E1: Using Postgres statistics to improve bad query plans, pg_hint_plan extension
Hello, and welcome to 5 minutes of Postgres!
Today, we are starting a new weekly video series where we'll walk through interesting articles about Postgres from the last week, as well as evergreen blog posts from the past, and provide context and our personal perspective on it. We are planning to publish a new episode of this series every Thursday, so keep an eye out!
We'll always embed the video first, but below it you can find a transcription in case you want to read through what we discussed in the video and check out the mentioned links. Let's get started:
Today, we'll talk about bad query plans, how to use Postgres statistics to improve them, and if that doesn't work, what options exist to hint a particular query plan so that Postgres uses a different plan over what it would usually use.
Finding and re-writing slow Postgres query, using CREATE STATISTICS in Postgres to improve the query plan, and using the pg_hint_plan extension
Now, this was prompted by this article on the Spacelift blog where Jacob Martin said they tricked Postgres into using a particular query plan which made their query 200 times faster.
That's obviously a bold claim.
Before we jump into that it is important to note that they use Aurora Serverless and sometimes Aurora can have different behavior, especially when it comes to storage.
On the particular query that they've had the issue with, they are doing a
SELECT and you can see that, really, it's the sub-SELECT that they had a problem with, which caused a bad plan for them. They had a table called runs and that table had a lot of inactive rows and a very small set of active rows. They saw this was a bad plan. It took 20 seconds. That's not good enough in their eyes. Then, they were trying to understand what exactly was slow. As you can see further down in their article it came down to a slow index scan that had a pretty generic condition: A
worker_id IS NOT NULL index condition was taking a lot of time to scan the index and looked over a lot of rows.
Ultimately, what they were particularly concerned with was an overestimation that Postgres made. Postgres thought there were 2 million rows, but there were only 42 rows. That's clearly a bad estimation. What's interesting is what they ended up doing. They rewrote the query and made a pretty significant change in how the query works. They essentially put a sub-SELECT inside the sub-SELECT, and then were able to get down from 20 seconds to 100 milliseconds. Interestingly enough, the resulting query plan still does the same index scan, but the order is different so it's presumably able to limit the amount of the index it has to look at to only a few rows and then Postgres is able to determine that it has enough data and can terminate this scan early. The conclusion here is they had to change the query and now it's fast, and the rewrite actually was pretty straightforward.
There was a lot of discussion on this and I would say there's two camps that I see in the Hacker News thread:
There's the one group of people that recommend using query plan hints. Postgres will create bad plans and you need to make sure that if the database does something wrong, you need to be able to override that. I think there's some merit to that and we'll get back to that.
The other thing that I do want to highlight is somebody said Postgres probably doesn't know the correlation between two different specific columns. A lot of times, you have to remember that Postgres is estimating. Postgres is making an assumption based on the data in your table and based on the
ANALYZEprocess it will create certain statistics. These statistics are what the planner uses to determine how to plan your query. If those assumptions are wrong, because for example Postgres just doesn't collect the necessary statistics then that will cause mis-estimates and bad plans.
There is a good blog post by affinity where Jared Rulison used the extended statistics system in Postgres to collect more statistics and, even more impressive, created a 3000x speed up. I'll skip over most of this post in this summary, I recommend reading it though. But in short, they - just like Spacelift- had a bad estimate as well. They ultimately used
CREATE STATISTICS in Postgres on the
songs table and
artist_id as they wanted to collect additional dependency information. If the
artist_id has a certain value and the
album_id has a certain value, Postgres remembers that those are associated. That helps tremendously if you have correlated values in your table.
If you encounter a query planning problem, look at the estimates, look at the statistics. Use
CREATE STATISTICS to create additional statistics that improve the query plan.
Now, that made me curious though, if we want to force query plans, what are the options?
Really, the main option that exists today, short of some very basic ways to do it in core Postgres, is the pg_hint_plan extension. It is available on RDS and Aurora as well.
The way that
pg_hint_plan works is that you have to add special comments to your query to indicate how you want the query implemented. For example, you tell it that, between the tables A and B, you want it to use a
HashJoin and to scan table A, you want it to use a sequential scan. The
pg_hint_plan extension essentially runs as a planner hook and overrides what the planner would usually do and instead ends up taking a different query plan. This works. Keep in mind, there is a little bit of overhead because it has to do a little bit of extra work with the way it hooks into the planner.
Generally, if you really have a need to hint particular query plans though, this is a good way to do it in Postgres.