How to optimize correlated subqueries in Postgres
In episode 76 of “5mins of Postgres”, we are going to talk about how to optimize subqueries in Postgres by understanding the Postgres planner better. We look at correlated vs. uncorrelated subqueries, as well as scalar subqueries vs. tabular subqueries.
Let's dive in!
This is a blog post by Laurenz Albe on the Cybertec blog. In this post, Laurenz describes the different ways of how subqueries are used in a query and what that means for optimization decisions the planner makes.
What I liked about this post was that Laurenz found a good set of terms of how to describe how a subquery is used in the query and what that means for performance.
To start with, what we mean by subquery is a portion of a query where you put something in parentheses and you have an outer query and then in that inner query, you're doing a sub-SELECT on a particular table that might reference data from outside of that sub-SELECT.
The most important aspect of subqueries is that there's what you call correlated subqueries and uncorrelated subqueries.
An uncorrelated subquery is a query that does not reference anything else from the outside, versus a correlated subquery is when the query that's inside that sub-SELECT is referencing something from the outside. We're selecting from a table
a, and then in the target list of that query for table
a, we're also selecting from a table
b and we're referencing a column of the original table
a. Obviously, you first have to get the data from
a, to then be able to do the lookup on the table
b. That's a correlated subquery. If you don't do any reference to the outside, then that's an uncorrelated subquery.
This is a really important difference because when uncorrelated subqueries are encountered by the Postgres planner, then Postgres will not pull this up into the overall query plan, it will actually treat it separately from the main query. When this happens, what you will see in your
EXPLAIN plan is what's called an
InitPlan. This is essentially an initial plan that's done independently of the rest of the query that then, for example, could be used to do some other lookups. But that subquery itself does not require the result set from the main query.
Uncorrelated subqueries are almost never a performance problem. And the reason is because they're usually executed just once.
What Laurenz describes in his blog post is the difference between a scalar subquery and a tabular subquery. A scalar subquery is when the query returns just a single data point, a single value. For example, when we say
WHERE and we're matching the number of elements, so we're saying
SELECT COUNT(*) FROM something in a subquery, then that's going to be a single value that's being returned, a single row being returned. That's a scalar subquery.
On the other hand, a *tabular subquery would be when you are referencing something that returns multiple rows.
For example, you could use a tabular subquery in a
FROM list where you're returning multiple rows, common table expressions (CTEs are a good example of this) and also IN lists or EXISTS or NOT EXISTS expressions are also good examples of that.
And different from what we said earlier about uncorrelated subqueries, scalar correlated subqueries are actually performance problems.
As Laurenz states, you should usually avoid them if you can. The reason for it is pretty simple. If you have a scalar subquery, it has to be implemented as a Nested Loop because of the way Postgres executes it. For each of the rows that it's encountering, it will have to evaluate that correlated scalar subquery.
That means that if you are having a lot of results from an earlier table scan, you will have to run your scalar subquery for each of those results. You can imagine, that quickly becomes very slow.
One technique that Laurenz describes here is to focus on rewriting these scalar subqueries explicitly into a
JOIN. When you don't have it in a sub-SELECT but you actually write it as a
JOIN, Postgres will be able to use other join techniques besides Nested Loops. In Laurenz’ example, we are actually able to have a left join which is semantically equivalent and actually gives you a similar result. There are some edge cases that you have to watch out for, which is why the planner doesn't do this automatically. But, if you are rewriting your query, this is usually a safe transformation to do. In the second case that Laurenz shows, you can also do this with a
GROUP BY clause and
HAVING, which will also let Postgres choose a better join strategy.
An interesting case is when we're looking at
NOT IN. There is a variant of
IN where you're saying: “this column
IN” and then you have a list of values that you're getting from that subquery.
Postgres will generally be able to rewrite this in a way where it is going to be more efficient, for example by the condition that you are joining on being pulled into the sub-SELECT.
If you write
NOT IN, then Postgres will not be able to do this type of pull up. It is important that if you are finding yourself to use
NOT IN, to instead use
NOT EXISTS, because
NOT EXISTS will perform better.
Last but not least, Laurenz mentions an interesting aspect. A lot of his article is about avoiding the forced nested loop that a correlated subquery will usually produce. However, sometimes you want to make sure Postgres uses a nested loop. Of course, you could use techniques like using
pg_hint_plan to force the optimizer to always do a nested loop join.
But, one of the things you can do is force a nested loop join by writing a lateral cross join. That will always be done as a nested loop by Postgres - today at least. One of the things where I find this useful is when you want to make sure to get a parameterized index scan, meaning you want to make sure that you are able to use an index on the lookup on that second table you're joining with. If that’s the case, then it is very useful to be able to force a nested loop.