How to optimize correlated subqueries in Postgres

In episode 76 of “5mins of Postgres”, we are going to talk about how to optimize subqueries in Postgres by understanding the Postgres planner better. We look at correlated vs. uncorrelated subqueries, as well as scalar subqueries vs. tabular subqueries.



Share this episode: Click here to share this episode on twitter, or sign up for our newsletter and subscribe to our YouTube channel.


Transcript

Let's dive in!

How Postgres subqueries are used in a query

This is a blog post by Laurenz Albe on the Cybertec blog. In this post, Laurenz describes the different ways of how subqueries are used in a query and what that means for optimization decisions the planner makes.

What I liked about this post was that Laurenz found a good set of terms of how to describe how a subquery is used in the query and what that means for performance.

What is a subquery?

To start with, what we mean by subquery is a portion of a query where you put something in parentheses and you have an outer query and then in that inner query, you're doing a sub-SELECT on a particular table that might reference data from outside of that sub-SELECT.

The most important aspect of subqueries is that there's what you call correlated subqueries and uncorrelated subqueries.

Correlated subqueries vs. uncorrelated subqueries

An uncorrelated subquery is a query that does not reference anything else from the outside, versus a correlated subquery is when the query that's inside that sub-SELECT is referencing something from the outside. We're selecting from a table a, and then in the target list of that query for table a, we're also selecting from a table b and we're referencing a column of the original table a. Obviously, you first have to get the data from a, to then be able to do the lookup on the table b. That's a correlated subquery. If you don't do any reference to the outside, then that's an uncorrelated subquery.

This is a really important difference because when uncorrelated subqueries are encountered by the Postgres planner, then Postgres will not pull this up into the overall query plan, it will actually treat it separately from the main query. When this happens, what you will see in your EXPLAIN plan is what's called an InitPlan. This is essentially an initial plan that's done independently of the rest of the query that then, for example, could be used to do some other lookups. But that subquery itself does not require the result set from the main query.

Uncorrelated subqueries are almost never a performance problem. And the reason is because they're usually executed just once.

Download Free eBook: How To Get 3x Faster Postgres

Scalar subqueries vs. tabular subqueries

What Laurenz describes in his blog post is the difference between a scalar subquery and a tabular subquery. A scalar subquery is when the query returns just a single data point, a single value. For example, when we say WHERE and we're matching the number of elements, so we're saying SELECT COUNT(*) FROM something in a subquery, then that's going to be a single value that's being returned, a single row being returned. That's a scalar subquery.

On the other hand, a *tabular subquery would be when you are referencing something that returns multiple rows.

For example, you could use a tabular subquery in a FROM list where you're returning multiple rows, common table expressions (CTEs are a good example of this) and also IN lists or EXISTS or NOT EXISTS expressions are also good examples of that.

And different from what we said earlier about uncorrelated subqueries, scalar correlated subqueries are actually performance problems.

As Laurenz states, you should usually avoid them if you can. The reason for it is pretty simple. If you have a scalar subquery, it has to be implemented as a Nested Loop because of the way Postgres executes it. For each of the rows that it's encountering, it will have to evaluate that correlated scalar subquery.

That means that if you are having a lot of results from an earlier table scan, you will have to run your scalar subquery for each of those results. You can imagine, that quickly becomes very slow.

Rewriting scalar subqueries explicitly into a JOIN

One technique that Laurenz describes here is to focus on rewriting these scalar subqueries explicitly into a JOIN. When you don't have it in a sub-SELECT but you actually write it as a JOIN, Postgres will be able to use other join techniques besides Nested Loops. In Laurenz’ example, we are actually able to have a left join which is semantically equivalent and actually gives you a similar result. There are some edge cases that you have to watch out for, which is why the planner doesn't do this automatically. But, if you are rewriting your query, this is usually a safe transformation to do. In the second case that Laurenz shows, you can also do this with a GROUP BY clause and HAVING, which will also let Postgres choose a better join strategy.

How Postgres deals with IN vs. NOT IN

An interesting case is when we're looking at IN and NOT IN. There is a variant of IN where you're saying: “this column IN” and then you have a list of values that you're getting from that subquery.

Postgres will generally be able to rewrite this in a way where it is going to be more efficient, for example by the condition that you are joining on being pulled into the sub-SELECT.

If you write NOT IN, then Postgres will not be able to do this type of pull up. It is important that if you are finding yourself to use NOT IN, to instead use NOT EXISTS, because NOT EXISTS will perform better.

A technique to force Postgres to use a nested loop

Last but not least, Laurenz mentions an interesting aspect. A lot of his article is about avoiding the forced nested loop that a correlated subquery will usually produce. However, sometimes you want to make sure Postgres uses a nested loop. Of course, you could use techniques like using pg_hint_plan to force the optimizer to always do a nested loop join.

But, one of the things you can do is force a nested loop join by writing a lateral cross join. That will always be done as a nested loop by Postgres - today at least. One of the things where I find this useful is when you want to make sure to get a parameterized index scan, meaning you want to make sure that you are able to use an index on the lookup on that second table you're joining with. If that’s the case, then it is very useful to be able to force a nested loop.

Thanks for joining us for E76 of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres


Enjoy blog posts like this?

Get them once a month to your inbox