Postgres performance with IN vs ANY
In E75 of “5mins of Postgres” we're going to talk about a surprising case Kaarel Moppel ran into, where using "IN" is faster than "ANY". We investigate this edge case in detail and are also looking at Tom Lane's response, who is one of the main authors of a lot of Postgres planner logic, to Kaarel's case.
Share this episode: Click here to share this episode on twitter, or sign up for our newsletter and subscribe to our YouTube channel.
Transcript
Let's have a look!
A surprising performance problem with single-element lists for IN vs ANY
This is a blog post by Kaarel Moppel on his personal blog, where he talks about the difference between using IN
in a Postgres query versus using ANY
in a Postgres query.
Just for reference, this is when you're querying a table and in the WHERE
clause, you're either using a particular column IN
a set of values or you're saying a particular column = ANY
, and then a particular array of values.
Generally speaking, ANY
is more performant. And thank you, Kaarel, for the shout out to our previous 5mins of Postgres episode, where we talked about the performance difference between IN
and ANY
. Back then, my general recommendation was that you should be using ANY
, and I haven't changed that recommendation, but Kaarel ran into a surprising case where ANY
is actually worse performing than IN
.
Why using ANY is generally giving better performance than using IN in Postgres
Let's do a quick repeat of why we think that generally ANY
is a good choice. First of all, especially in older Postgres versions before Postgres 14, IN
lists actually perform worse than ANY
. This comes down to how IN
lists are processed. IN
lists are parsed individually for each element versus an ANY
list with an array can be parsed much more efficiently, essentially a single value that then later on gets transformed.
There are a couple other benefits, but one thing that I've found quite useful is when you're looking at your query statistics, an ANY
list will just show up as a single entry in pg_stat_statements
versus IN
will show up as one entry for each different amount of IN
entries you have.
Speaking of the case where you have a single element in your IN
list or your ANY
list, that is really what this edge case is all about.
An edge case with 90,000x difference in performance
What Kaarel saw in a production customer case is that there was a five orders of magnitude difference - a 90,000x difference, for this seemingly harmless syntax change and IN
was the option that won out here. IN
was that much faster.
That's really surprising, because as a general best practice we would expect ANY
to be faster. Why is IN
faster here? It really comes down to how the Postgres parser handles this.
Let's jump into the actual bug report that Kaarel submitted to the pgsql-bugs mailing list.
The perfect Postgres bug report
I want to start by saying that this is the way to do it. If you ever have to report a bug to Postgres, do exactly what Kaarel did here. Create a reproducer and share the reproducer with the mailing list. There's nothing more frustrating when somebody comes to the mailing list and they're like, "My database is slow. Here's a query." Maybe, if people are generous, they include an EXPLAIN
plan, but unfortunately there is no way for somebody working on Postgres to run this locally. They can’t reproduce the situation without having access to your customer database, which of course you couldn't just share on a public mailing list. Thankfully, in this case, Kaarel included a full reproducer that shows the problem and you could just run this yourself.
You can see in the bug report that two tables being joined, there is a WHERE
clause that says “this project_id equals either ANY
or IN
”. And then there's an ORDER BY
statement that also does an order by that same column that's being filtered by as well as another column.
There is also an index that supports this ORDER BY
condition. In many cases it makes sense to use the index. If you had a lot of elements, the reasonable plan here is to first get all the matching values on a
, and then use the output of that to then find the right set of values on b
.
Now, that is not an efficient plan if you just have a single element. If you just have a single project_id
you're searching for, it's actually more efficient to first look at that b
table, to find all the b
s, in this case 15,000. This is a much smaller number than 8 million on the other table, and then an index only scan on a
.
It makes a lot of sense why in this particular case one is faster than the other. The question is, why couldn't we always be using that second plan?
If you're lucky when you report a Postgres bug, the people actually working on the code will respond back to you.
The intricacies of the Postgres planner
In this case, Tom Lane, who's one of the main authors of a lot of the Postgres planner, especially the more intricate pieces of the Postgres planner logic, responded by first of all saying: “don't assume that it's guaranteed that IN
and ANY
will produce the same plan. It is actually permissible for Postgres to choose a different plan.”
What's really happening here is that this "project_id IN (1)" gets turned into "project_id = 1".
That's done by the parser, so very early in the process. Then, when the planner sees that operator expression with a single value, it's actually able to say “you're ordering by this project id, but really you just have a single value for project id. That's a no-op. I don't actually have to respect that order.” Only because of that ORDER BY
being ignored, it's permissible to use that second faster plan. That wouldn't of course work if you had two elements.
It's a very particular case where that early optimization in the parser switches it over to an =
statement and then gets you this optimization.
As Tom says, there is no deduction like this for ANY
clauses. Unfortunately, it's also not really possible to assume this, because an array will be processed later, and so it's not possible for the parser to do that same transformation. As Tom says in his response to the bug report, he's also not sure about the cost benefit ratio of adding special code just to check for this behavior.
It's unfortunate, but it is an edge case. You will have to search for a problem like this to happen, but of course if it does happen. It's quite frustrating. Maybe this changes, sometimes there is more conversation on these threads. But, in the meantime, if you are running into this problem for some reason, meaning you are getting a bad plan for a single element list, and an IN
list performs better and you have that ORDER BY
in particular, then it might be worth adding logic on the application site to say “if I'm just querying for a single value, use an =
operator, if I'm querying for multiple values, do"= ANY
with an array.” That way you will get the best performance in both cases.
Thanks for checking out E75 of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!