Postgres performance with IN vs ANY
In E75 of “5mins of Postgres” we're going to talk about a surprising case Kaarel Moppel ran into, where using "IN" is faster than "ANY". We investigate this edge case in detail and are also looking at Tom Lane's response, who is one of the main authors of a lot of Postgres planner logic, to Kaarel's case.
Let's have a look!
This is a blog post by Kaarel Moppel on his personal blog, where he talks about the difference between using
IN in a Postgres query versus using
ANY in a Postgres query.
Just for reference, this is when you're querying a table and in the
WHERE clause, you're either using a particular column
IN a set of values or you're saying a particular column
= ANY, and then a particular array of values.
ANY is more performant. And thank you, Kaarel, for the shout out to our previous 5mins of Postgres episode, where we talked about the performance difference between
ANY. Back then, my general recommendation was that you should be using
ANY, and I haven't changed that recommendation, but Kaarel ran into a surprising case where
ANY is actually worse performing than
Let's do a quick repeat of why we think that generally
ANY is a good choice. First of all, especially in older Postgres versions before Postgres 14,
IN lists actually perform worse than
ANY. This comes down to how
IN lists are processed.
IN lists are parsed individually for each element versus an
ANY list with an array can be parsed much more efficiently, essentially a single value that then later on gets transformed.
There are a couple other benefits, but one thing that I've found quite useful is when you're looking at your query statistics, an
ANY list will just show up as a single entry in
IN will show up as one entry for each different amount of
IN entries you have.
Speaking of the case where you have a single element in your
IN list or your
ANY list, that is really what this edge case is all about.
What Kaarel saw in a production customer case is that there was a five orders of magnitude difference - a 90,000x difference, for this seemingly harmless syntax change and
IN was the option that won out here.
IN was that much faster.
That's really surprising, because as a general best practice we would expect
ANY to be faster. Why is
IN faster here? It really comes down to how the Postgres parser handles this.
Let's jump into the actual bug report that Kaarel submitted to the pgsql-bugs mailing list.
I want to start by saying that this is the way to do it. If you ever have to report a bug to Postgres, do exactly what Kaarel did here. Create a reproducer and share the reproducer with the mailing list. There's nothing more frustrating when somebody comes to the mailing list and they're like, "My database is slow. Here's a query." Maybe, if people are generous, they include an
EXPLAIN plan, but unfortunately there is no way for somebody working on Postgres to run this locally. They can’t reproduce the situation without having access to your customer database, which of course you couldn't just share on a public mailing list. Thankfully, in this case, Kaarel included a full reproducer that shows the problem and you could just run this yourself.
You can see in the bug report that two tables being joined, there is a
WHERE clause that says “this project_id equals either
IN”. And then there's an
ORDER BY statement that also does an order by that same column that's being filtered by as well as another column.
There is also an index that supports this
ORDER BY condition. In many cases it makes sense to use the index. If you had a lot of elements, the reasonable plan here is to first get all the matching values on
a, and then use the output of that to then find the right set of values on
Now, that is not an efficient plan if you just have a single element. If you just have a single
project_id you're searching for, it's actually more efficient to first look at that
b table, to find all the
bs, in this case 15,000. This is a much smaller number than 8 million on the other table, and then an index only scan on
It makes a lot of sense why in this particular case one is faster than the other. The question is, why couldn't we always be using that second plan?
If you're lucky when you report a Postgres bug, the people actually working on the code will respond back to you.
In this case, Tom Lane, who's one of the main authors of a lot of the Postgres planner, especially the more intricate pieces of the Postgres planner logic, responded by first of all saying: “don't assume that it's guaranteed that
ANY will produce the same plan. It is actually permissible for Postgres to choose a different plan.”
What's really happening here is that this "project_id IN (1)" gets turned into "project_id = 1".
That's done by the parser, so very early in the process. Then, when the planner sees that operator expression with a single value, it's actually able to say “you're ordering by this project id, but really you just have a single value for project id. That's a no-op. I don't actually have to respect that order.” Only because of that
ORDER BY being ignored, it's permissible to use that second faster plan. That wouldn't of course work if you had two elements.
It's a very particular case where that early optimization in the parser switches it over to an
= statement and then gets you this optimization.
As Tom says, there is no deduction like this for
ANY clauses. Unfortunately, it's also not really possible to assume this, because an array will be processed later, and so it's not possible for the parser to do that same transformation. As Tom says in his response to the bug report, he's also not sure about the cost benefit ratio of adding special code just to check for this behavior.
It's unfortunate, but it is an edge case. You will have to search for a problem like this to happen, but of course if it does happen. It's quite frustrating. Maybe this changes, sometimes there is more conversation on these threads. But, in the meantime, if you are running into this problem for some reason, meaning you are getting a bad plan for a single element list, and an
IN list performs better and you have that
ORDER BY in particular, then it might be worth adding logic on the application site to say “if I'm just querying for a single value, use an
= operator, if I'm querying for multiple values, do
"= ANY with an array.” That way you will get the best performance in both cases.