Notice: We've updated our Privacy Policy, effective November 14, 2024.

Waiting for Postgres 17: Faster B-Tree Index Scans for IN(...) lists and ANY =

In today’s E111 of “5mins of Postgres” we discuss faster B-tree index scans in Postgres 17 for queries that involve IN lists or other cases where multiple array values are being passed to Postgres (ScalarArrayOpExpr). We show how even simple cases now avoid repeated page access, and how turning filters into index conditions and processing like an Index Skip Scan can yield significant speedups for certain queries.



Share this episode: Click here to share this episode on LinkedIn or on X/Twitter. Feel free to sign up for our newsletter and subscribe to our YouTube channel.


Transcript

We'll start with this commit by Peter Geoghegan with co-author Matthias van de Meent, that recently landed in Postgres 17.

There's been a lot of back and forth over the last months in terms of testing, benchmarking and revising the approach. So I think the chances are pretty good that this will stay in the final release, but as always with these changes before September or October this year, we won't know for sure if this will actually land.

Improving the handling of ScalarArrayOpExpr (SAOP) in B-tree index scans

The commit message talks about what are called scalar array operator expressions, aka ScalarArrayOpExpr or "SAOP":

Teach nbtree SAOP index scans to decide how to execute ScalarArrayOp scans (when and where to start the next primitive index scan) based on physical index characteristics. This can be far more efficient.

In practical terms, you could say ScalarArrayOpExpr is an IN list, where you have multiple fixed values that are being provided by the client that Postgres will look up in the index in order to find the result. Same goes for an " column = ANY" with a list of values or other operators like NOT IN, or the ALL operator.

Now, one of the main ideas here is that this takes the previous commit by Tom Lane, back in 2011, a lot further, which initially enabled B-tree indexes to be smarter about index scans and index only scans in relation to these ScalarArrayOpExpr.

The first improvement here is that an SAOP scan will now reliably avoid duplicative leaf page access:

Reducing duplicative leaf page access

Previously, before this commit, Postgres would repeatedly traverse the B-tree index just to potentially get to the exact same page again, because it wasn't able to work with multiple values at a time. The idea is that now this array of input values is considered when we're working with a particular page, and so we can potentially get the next value without having to re-traverse the whole tree. That's a pretty big improvement just in terms of repeated page accesses.

Before we jump into an example of that, I also want to highlight to you one important documentation change that clarified something that I didn't know before, which is pg_stat_all_indexes, which is one of the internal statistics tables, has an idx_scan counter, same counter exists also per table. And Peter, as part of this commit, added a clarification note, where he said, queries that have these arrays of multiple scalar values can perform multiple "primitive" index scans:

Queries that use certain SQL constructs to search for rows matching any value out of a list or array of multiple scalar values (see Section 9.25) perform multiple “primitive” index scans (up to one primitive scan per scalar value) during query execution. Each internal primitive index scan increments pg_stat_all_indexes.idx_scan, so it's possible for the count of index scans to significantly exceed the total number of index scan executor node executions.

Now, what does "primitive" mean here? It essentially means that you're accessing the _bt_first function in the B-tree code multiple times, that's essentially where it starts a new traversal of this B-tree structure.

And so the idea is previously, if you had an IN list with multiple values, for each of the values, it would increment the index scan counter, which means the index scan counter actually means how often this _bt_first function was executed, which started a traversal of the B-tree index. Now in Postgres 17, we'll see this be lower in many cases because of these changes.

Testing the difference between 16 and 17

Let's take a look here. I have Postgres 16 running as well as the latest development build of Postgres 17. I'll create a simple test table here. And I'll just insert 10,000 values into this test table:

CREATE TABLE test(id SERIAL PRIMARY KEY);
INSERT INTO test SELECT * FROM generate_series(1, 10000);

Now I'll run an EXPLAIN (ANALYZE, BUFFERS) on that table with 7 values on Postgres 16:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE id IN (1, 2, 3, 4, 5, 6, 7);
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=30.05..49.90 rows=7 width=4) (actual time=0.051..0.053 rows=7 loops=1)
   Recheck Cond: (id = ANY ('{1,2,3,4,5,6,7}'::integer[]))
   Heap Blocks: exact=1
   Buffers: shared hit=15
   ->  Bitmap Index Scan on test_pkey  (cost=0.00..30.05 rows=7 width=0) (actual time=0.029..0.030 rows=7 loops=1)
         Index Cond: (id = ANY ('{1,2,3,4,5,6,7}'::integer[]))
         Buffers: shared hit=14
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.355 ms
 Execution Time: 0.110 ms
(11 rows)

And you'll see here that Postgres tells me that there's 15 "shared hit" pages.

Now, let me check how many index scans I did:

SELECT idx_scan, idx_tup_fetch FROM pg_stat_user_tables WHERE relname = 'test';
 idx_scan | idx_tup_fetch 
----------+---------------
        7 |             7
(1 row)

We can see that here, we had 7 of these "primitive" index scans, of the _bt_first function being called.

Now, let's take a look at Postgres 17:

                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=30.05..49.90 rows=7 width=4) (actual time=0.019..0.020 rows=7 loops=1)
   Recheck Cond: (id = ANY ('{1,2,3,4,5,6,7}'::integer[]))
   Heap Blocks: exact=1
   Buffers: shared hit=3
   ->  Bitmap Index Scan on test_pkey  (cost=0.00..30.05 rows=7 width=0) (actual time=0.011..0.011 rows=7 loops=1)
         Index Cond: (id = ANY ('{1,2,3,4,5,6,7}'::integer[]))
         Buffers: shared hit=2
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.092 ms
 Execution Time: 0.031 ms
(11 rows)

And the first change you'll notice is that here instead of 15 pages being hit, only have 3 pages being hit. More importantly, you can see that the index scan counter is just 1 now instead of 7:

SELECT idx_scan, idx_tup_fetch FROM pg_stat_user_tables WHERE relname = 'test';
 idx_scan | idx_tup_fetch 
----------+---------------
        1 |             7
(1 row)

This shows that fundamental change where now, because the individual index scan already can process everything at once, it doesn't need to do the whole work again.

In some test cases where you have a long IN list, even this change alone might sometimes make significant improvements of 20-30%, just if you're CPU bound on these page accesses. It's very workload dependent if such a simple query sees a benefit, but generally speaking there's going to be less contention on these internal buffer locks where you're accessing the same page multiple times.

Better query plans with smarter ScalarArrayOpExpr handling

Now, the big improvement here are what Peter in the mailing list discussion called MDAM techniques. MDAM sends for "multidimensional access methods". And that's based on the 1995 paper "Efficient Search of Multidimensional B-Trees", which introduced a couple of concepts that essentially relate to how do we traverse B-tree index structures.

Back then almost 30 years ago, they were also already talking about IN lists, and they gave one example that might look very much like a query that you run these days:

SELECT date,item,class,store,sum(total_sales)
  FROM sales
 WHERE
       date between '06/01/95' and '06/30/95'
       and item_class IN (20,35,50)
       and store IN (200,250)

So this concept is where Peter's work essentially started.

As Peter describes here, the most compelling use cases for this change is when you have multiple index columns with multiple ScalarArrayOpExpr clauses:

The most compelling cases for the patch all involve multiple index columns with multiple SAOP clauses (especially where each column represents a separate "dimension", in the DSS sense). It's important that index sort be preserved whenever possible, too. Sometimes this is directly useful (e.g., because the query has an ORDER BY), but it's always indirectly needed, on the nbtree side (when the optimizations are applicable at all).

So, this means that you have IN (... list of values ...) and then other_column IN ( ... list of different values ...). Now that's a lot more efficient.

Also additionally, if you're doing scans where you don't scan for the leading column, but you use the leading column, for example, for a sort, Postgres will now be able to do things much more efficiently.

Index Skip Scans

This also relates to discussions around Index Skip Scans. The simplest example here, given in the Postgres Wiki is, imagine you have an index on two columns "order_id" and "product_id" and you're searching for "product_id", but then you're returning the "order_id" to the client:

CREATE INDEX ON orderlines (order_id, product_id);
SELECT order_id,
       product_option
  FROM orderlines
  WHERE product_id IN (2, 11, 23);

So you're getting a benefit from that "order_id" being in an index, maybe other queries are using that, but really your filter is not on that. And so now with this work, this is also going to be more efficient.

A practical example: 3x performance improvement

Now I want to show you one real life example that community member Benoit Tigeot provided to Peter as part of working on this.

And I think this is a great example where Benoit put together a working example that the community can take and use to iterate on a particular patch. If you find yourself reporting something to the Postgres community, please follow what Benoit did here and make a reproducer that Postgres developers can actually run.

So Benoit provided a setup script, I've have already run this locally. And I'm just going to run the "EXPLAIN ANALYZE":

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT * FROM docs
WHERE status IN ('draft', 'sent') AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280') ORDER BY sent_at DESC
NULLS LAST LIMIT 20 OFFSET 0;

I'm just going to show you the difference between 16 and 17.

On Postgres 16, this runs in about 270 milliseconds:

                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..69378.93 rows=20 width=42) (actual time=52.651..270.482 rows=20 loops=1)
   Output: id, type, status, sender_reference, sent_at, created_at
   Buffers: shared hit=1502 read=50857
   ->  Index Scan using docs_sent_at_idx_1 on public.docs  (cost=0.56..103408460.52 rows=29810 width=42) (actual time=52.649..270.469 rows=20 loops=1)
         Output: id, type, status, sender_reference, sent_at, created_at
         Filter: (((docs.status)::text = ANY ('{draft,sent}'::text[])) AND ((docs.sender_reference)::text = ANY ('{Custom/1175,Client/362,Custom/280}'::text[])))
         Rows Removed by Filter: 52032
         Buffers: shared hit=1502 read=50857
 Query Identifier: -3618418932323417527
 Planning Time: 0.528 ms
 Execution Time: 270.518 ms
(11 rows)

On Postgres 17 this runs in an amazing 62 milliseconds:

                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..1414.45 rows=20 width=41) (actual time=2.779..62.048 rows=20 loops=1)
   Output: id, type, status, sender_reference, sent_at, created_at
   Buffers: shared hit=281
   ->  Index Scan using docs_sent_at_idx_1 on public.docs  (cost=0.56..2119626.94 rows=29983 width=41) (actual time=2.776..62.039 rows=20 loops=1)
         Output: id, type, status, sender_reference, sent_at, created_at
         Index Cond: (((docs.sender_reference)::text = ANY ('{Custom/1175,Client/362,Custom/280}'::text[])) AND ((docs.status)::text = ANY ('{draft,sent}'::text[])))
         Buffers: shared hit=281
 Planning Time: 0.282 ms
 Execution Time: 62.099 ms
(9 rows)

And one of the main differences here for this particular query: In 16 it was using the index for it's sort order, but it wasn't using the index for filtering directly, essentially applying that after it got the data from the index, versus in 17 it's able to use that directly in the index condition.

And so I think this is a great real world example, where without making any changes to your index structure, just by being smarter about how the index is being used we can see a tremendous 3x performance improvement.

In conclusion

Thank you, Peter and Matthias for working on this change, well done Benoit for providing a good working test case. I'm excited about using this on my own databases. I can already see that I'll have a few cases where this will provide noticeable performance improvements for B-tree index scans.

I hope you learned something new from E111 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres


Enjoy blog posts like this?

Get them once a month to your inbox