Introducing Query Tuning Workbooks to safely tune Postgres queries on production with pganalyze!

Postgres Planner Quirks: Incremental Sort, and when it goes wrong

In today’s E120 of “5mins of Postgres” we're returning to our Postgres planner quirks series to talk about Incremental Sort, and when it goes wrong.

Incremental Sort can often speed up query plans when you have an existing sort order; however, there can be edge cases where the planner chooses a sub-optimal plan.



Share this episode: Click here to share this episode on LinkedIn. Feel free to sign up for our newsletter and subscribe to our YouTube channel.


Transcript

To be clear, Incremental sort is a great feature that got added in Postgres 13, that can often speed up query plans when you have an existing sort order, for example from a B-Tree index.

The benefits of Incremental Sort

A simple example might be an index scan that has an existing sort order in the index that Postgres can reuse. And so Postgres only needs to sort by an additional column. There are more complex cases, for example, let's imagine you have a GROUP BY, you have an aggregate function inside your queries, the aggregate function might need a particular order.

We talked previously on 5mins of Postgres about all the improvements that were added in Postgres 16 around Incremental Sort, and how they for example, make DISTINCT queries faster or ORDER BY and DISTINCT aggregates.

However I've recently come across a couple of cases where Incremental Sort actually makes performance worse. And I want to explain to you a little bit when these cases happen, so that you can see them more easily when you upgrade your Postgres.

Now to be clear, this is most likely to affect you if you're upgrading from before Postgres 13, where Incremental Sort was first added, or if you're upgrading to Postgres 16, where there are a couple of improvements regarding Incremental Sort that make it more likely to see an Incremental Sort node in your query plans.

A real-world incident after upgrading to Postgres 15 on Amazon RDS

We'll start with this bug report by Nicolas from Loxodata.

In this example, Nicolas walks us through how a customer of theirs did a migration from 11 to 15 on Amazon RDS and they noticed a degradation in response time on the query:

SELECT inputdocum0_.id AS col_0_0_
  FROM document_management_services.input_document inputdocum0_
 WHERE (inputdocum0_.indexation_domain_id in ('2d29daf6-e151-479a-a52a-78b08bb3009d'))
       AND (inputdocum0_.indexation_subsidiary_id in ('9f9df402-f70b-40d9-b283-a3c35232469a'))
       AND (inputdocum0_.locked_at IS NULL)
       AND (inputdocum0_.locked_by_app IS NULL)
       AND (inputdocum0_.locked_by_user IS NULL)
       AND (inputdocum0_.lock_time_out IS NULL)
       AND inputdocum0_.archiving_state<> 'DESTROYED'
       AND (inputdocum0_.creation_state in ('READY'))
       AND inputdocum0_.active_content=true
       AND (inputdocum0_.processing_state in ('PENDING_INDEXATION'))
 ORDER BY inputdocum0_.created_at ASC,
          inputdocum0_.reception_id ASC,
          inputdocum0_.reception_order ASC
 LIMIT 50;

The query went from a few seconds of runtime to 10 minutes.

They of course did VACUUM ANALYZE, that all seemed good, but then when they did an EXPLAIN ANALYZE, Postgres showed a change of plan.

In Postgres 15 it shows an Incremental Sort, corresponding to the ORDER BY clause on the created_at column:

 Limit  (cost=324.05..16073.82 rows=50 width=44) (actual time=1663688.290..1663696.151 rows=50 loops=1)
   Buffers: shared hit=114672881 read=5725197 dirtied=38564 written=24394
   I/O Timings: shared/local read=1481378.069 write=313.574
   ->  Incremental Sort  (cost=324.05..27838050.13 rows=88375 width=44) (actual time=1663688.289..1663696.144 rows=50 loops=1)
         Sort Key: inputdocum0_.created_at, inputdocum0_.reception_id, inputdocum0_.reception_order
         Presorted Key: inputdocum0_.created_at
         Full-sort Groups: 2  Sort Method: quicksort  Average Memory: 27kB  Peak Memory: 27kB
         Buffers: shared hit=114672881 read=5725197 dirtied=38564 written=24394
         I/O Timings: shared/local read=1481378.069 write=313.574
         ->  Merge Append  (cost=9.08..27834073.25 rows=88375 width=44) (actual time=1663655.490..1663696.068 rows=53 loops=1)
               Sort Key: inputdocum0_.created_at
               Buffers: shared hit=114672878 read=5725197 dirtied=38564 written=24394
               I/O Timings: shared/local read=1481378.069 write=313.574
               ->  Index Scan using input_document_00_created_at_idx on input_document_00 inputdocum0__1  (cost=0.43..1393473.44 rows=4386 width=44) (actual time=71549.638..71550.023 rows=4 loops=1)
                     Filter: ((locked_at IS NULL) AND (locked_by_app IS NULL) AND (locked_by_user IS NULL) AND (lock_time_out IS NULL) AND active_content AND (archiving_state <> 'DESTROYED'::text) AND (indexation_domain_id = '2d29daf6-e151-479a-a52a-78b08bb3009d'::uuid) AND (indexation_subsidiary_id = '9f9df402-f70b-40d9-b283-a3c35232469a'::uuid) AND (creation_state = 'READY'::text) AND (processing_state = 'PENDING_INDEXATION'::text))
                     Rows Removed by Filter: 6142526
                     Buffers: shared hit=5765368 read=249016 dirtied=1845
                     I/O Timings: shared/local read=63290.786
               ...
 Planning Time: 6.280 ms
 Execution Time: 191.048 ms
(110 rows)

Versus in Postgres 11, there was a more efficient index being chosen:

 Limit  (cost=262056.35..262056.47 rows=50 width=44) (actual time=190.836..190.859 rows=50 loops=1)
   Buffers: shared hit=165374 dirtied=1
   ->  Sort  (cost=262056.35..262277.28 rows=88375 width=44) (actual time=190.835..190.848 rows=50 loops=1)
         Sort Key: inputdocum0_.created_at, inputdocum0_.reception_id, inputdocum0_.reception_order
         Sort Method: top-N heapsort  Memory: 31kB
         Buffers: shared hit=165374 dirtied=1
         ->  Append  (cost=0.29..259120.59 rows=88375 width=44) (actual time=5.260..190.514 rows=1078 loops=1)
               Buffers: shared hit=165374 dirtied=1
               ->  Index Scan using input_document_00_indexation_activity_id_indexation_domain_idx1 on input_document_00 inputdocum0__1  (cost=0.29..12844.44 rows=4386 width=44) (actual time=5.259..9.211 rows=56 loops=1)
                     Index Cond: (indexation_domain_id = '2d29daf6-e151-479a-a52a-78b08bb3009d'::uuid)
                     Filter: ((locked_at IS NULL) AND (locked_by_app IS NULL) AND (locked_by_user IS NULL) AND (lock_time_out IS NULL) AND active_content AND (archiving_state <> 'DESTROYED'::text) AND (indexation_subsidiary_id = '9f9df402-f70b-40d9-b283-a3c35232469a'::uuid) AND (creation_state = 'READY'::text))
                     Rows Removed by Filter: 6931
                     Buffers: shared hit=7989
               ...
 Planning Time: 8.371 ms
 Execution Time: 1663700.900 ms
(117 rows)

So we essentially see different indexes being used, in one case with the Incremental Sort node.

ORDER BY + LIMIT + Incremental Sort = wrong index choice

Now this is a variation of something that we've previously talked about, where you have an ORDER BY statement in Postgres, and you have a LIMIT and Postgres will then sometimes use an index that satisfies the ORDER BY condition instead of using an index that handles the WHERE clause.

Postgres does this because it thinks that it's more efficient to use a sorted result and essentially go through the rows. And once it got 50 in this case, it can terminate early.

So it doesn't actually think it needs to use the whole sorted result. Versus in the other case, it has to get all the rows that are matched by the WHERE condition and then do the sort. Now what surprised me and I didn't know before is that Incremental Sort actually affects the costing for cases like this.

A simplified example of when Incremental Sort causes bad plan costs

Richard Guo came up with an example that's a little bit simpler and shows the same problem:

create table t (a int, b int, c int, d int) partition by range (a);
create table tp1 partition of t for values from (0) to (1000);
create table tp2 partition of t for values from (1000) to (2000);

insert into t select i%2000, i%1000, i%300, i from generate_series(1,1000000) i;

create index on t(b);
analyze t;

So here we have a single table that has two partitions in it and Richard generated some data and then he creates two indexes.

Let's step through this example to show how the plans differ between 12 and 15, that I have running locally here. I'm going to start with 12 and what I can see on 12 is that our index condition is used as I would expect it to. Right. So this is the good plan.

explain analyze select * from t where b = 3 order by c, d limit 10;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2585.46..2585.48 rows=10 width=16) (actual time=2.089..2.092 rows=10 loops=1)
   ->  Sort  (cost=2585.46..2587.94 rows=993 width=16) (actual time=2.087..2.089 rows=10 loops=1)
         Sort Key: tp1.c, tp1.d
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Append  (cost=12.27..2564.00 rows=993 width=16) (actual time=0.203..1.894 rows=1000 loops=1)
               ->  Bitmap Heap Scan on tp1  (cost=12.27..1280.60 rows=497 width=16) (actual time=0.202..0.991 rows=500 loops=1)
                     Recheck Cond: (b = 3)
                     Heap Blocks: exact=500
                     ->  Bitmap Index Scan on tp1_b_idx  (cost=0.00..12.15 rows=497 width=0) (actual time=0.096..0.096 rows=500 loops=1)
                           Index Cond: (b = 3)
               ->  Bitmap Heap Scan on tp2  (cost=12.27..1278.43 rows=496 width=16) (actual time=0.205..0.799 rows=500 loops=1)
                     Recheck Cond: (b = 3)
                     Heap Blocks: exact=500
                     ->  Bitmap Index Scan on tp2_b_idx  (cost=0.00..12.14 rows=496 width=0) (actual time=0.141..0.141 rows=500 loops=1)
                           Index Cond: (b = 3)
 Planning Time: 0.318 ms
 Execution Time: 2.119 ms
(17 rows)

Now if I go and run the same thing on 15, we can see that on 15 this is significantly slower:

explain analyze select * from t where b = 3 order by c, d limit 10;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=143.06..571.03 rows=10 width=16) (actual time=94.123..94.125 rows=10 loops=1)
   ->  Incremental Sort  (cost=143.06..42683.17 rows=994 width=16) (actual time=94.121..94.122 rows=10 loops=1)
         Sort Key: t.c, t.d
         Presorted Key: t.c
         Full-sort Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
         Pre-sorted Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
         ->  Merge Append  (cost=0.85..42646.26 rows=994 width=16) (actual time=4.768..94.075 rows=335 loops=1)
               Sort Key: t.c
               ->  Index Scan using tp1_c_idx on tp1 t_1  (cost=0.42..21318.28 rows=499 width=16) (actual time=2.517..53.560 rows=168 loops=1)
                     Filter: (b = 3)
                     Rows Removed by Filter: 171537
               ->  Index Scan using tp2_c_idx on tp2 t_2  (cost=0.42..21318.03 rows=495 width=16) (actual time=2.250..40.480 rows=168 loops=1)
                     Filter: (b = 3)
                     Rows Removed by Filter: 171534
 Planning Time: 0.242 ms
 Execution Time: 94.146 ms
(16 rows)

Even in the simple example, on 12 we got 2 ms of runtime. On 15 we got 94 ms of runtime.

We can see that in the bad case, an Incremental Sort is chosen and Postgres doesn't actually use the indexes for filtering. Postgres does an Index Scan, but it essentially is not able to restrict how much results it needs to look at in the index, there is no index condition here, it's only used for its order.

And really the main difference here is that on 15, the exact same query uses that index for sorting and then with the Incremental Sort logic essentially incorrectly thinks that this is the cheaper path.

Unfortunately, we can see that even on Postgres 17, this exact same problem still exists:

explain analyze select * from t where b = 3 order by c, d limit 10;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=143.03..571.34 rows=10 width=16) (actual time=87.384..87.386 rows=10 loops=1)
   ->  Incremental Sort  (cost=143.03..42673.26 rows=993 width=16) (actual time=87.383..87.384 rows=10 loops=1)
         Sort Key: t.c, t.d
         Presorted Key: t.c
         Full-sort Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
         Pre-sorted Groups: 1  Sort Method: top-N heapsort  Average Memory: 25kB  Peak Memory: 25kB
         ->  Merge Append  (cost=0.85..42646.27 rows=993 width=16) (actual time=3.955..87.352 rows=335 loops=1)
               Sort Key: t.c
               ->  Index Scan using tp1_c_idx on tp1 t_1  (cost=0.42..21318.28 rows=495 width=16) (actual time=2.076..49.469 rows=168 loops=1)
                     Filter: (b = 3)
                     Rows Removed by Filter: 171537
               ->  Index Scan using tp2_c_idx on tp2 t_2  (cost=0.42..21318.06 rows=498 width=16) (actual time=1.879..37.860 rows=168 loops=1)
                     Filter: (b = 3)
                     Rows Removed by Filter: 171534
 Planning Time: 1.369 ms
 Execution Time: 87.461 ms
(16 rows)

Turning Incremental Sort off, if needed

You can set "enable_incremental_sort" to off, and rerun the EXPLAIN, and now here on 17, we can get the fast query plan:

SET enable_incremental_sort = off;
explain analyze select * from t where b = 3 order by c, d limit 10;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2575.30..2575.33 rows=10 width=16) (actual time=4.602..4.606 rows=10 loops=1)
   ->  Sort  (cost=2575.30..2577.79 rows=993 width=16) (actual time=4.601..4.603 rows=10 loops=1)
         Sort Key: t.c, t.d
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Append  (cost=8.26..2553.85 rows=993 width=16) (actual time=0.619..4.467 rows=1000 loops=1)
               ->  Bitmap Heap Scan on tp1 t_1  (cost=8.26..1272.26 rows=495 width=16) (actual time=0.618..3.556 rows=500 loops=1)
                     Recheck Cond: (b = 3)
                     Heap Blocks: exact=500
                     ->  Bitmap Index Scan on tp1_b_idx  (cost=0.00..8.14 rows=495 width=0) (actual time=0.219..0.220 rows=500 loops=1)
                           Index Cond: (b = 3)
               ->  Bitmap Heap Scan on tp2 t_2  (cost=8.28..1276.62 rows=498 width=16) (actual time=0.246..0.798 rows=500 loops=1)
                     Recheck Cond: (b = 3)
                     Heap Blocks: exact=500
                     ->  Bitmap Index Scan on tp2_b_idx  (cost=0.00..8.16 rows=498 width=0) (actual time=0.142..0.142 rows=500 loops=1)
                           Index Cond: (b = 3)
 Planning Time: 0.358 ms
 Execution Time: 4.827 ms
(17 rows)

The estimate_num_groups function, and how it can mis-estimate

Why is this happening? Richard explains on the mailing list post when Postgres calculates the cost of Incremental Sort, it needs to estimate the number of groups into which the relation is divided by the existing pre-sorted keys to then calculate the cost of sorting a single group:

When we calculate the cost of incremental sort, we need to estimate the number of groups into which the relation is divided by the presorted keys, and then calculate the cost of sorting a single group. If we over-estimate the number of groups, the startup cost of incremental sort would be under-estimated.

In the first plan above, the number of groups with equal 't.c' is estimated to be 300 by estimate_num_groups(), but is actually 3 after applying the restriction 'b = 3'. As a result, the startup cost of the incremental sort is estimated to be 143.03, but is actually 14222.68. That's why the planner mistakenly thinks the increment sort path is the cheaper one.

As Richard notes, Postgres overestimated the number of groups in the initial sort. And that way the startup cost of the Incremental Sort is actually underestimated and that ultimately causes us to use the incorrect index here.

The "estimate_num_groups" function in Postgres is a little bit of a potential hazard I would argue, because it really works off of limited data. Postgres doesn't have good statistics about cross table correlation, and a lot of the data that is available on single expressions, like, for example, when you use CREATE STATISTICS, doesn't really make it up into this part of Postgres.

The further discussion on the community mailing lists essentially revolved around how could Postgres handle this better?

Tomas Vondra summarized this well, which is the challenge is where to get usable information about correlation between columns. This is only a problem when columns are correlated. And unfortunately right now, this discussion is still ongoing, so we don't really have a patch.

So if you have correlated columns that you're grouping by, where you see an Incremental Sort node come up, you may want to try SET enable_incremental_sort = off for your query and see how that impacts the performance.

Presorted aggregates in Postgres 16

Let's talk about one other case of issues with Incremental Sort. In Postgres 16, David Rowley added a new setting called enable_presorted_aggregate. That allows the planner to do more efficient execution of aggregate functions that have an ORDER BY or DISTINCT clause. This is generally a good thing.

However there is the potential for the planner to make a bad choice. And as David notes in the commit, costing for Incremental Sort is not perfect as it assumes an even distribution of rows to sort within each sort group:

because the planner has the flexibility to instead choose a plan which just performs a full sort on both the GROUP BY and ORDER BY / DISTINCT aggregate columns, there is potential for the planner to make a bad choice. The costing for Incremental Sort is not perfect as it assumes an even distribution of rows to sort within each sort group.

Here we add an escape hatch in the form of the enable_presorted_aggregate GUC. This will allow users to get the pre-PG16 behavior in cases where they have no other means to convince the query planner to produce a plan which only sorts on the GROUP BY column(s).

A worse query plan, even when turning enable_incremental_sort off

Now, let's look at an example here that Pavel Luzanov posted on the mailing list. We have a simple table with three columns. And we have an index on two of the columns:

create table t (a text, b text, c text);
insert into t (a,b,c) select x,y,x from generate_series(1,100) as x, generate_series(1,10000) y;
create index on t (a,b);
vacuum analyze t;

We have a query that has a GROUP BY a, but inside that grouping, it does an array aggregate of the c column, ordered by c:

explain (analyze, buffers)
select a, array_agg(c order by c) from t group by a;

And so we can see here on Postgres 15, what we're getting is an index scan and then there is a GroupAggregate on top of that index scan:

                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.42..46685.65 rows=100 width=34) (actual time=7.127..283.200 rows=100 loops=1)
   Group Key: a
   Buffers: shared hit=193387 read=2745
   ->  Index Scan using t_a_b_idx on t  (cost=0.42..41684.40 rows=1000000 width=4) (actual time=0.037..119.402 rows=1000000 loops=1)
         Buffers: shared hit=193387 read=2745
 Planning:
   Buffers: shared hit=26
 Planning Time: 0.162 ms
 Execution Time: 283.256 ms
(9 rows)

Now we're going to repeat the same on Postgres 16. Here, we can see that the execution time actually went up significantly:

explain (analyze, buffers) select a, array_agg(c order by c) from t group by a;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1081.70..125631.39 rows=100 width=34) (actual time=15.902..405.652 rows=100 loops=1)
   Group Key: a
   Buffers: shared hit=196132
   ->  Incremental Sort  (cost=1081.70..120630.14 rows=1000000 width=4) (actual time=7.317..293.574 rows=1000000 loops=1)
         Sort Key: a, c
         Presorted Key: a
         Full-sort Groups: 100  Sort Method: quicksort  Average Memory: 27kB  Peak Memory: 27kB
         Pre-sorted Groups: 100  Sort Method: quicksort  Average Memory: 697kB  Peak Memory: 697kB
         Buffers: shared hit=196132
         ->  Index Scan using t_a_b_idx on t  (cost=0.42..41689.58 rows=1000000 width=4) (actual time=0.034..123.358 rows=1000000 loops=1)
               Buffers: shared hit=196132
 Planning Time: 0.115 ms
 Execution Time: 405.705 ms
(13 rows)

So we're going from 280 ms on Postgres 15, to 400 ms on Postgres 16. And really the difference here is the Incremental Sort node.

Now you could go ahead and turn off Incremental Sort, but what's really surprising here is that despite us turning off Incremental Sort, we're still getting a worse result than in Postgres 15:

SET enable_incremental_sort = off;
explain (analyze, buffers) select a, array_agg(c order by c) from t group by a;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=128728.34..136229.59 rows=100 width=34) (actual time=245.198..478.368 rows=100 loops=1)
   Group Key: a
   Buffers: shared hit=5396, temp read=1950 written=1964
   ->  Sort  (cost=128728.34..131228.34 rows=1000000 width=4) (actual time=242.872..372.925 rows=1000000 loops=1)
         Sort Key: a, c
         Sort Method: external merge  Disk: 15600kB
         Buffers: shared hit=5396, temp read=1950 written=1964
         ->  Seq Scan on t  (cost=0.00..15396.00 rows=1000000 width=4) (actual time=0.011..84.811 rows=1000000 loops=1)
               Buffers: shared hit=5396
 Planning Time: 0.068 ms
 Execution Time: 479.285 ms
(11 rows)

And this is explained because the change that was made in Postgres 16, actually modified how Postgres prioritizes which sort order it needs. And so in order to help Incremental Sort, unfortunately this particular case was penalized, even if you turn off Incremental Sort.

Using the enable_presorted_aggregate setting

This is what the new enable_presorted_aggregate Postgres setting is for:

SET enable_presorted_aggregate = off;
explain (analyze, buffers) select a, array_agg(c order by c) from t group by a;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.42..46690.83 rows=100 width=34) (actual time=9.334..270.362 rows=100 loops=1)
   Group Key: a
   Buffers: shared hit=196132
   ->  Index Scan using t_a_b_idx on t  (cost=0.42..41689.58 rows=1000000 width=4) (actual time=0.028..109.772 rows=1000000 loops=1)
         Buffers: shared hit=196132
 Planning Time: 0.130 ms
 Execution Time: 270.410 ms
(7 rows)

We can see that now Postgres will pick an index scan and does the group aggregate, and because we're skipping that unnecessary sort step, because the GroupAggregate already can do the work, Postgres will now again have the fast query plan of 270 milliseconds.

So be aware that in this case, it's not enough to turn Incremental Sort off, you might actually need to use that special enable_presorted_aggregate setting to turn off the problematic behavior.

I hope you learned something new from E120 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres


Enjoy blog posts like this?

Get them once a month to your inbox