Postgres Planner Quirks: Incremental Sort, and when it goes wrong
In today’s E120 of “5mins of Postgres” we're returning to our Postgres planner quirks series to talk about Incremental Sort, and when it goes wrong.
Incremental Sort can often speed up query plans when you have an existing sort order; however, there can be edge cases where the planner chooses a sub-optimal plan.
Transcript
To be clear, Incremental Sort is a great feature that was added in Postgres 13, and it can often speed up query plans when you have an existing sort order, for example from a B-tree index.
The benefits of Incremental Sort
A simple example might be an index scan where the index provides an existing sort order that Postgres can reuse, so Postgres only needs to sort by one additional column. There are more complex cases as well: for example, imagine you have a GROUP BY, or an aggregate function inside your query that needs its input in a particular order.
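To make this concrete, here is a minimal sketch (my own example, not from the episode; the orders table and its data are made up) of an Incremental Sort appearing when an index covers a prefix of the ORDER BY:

create table orders (id bigint, created_at timestamptz);
insert into orders
  select i, now() - (i % 1000) * interval '1 second'
  from generate_series(1, 100000) i;
create index on orders (created_at);
analyze orders;

-- The index already delivers rows ordered by created_at, so Postgres only
-- has to sort by id within each group of equal created_at values:
explain select * from orders order by created_at, id limit 10;

On Postgres 13 and newer, you would typically see a plan like this:

Limit
  -> Incremental Sort
       Sort Key: created_at, id
       Presorted Key: created_at
       -> Index Scan using orders_created_at_idx on orders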
We talked previously on 5mins of Postgres about all the improvements that were added in Postgres 16 around Incremental Sort, and how they, for example, make DISTINCT queries faster, as well as ORDER BY and DISTINCT aggregates.
However, I've recently come across a couple of cases where Incremental Sort actually makes performance worse. I want to explain when these cases happen, so that you can spot them more easily when you upgrade your Postgres.
Now to be clear, this is most likely to affect you if you're upgrading from before Postgres 13, where Incremental Sort was first added, or if you're upgrading to Postgres 16, where there are a couple of improvements regarding Incremental Sort that make it more likely to see an Incremental Sort node in your query plans.
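One way to spot these regressions after an upgrade (a sketch, assuming you can adjust the server configuration): auto_explain logs the plans of slow queries, so you can check whether newly slow queries contain an Incremental Sort node:

-- postgresql.conf
shared_preload_libraries = 'auto_explain'  -- requires a restart
auto_explain.log_min_duration = '1s'       -- log plans of queries running over 1s
auto_explain.log_analyze = on              -- include actual row counts and timings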
A real-world incident after upgrading to Postgres 15 on Amazon RDS
We'll start with this bug report by Nicolas from Loxodata.
In this example, Nicolas walks us through how a customer of theirs did a migration from 11 to 15 on Amazon RDS and they noticed a degradation in response time on the query:
SELECT inputdocum0_.id AS col_0_0_
FROM document_management_services.input_document inputdocum0_
WHERE (inputdocum0_.indexation_domain_id in ('2d29daf6-e151-479a-a52a-78b08bb3009d'))
AND (inputdocum0_.indexation_subsidiary_id in ('9f9df402-f70b-40d9-b283-a3c35232469a'))
AND (inputdocum0_.locked_at IS NULL)
AND (inputdocum0_.locked_by_app IS NULL)
AND (inputdocum0_.locked_by_user IS NULL)
AND (inputdocum0_.lock_time_out IS NULL)
AND inputdocum0_.archiving_state <> 'DESTROYED'
AND (inputdocum0_.creation_state in ('READY'))
AND inputdocum0_.active_content = true
AND (inputdocum0_.processing_state in ('PENDING_INDEXATION'))
ORDER BY inputdocum0_.created_at ASC,
inputdocum0_.reception_id ASC,
inputdocum0_.reception_order ASC
LIMIT 50;
The query went from a few seconds of runtime to 10 minutes.
They of course ran VACUUM ANALYZE, and that all seemed good, but when they ran an EXPLAIN ANALYZE, Postgres showed a change of plan. In Postgres 15 it shows an Incremental Sort, corresponding to the ORDER BY clause on the created_at column:
Limit (cost=324.05..16073.82 rows=50 width=44) (actual time=1663688.290..1663696.151 rows=50 loops=1)
Buffers: shared hit=114672881 read=5725197 dirtied=38564 written=24394
I/O Timings: shared/local read=1481378.069 write=313.574
-> Incremental Sort (cost=324.05..27838050.13 rows=88375 width=44) (actual time=1663688.289..1663696.144 rows=50 loops=1)
Sort Key: inputdocum0_.created_at, inputdocum0_.reception_id, inputdocum0_.reception_order
Presorted Key: inputdocum0_.created_at
Full-sort Groups: 2 Sort Method: quicksort Average Memory: 27kB Peak Memory: 27kB
Buffers: shared hit=114672881 read=5725197 dirtied=38564 written=24394
I/O Timings: shared/local read=1481378.069 write=313.574
-> Merge Append (cost=9.08..27834073.25 rows=88375 width=44) (actual time=1663655.490..1663696.068 rows=53 loops=1)
Sort Key: inputdocum0_.created_at
Buffers: shared hit=114672878 read=5725197 dirtied=38564 written=24394
I/O Timings: shared/local read=1481378.069 write=313.574
-> Index Scan using input_document_00_created_at_idx on input_document_00 inputdocum0__1 (cost=0.43..1393473.44 rows=4386 width=44) (actual time=71549.638..71550.023 rows=4 loops=1)
Filter: ((locked_at IS NULL) AND (locked_by_app IS NULL) AND (locked_by_user IS NULL) AND (lock_time_out IS NULL) AND active_content AND (archiving_state <> 'DESTROYED'::text) AND (indexation_domain_id = '2d29daf6-e151-479a-a52a-78b08bb3009d'::uuid) AND (indexation_subsidiary_id = '9f9df402-f70b-40d9-b283-a3c35232469a'::uuid) AND (creation_state = 'READY'::text) AND (processing_state = 'PENDING_INDEXATION'::text))
Rows Removed by Filter: 6142526
Buffers: shared hit=5765368 read=249016 dirtied=1845
I/O Timings: shared/local read=63290.786
...
Planning Time: 6.280 ms
Execution Time: 1663700.900 ms
(110 rows)
Versus in Postgres 11, where a more efficient index was chosen:
Limit (cost=262056.35..262056.47 rows=50 width=44) (actual time=190.836..190.859 rows=50 loops=1)
Buffers: shared hit=165374 dirtied=1
-> Sort (cost=262056.35..262277.28 rows=88375 width=44) (actual time=190.835..190.848 rows=50 loops=1)
Sort Key: inputdocum0_.created_at, inputdocum0_.reception_id, inputdocum0_.reception_order
Sort Method: top-N heapsort Memory: 31kB
Buffers: shared hit=165374 dirtied=1
-> Append (cost=0.29..259120.59 rows=88375 width=44) (actual time=5.260..190.514 rows=1078 loops=1)
Buffers: shared hit=165374 dirtied=1
-> Index Scan using input_document_00_indexation_activity_id_indexation_domain_idx1 on input_document_00 inputdocum0__1 (cost=0.29..12844.44 rows=4386 width=44) (actual time=5.259..9.211 rows=56 loops=1)
Index Cond: (indexation_domain_id = '2d29daf6-e151-479a-a52a-78b08bb3009d'::uuid)
Filter: ((locked_at IS NULL) AND (locked_by_app IS NULL) AND (locked_by_user IS NULL) AND (lock_time_out IS NULL) AND active_content AND (archiving_state <> 'DESTROYED'::text) AND (indexation_subsidiary_id = '9f9df402-f70b-40d9-b283-a3c35232469a'::uuid) AND (creation_state = 'READY'::text))
Rows Removed by Filter: 6931
Buffers: shared hit=7989
...
Planning Time: 8.371 ms
Execution Time: 191.048 ms
(117 rows)
So we essentially see different indexes being used, in one case with the Incremental Sort node.
ORDER BY + LIMIT + Incremental Sort = wrong index choice
Now this is a variation of something that we've previously talked about: when you have an ORDER BY clause in Postgres together with a LIMIT, Postgres will sometimes use an index that satisfies the ORDER BY condition instead of an index that handles the WHERE clause.
Postgres does this because it thinks it's more efficient to walk through an already-sorted result, and once it has found 50 rows (in this case), it can terminate early. That means it doesn't expect to need the whole sorted result. In the other case, it has to fetch all the rows that match the WHERE condition and then do the sort. Now, what surprised me, and what I didn't know before, is that Incremental Sort actually affects the costing for cases like this.
A simplified example of when Incremental Sort causes bad plan costs
Richard Guo came up with an example that's a little bit simpler and shows the same problem:
create table t (a int, b int, c int, d int) partition by range (a);
create table tp1 partition of t for values from (0) to (1000);
create table tp2 partition of t for values from (1000) to (2000);
insert into t select i%2000, i%1000, i%300, i from generate_series(1,1000000) i;
create index on t(b);
create index on t(c);
analyze t;
So here we have a single partitioned table with two partitions; Richard generates some data and then creates two indexes, one on column b and one on column c.
Let's step through this example to show how the plans differ between Postgres 12 and 15, both of which I have running locally. I'm going to start with 12, and what I can see on 12 is that our index condition is used as I would expect. This is the good plan:
explain analyze select * from t where b = 3 order by c, d limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2585.46..2585.48 rows=10 width=16) (actual time=2.089..2.092 rows=10 loops=1)
-> Sort (cost=2585.46..2587.94 rows=993 width=16) (actual time=2.087..2.089 rows=10 loops=1)
Sort Key: tp1.c, tp1.d
Sort Method: top-N heapsort Memory: 25kB
-> Append (cost=12.27..2564.00 rows=993 width=16) (actual time=0.203..1.894 rows=1000 loops=1)
-> Bitmap Heap Scan on tp1 (cost=12.27..1280.60 rows=497 width=16) (actual time=0.202..0.991 rows=500 loops=1)
Recheck Cond: (b = 3)
Heap Blocks: exact=500
-> Bitmap Index Scan on tp1_b_idx (cost=0.00..12.15 rows=497 width=0) (actual time=0.096..0.096 rows=500 loops=1)
Index Cond: (b = 3)
-> Bitmap Heap Scan on tp2 (cost=12.27..1278.43 rows=496 width=16) (actual time=0.205..0.799 rows=500 loops=1)
Recheck Cond: (b = 3)
Heap Blocks: exact=500
-> Bitmap Index Scan on tp2_b_idx (cost=0.00..12.14 rows=496 width=0) (actual time=0.141..0.141 rows=500 loops=1)
Index Cond: (b = 3)
Planning Time: 0.318 ms
Execution Time: 2.119 ms
(17 rows)
Now if I go and run the same thing on 15, we can see that on 15 this is significantly slower:
explain analyze select * from t where b = 3 order by c, d limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=143.06..571.03 rows=10 width=16) (actual time=94.123..94.125 rows=10 loops=1)
-> Incremental Sort (cost=143.06..42683.17 rows=994 width=16) (actual time=94.121..94.122 rows=10 loops=1)
Sort Key: t.c, t.d
Presorted Key: t.c
Full-sort Groups: 1 Sort Method: top-N heapsort Average Memory: 25kB Peak Memory: 25kB
Pre-sorted Groups: 1 Sort Method: top-N heapsort Average Memory: 25kB Peak Memory: 25kB
-> Merge Append (cost=0.85..42646.26 rows=994 width=16) (actual time=4.768..94.075 rows=335 loops=1)
Sort Key: t.c
-> Index Scan using tp1_c_idx on tp1 t_1 (cost=0.42..21318.28 rows=499 width=16) (actual time=2.517..53.560 rows=168 loops=1)
Filter: (b = 3)
Rows Removed by Filter: 171537
-> Index Scan using tp2_c_idx on tp2 t_2 (cost=0.42..21318.03 rows=495 width=16) (actual time=2.250..40.480 rows=168 loops=1)
Filter: (b = 3)
Rows Removed by Filter: 171534
Planning Time: 0.242 ms
Execution Time: 94.146 ms
(16 rows)
Even in the simple example, on 12 we got 2 ms of runtime. On 15 we got 94 ms of runtime.
We can see that in the bad case, an Incremental Sort is chosen and Postgres doesn't actually use the indexes for filtering. Postgres does an Index Scan, but it essentially is not able to restrict how many rows it needs to look at in the index: there is no index condition here, the index is only used for its sort order.
And really the main difference is that on 15, the exact same query uses that index for sorting, and the Incremental Sort costing logic incorrectly concludes that this is the cheaper path.
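As an aside, a possible mitigation in cases like this (my own sketch, not something from the bug report or the mailing list thread) is an index that serves both the WHERE clause and the ORDER BY, so the planner doesn't have to trade filtering against sort order:

create index on t (b, c, d);
-- Each partition can now be scanned with Index Cond: (b = 3) while already
-- returning rows ordered by (c, d); a Merge Append on top preserves that
-- order across partitions, so no Sort node is needed at all.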
Unfortunately, we can see that even on Postgres 17, this exact same problem still exists:
explain analyze select * from t where b = 3 order by c, d limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=143.03..571.34 rows=10 width=16) (actual time=87.384..87.386 rows=10 loops=1)
-> Incremental Sort (cost=143.03..42673.26 rows=993 width=16) (actual time=87.383..87.384 rows=10 loops=1)
Sort Key: t.c, t.d
Presorted Key: t.c
Full-sort Groups: 1 Sort Method: top-N heapsort Average Memory: 25kB Peak Memory: 25kB
Pre-sorted Groups: 1 Sort Method: top-N heapsort Average Memory: 25kB Peak Memory: 25kB
-> Merge Append (cost=0.85..42646.27 rows=993 width=16) (actual time=3.955..87.352 rows=335 loops=1)
Sort Key: t.c
-> Index Scan using tp1_c_idx on tp1 t_1 (cost=0.42..21318.28 rows=495 width=16) (actual time=2.076..49.469 rows=168 loops=1)
Filter: (b = 3)
Rows Removed by Filter: 171537
-> Index Scan using tp2_c_idx on tp2 t_2 (cost=0.42..21318.06 rows=498 width=16) (actual time=1.879..37.860 rows=168 loops=1)
Filter: (b = 3)
Rows Removed by Filter: 171534
Planning Time: 1.369 ms
Execution Time: 87.461 ms
(16 rows)
Turning Incremental Sort off, if needed
You can set "enable_incremental_sort" to off and rerun the EXPLAIN, and now, here on 17, we get the fast query plan:
SET enable_incremental_sort = off;
explain analyze select * from t where b = 3 order by c, d limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2575.30..2575.33 rows=10 width=16) (actual time=4.602..4.606 rows=10 loops=1)
-> Sort (cost=2575.30..2577.79 rows=993 width=16) (actual time=4.601..4.603 rows=10 loops=1)
Sort Key: t.c, t.d
Sort Method: top-N heapsort Memory: 25kB
-> Append (cost=8.26..2553.85 rows=993 width=16) (actual time=0.619..4.467 rows=1000 loops=1)
-> Bitmap Heap Scan on tp1 t_1 (cost=8.26..1272.26 rows=495 width=16) (actual time=0.618..3.556 rows=500 loops=1)
Recheck Cond: (b = 3)
Heap Blocks: exact=500
-> Bitmap Index Scan on tp1_b_idx (cost=0.00..8.14 rows=495 width=0) (actual time=0.219..0.220 rows=500 loops=1)
Index Cond: (b = 3)
-> Bitmap Heap Scan on tp2 t_2 (cost=8.28..1276.62 rows=498 width=16) (actual time=0.246..0.798 rows=500 loops=1)
Recheck Cond: (b = 3)
Heap Blocks: exact=500
-> Bitmap Index Scan on tp2_b_idx (cost=0.00..8.16 rows=498 width=0) (actual time=0.142..0.142 rows=500 loops=1)
Index Cond: (b = 3)
Planning Time: 0.358 ms
Execution Time: 4.827 ms
(17 rows)
The estimate_num_groups function, and how it can mis-estimate
Why is this happening? Richard explains in the mailing list post that when Postgres calculates the cost of an Incremental Sort, it needs to estimate the number of groups into which the relation is divided by the existing pre-sorted keys, to then calculate the cost of sorting a single group:
In the first plan above, the number of groups with equal 't.c' is estimated to be 300 by estimate_num_groups(), but is actually 3 after applying the restriction 'b = 3'. As a result, the startup cost of the incremental sort is estimated to be 143.03, but is actually 14222.68. That's why the planner mistakenly thinks the increment sort path is the cheaper one.
As Richard notes, Postgres overestimated the number of groups in the pre-sorted input. Because of that, the startup cost of the Incremental Sort is underestimated, and that ultimately causes Postgres to choose the incorrect index here.
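You can see the correlation that breaks this estimate directly in Richard's data set, since both b and c are derived from the same generate_series value:

select count(distinct c) from t;              -- 300 distinct values overall
select count(distinct c) from t where b = 3;  -- only 3, because b and c correlate
-- estimate_num_groups() works from the per-column statistics (300 groups) and
-- doesn't know that the b = 3 restriction collapses this down to 3 groups.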
The "estimate_num_groups" function in Postgres is a little bit of a potential hazard I would argue, because it really works off of limited data. Postgres doesn't have good statistics about cross table correlation, and a lot of the data that is available on single expressions, like, for example, when you use CREATE STATISTICS
, doesn't really make it up into this part of Postgres.
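For illustration, this is what extended statistics on those columns would look like against Richard's table; note that, per the discussion here (and my own reading of the thread), they would not rescue this particular case:

create statistics t_b_c_stats (ndistinct, dependencies) on b, c from t;
analyze t;
-- The planner now has multi-column distinct counts and functional-dependency
-- information for (b, c), but the Incremental Sort group estimate for the
-- single pre-sorted key c still comes from the per-column statistics.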
The further discussion on the community mailing lists essentially revolved around how Postgres could handle this better. Tomas Vondra summarized it well: the challenge is where to get usable information about the correlation between columns, because this is only a problem when columns are correlated. Unfortunately, that discussion is still ongoing, so right now we don't have a patch.
So if you have correlated columns that you're sorting or grouping by, and you see an Incremental Sort node come up, you may want to try SET enable_incremental_sort = off for your query and see how that impacts performance.
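One way to limit the blast radius (a sketch): SET LOCAL scopes the change to the current transaction, so the rest of your workload keeps the default planner behavior:

begin;
set local enable_incremental_sort = off;
select * from t where b = 3 order by c, d limit 10;
commit;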
Presorted aggregates in Postgres 16
Let's talk about one other case of issues with Incremental Sort. In Postgres 16, David Rowley added a new setting called enable_presorted_aggregate. When it's on (which is the default), the planner will try to provide pre-sorted input to aggregate functions that have an ORDER BY or DISTINCT clause, so they can be executed more efficiently. This is generally a good thing.
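As a sketch of the case this optimization targets (my own example, with a made-up table s): when an index provides the order the aggregate wants, the aggregate no longer needs to sort its input rows itself:

create table s (a text, b text);
insert into s select x::text, y::text from generate_series(1,100) x, generate_series(1,1000) y;
create index on s (a, b);
vacuum analyze s;

-- With enable_presorted_aggregate = on (the Postgres 16 default), the planner
-- can feed the aggregate rows already sorted by (a, b), so the
-- string_agg(... order by b) should not need an extra sort per group:
explain select a, string_agg(b, ',' order by b) from s group by a;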
However, there is the potential for the planner to make a bad choice. As David notes in the commit, costing for Incremental Sort is not perfect, as it assumes an even distribution of rows to sort within each sort group:
Here we add an escape hatch in the form of the enable_presorted_aggregate GUC. This will allow users to get the pre-PG16 behavior in cases where they have no other means to convince the query planner to produce a plan which only sorts on the GROUP BY column(s).
A worse query plan, even when turning enable_incremental_sort off
Now, let's look at an example that Pavel Luzanov posted on the mailing list. We have a simple table with three columns, and an index on two of the columns:
create table t (a text, b text, c text);
insert into t (a,b,c) select x::text, y::text, x::text from generate_series(1,100) as x, generate_series(1,10000) y;
create index on t (a,b);
vacuum analyze t;
We have a query that has a GROUP BY a, but inside that grouping, it does an array aggregate of the c column, ordered by c:
explain (analyze, buffers)
select a, array_agg(c order by c) from t group by a;
And so we can see here on Postgres 15 that we're getting an index scan, with a GroupAggregate on top of that index scan:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.42..46685.65 rows=100 width=34) (actual time=7.127..283.200 rows=100 loops=1)
Group Key: a
Buffers: shared hit=193387 read=2745
-> Index Scan using t_a_b_idx on t (cost=0.42..41684.40 rows=1000000 width=4) (actual time=0.037..119.402 rows=1000000 loops=1)
Buffers: shared hit=193387 read=2745
Planning:
Buffers: shared hit=26
Planning Time: 0.162 ms
Execution Time: 283.256 ms
(9 rows)
Now we're going to repeat the same on Postgres 16. Here, we can see that the execution time actually went up significantly:
explain (analyze, buffers) select a, array_agg(c order by c) from t group by a;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1081.70..125631.39 rows=100 width=34) (actual time=15.902..405.652 rows=100 loops=1)
Group Key: a
Buffers: shared hit=196132
-> Incremental Sort (cost=1081.70..120630.14 rows=1000000 width=4) (actual time=7.317..293.574 rows=1000000 loops=1)
Sort Key: a, c
Presorted Key: a
Full-sort Groups: 100 Sort Method: quicksort Average Memory: 27kB Peak Memory: 27kB
Pre-sorted Groups: 100 Sort Method: quicksort Average Memory: 697kB Peak Memory: 697kB
Buffers: shared hit=196132
-> Index Scan using t_a_b_idx on t (cost=0.42..41689.58 rows=1000000 width=4) (actual time=0.034..123.358 rows=1000000 loops=1)
Buffers: shared hit=196132
Planning Time: 0.115 ms
Execution Time: 405.705 ms
(13 rows)
So we're going from 280 ms on Postgres 15, to 400 ms on Postgres 16. And really the difference here is the Incremental Sort node.
Now you could go ahead and turn off Incremental Sort, but what's really surprising here is that despite us turning off Incremental Sort, we're still getting a worse result than in Postgres 15:
SET enable_incremental_sort = off;
explain (analyze, buffers) select a, array_agg(c order by c) from t group by a;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=128728.34..136229.59 rows=100 width=34) (actual time=245.198..478.368 rows=100 loops=1)
Group Key: a
Buffers: shared hit=5396, temp read=1950 written=1964
-> Sort (cost=128728.34..131228.34 rows=1000000 width=4) (actual time=242.872..372.925 rows=1000000 loops=1)
Sort Key: a, c
Sort Method: external merge Disk: 15600kB
Buffers: shared hit=5396, temp read=1950 written=1964
-> Seq Scan on t (cost=0.00..15396.00 rows=1000000 width=4) (actual time=0.011..84.811 rows=1000000 loops=1)
Buffers: shared hit=5396
Planning Time: 0.068 ms
Execution Time: 479.285 ms
(11 rows)
This is explained by the change that was made in Postgres 16: it modified how Postgres prioritizes which sort order it asks for. In order to help Incremental Sort, this particular case was unfortunately penalized, even when you turn Incremental Sort off.
Using the enable_presorted_aggregate setting
This is what the new enable_presorted_aggregate Postgres setting is for:
SET enable_presorted_aggregate = off;
explain (analyze, buffers) select a, array_agg(c order by c) from t group by a;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.42..46690.83 rows=100 width=34) (actual time=9.334..270.362 rows=100 loops=1)
Group Key: a
Buffers: shared hit=196132
-> Index Scan using t_a_b_idx on t (cost=0.42..41689.58 rows=1000000 width=4) (actual time=0.028..109.772 rows=1000000 loops=1)
Buffers: shared hit=196132
Planning Time: 0.130 ms
Execution Time: 270.410 ms
(7 rows)
We can see that Postgres now picks an index scan and does the GroupAggregate on top of it. Because we're skipping that unnecessary sort step, since the GroupAggregate can already do the per-group ordering work itself, Postgres again gives us the fast query plan, at 270 milliseconds.
So be aware that in this case, it's not enough to turn Incremental Sort off; you might actually need to use that special enable_presorted_aggregate setting to turn off the problematic behavior.
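If a whole application is affected, you don't have to change this per session; as a sketch (the role and database names here are made up), the setting can also be applied at the role or database level:

alter role app_user set enable_presorted_aggregate = off;
-- or:
alter database app_db set enable_presorted_aggregate = off;
-- New connections pick this up automatically; existing sessions keep
-- their current value until they reconnect.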
I hope you learned something new from E120 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!
What we have discussed in this episode of 5mins of Postgres
- "Implement Incremental Sort" commit by Tomas Vondra in Postgres 13
- "Incremental Sort" node in EXPLAIN output
- 5mins of Postgres - Faster query plans with Postgres 16: Incremental Sorts, Anti-JOINs and more
- "Planner chooses incremental [sort] but not the best [index]" discussion on pgsql-hackers
- "estimate_num_groups" function in Postgres
- "Add enable_presorted_aggregate GUC" commit by David Rowley in Postgres 16
- "Add proper planner support for ORDER BY / DISTINCT aggregates" discussion on pgsql-hackers