Introducing Query Tuning Workbooks to safely tune Postgres queries on production with pganalyze!

Waiting for Postgres 17: Better Query Plans for Materialized CTE Scans

In today’s E118 of “5mins of Postgres” we discuss two changes in the upcoming Postgres 17 release that improve query plans for queries that involve CTEs. This can improve query plans where you would see an explicit CTE scan, due to use of the MATERIALIZED keyword, or because Postgres wasn't able to pull up a query to the upper plan level.



Share this episode: Click here to share this episode on LinkedIn. Feel free to sign up for our newsletter and subscribe to our YouTube channel.


Transcript

Column statistics from Materialized CTE scans

If you're working with queries that contain explicit CTE scans, one thing that might've been surprising to you in the past, if you took a really close look at it, is that the column statistics from columns referenced in the CTE scan weren't being propagated upwards. So that means if you're working with the results of that CTE scan, Postgres will oftentimes have a mis-estimate because the selectivity calculations will be wrong.

This commit by Tom Lane, co-authored by Jian Guo from last November fixes that. This commit is in Postgres 17 beta1, and we can likely expect this to be in the final Postgres 17 release in September or October.

Let's take a closer look at the background here. Hans Buschmann posted to the mailing list last year, where he was describing wrong row estimations with joins of CTEs slowing down queries by a factor of more than 500.

It was a little bit of a long report, but it was enough for community member Tomas Vondra to produce a reproducer and Jian Guo to propose a patch that addresses this.

What is the intent of MATERIALIZED?

The patch here isn't necessarily complicated, but there was some discussion around what does the MATERIALIZED keyword actually mean in a CTE? Tom Lane summarized it as:

When we say that MATERIALIZED is meant as an optimization fence, what we mostly mean is that the upper query shouldn't influence the choice of plan for the sub-query. However, we surely allow our statistics or guesses for the sub-query to subsequently influence what the upper planner does. If that weren't true, we shouldn't even expose any non-default rowcount guess to the upper planner --- but that would lead to really horrid results, so we allow that information to percolate up from the sub-query. It seems like exposing column statistics to the upper planner, as the proposed patch does, isn't fundamentally different from exposing rowcount estimates.

Information can propagate up to the outer planner level, but not down into the CTE plan.

There was one more discussion item here after the patch got committed, which is the fact that path keys are not being propagated upwards from a CTE either. We'll get back to that a bit later.

First of all though, I want to highlight one more example where this will actually be improved that we can test ourselves.

Another example with a reproducer

This is a report by Yan Wu to the Postgres bugs mailing list. And one thing I like about this report is that it has a reproducer even though Yan says this was originally seen on AWS Aurora, which usually community doesn't like it if you ask Aurora questions because you should ask AWS. But in this case, Yan went through the effort of actually building a small reproducer. And so you should do this, if you use managed service providers, don't expect the community to do the leg work for you.

But if you can, provide a small reproducer like this:

create table t1(a int);
create table t2(b int);
create index my_index on t1 using btree (a);
insert into t1 select generate_series(1, 100000) from generate_series(1, 3);
insert into t2 select generate_series(1, 100) from generate_series(1, 10);
analyze t1;
analyze t2;

And so what Yan was essentially seeing here was that the information about how distinct the results were, was not being propagated upwards.

Let's take a quick look at that, just to illustrate the difference between 16 and 17.

So I have 16 and 17 running in a container here. I'm just going to run these steps on both of them. And then if I run the given example, I see two different plans:

explain with my_cte as materialized (select b from t2) select *
from t1 where t1.a in (select b from my_cte);

On 16, what we'll see is the 200 rows estimate that was essentially the concern:

                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Nested Loop  (cost=37.92..2667.98 rows=3020 width=4)
   CTE my_cte
     ->  Seq Scan on t2  (cost=0.00..15.00 rows=1000 width=4)
   ->  HashAggregate  (cost=22.50..24.50 rows=200 width=4)
         Group Key: my_cte.b
         ->  CTE Scan on my_cte  (cost=0.00..20.00 rows=1000 width=4)
   ->  Index Only Scan using my_index on t1  (cost=0.42..13.11 rows=3 width=4)
         Index Cond: (a = my_cte.b)
(8 rows)

And then if we run the example on 17, I'll see that first of all, the query plan has changed, which in this case is expected, because Postgres suddenly got different statistics:

                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Merge Join  (cost=42.25..58.70 rows=307 width=4)
   Merge Cond: (t1.a = my_cte.b)
   CTE my_cte
     ->  Seq Scan on t2  (cost=0.00..15.00 rows=1000 width=4)
   ->  Index Only Scan using my_index on t1  (cost=0.42..12676.67 rows=300000 width=4)
   ->  Sort  (cost=26.82..27.07 rows=100 width=4)
         Sort Key: my_cte.b
         ->  HashAggregate  (cost=22.50..23.50 rows=100 width=4)
               Group Key: my_cte.b
               ->  CTE Scan on my_cte  (cost=0.00..20.00 rows=1000 width=4)
(10 rows)

And really the root cause here is this rows 100 estimate:

         ->  HashAggregate  (cost=22.50..23.50 rows=100 width=4)
               Group Key: my_cte.b
               ->  CTE Scan on my_cte  (cost=0.00..20.00 rows=1000 width=4)

Again, the reason that things changed here is because you have this HashAggregate node sitting on the outside. It's not the row estimate from the CTE itself that has changed, it's the information about the columns that are being returned from the CTE, where previously Postgres essentially had no information and now, thanks to the work by Jian Guo and Tom Lane, this is now improved.

Sorted output from Materialized CTEs

Let's come back to pathkeys. Path keys are information that are attached to a path, aka the early stage of a plan node in Postgres, and they help the planner understand the sort order of that particular path, of that particular part of the plan.

And so previously, if Postgres was materializing a CTE, and then you had another node using the result of it, even if the materialization caused a certain sort order to be guaranteed either because it was an index scan with a sort order or there was an explicit sort node, Postgres wouldn't be aware of that and wouldn't be able to reuse the previously guaranteed sort order, unnecessarily adding a new Sort node or potentially using different join methods, for example, a Merge Join would be less likely if you didn't have a particular order guaranteed.

This patch by Richard Guo, which was committed by Tom Lane, fixes that particular problem. Let's take a look at an example discussed here. So, first of all, you can see that there is a CTE with the MATERIALIZED keyword, and it's essentially getting some unique data from one of the Postgres regression test tables, and then joins it against itself. But again, it passes through that optimization fence enforced by a CTE:

explain with x as materialized (select unique1 from tenk1 b order by unique1)
select count(*) from tenk1 a
  where unique1 in (select * from x);

And I'm going to illustrate this locally, where I've loaded the test data from the Postgres regression test database, similar to how we've previously discussed it in episode 108 of 5mins of Postgres:

CREATE TABLE tenk1 (
        unique1         int4,
        unique2         int4,
        two                     int4,
        four            int4,
        ten                     int4,
        twenty          int4,
        hundred         int4,
        thousand        int4,
        twothousand     int4,
        fivethous       int4,
        tenthous        int4,
        odd                     int4,
        even            int4,
        stringu1        name,
        stringu2        name,
        string4         name
);
CREATE INDEX tenk1_unique1 ON tenk1 USING btree(unique1 int4_ops);
\copy tenk1 FROM 'postgres/src/test/regress/data/tenk.data';
VACUUM ANALYZE tenk1;

You can see in 16, we have a Nested Loop Join:

                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Aggregate  (cost=764.29..764.30 rows=1 width=8)
   CTE x
     ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..306.29 rows=10000 width=4)
   ->  Nested Loop  (cost=225.28..445.50 rows=5000 width=0)
         ->  HashAggregate  (cost=225.00..227.00 rows=200 width=4)
               Group Key: x.unique1
               ->  CTE Scan on x  (cost=0.00..200.00 rows=10000 width=4)
         ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..1.08 rows=1 width=4)
               Index Cond: (unique1 = x.unique1)
(9 rows)

Versus in 17 we have a Merge Join:

                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=987.55..987.56 rows=1 width=8)
   CTE x
     ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..306.29 rows=10000 width=4)
   ->  Merge Semi Join  (cost=0.31..656.26 rows=10000 width=0)
         Merge Cond: (a.unique1 = x.unique1)
         ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..306.29 rows=10000 width=4)
         ->  CTE Scan on x  (cost=0.00..200.00 rows=10000 width=4)
(7 rows)

And really the difference is, on 16, Postgres wasn't able to understand that this CTE scan had a particular sort order. So instead of adding an additional Sort node, which would have made things more expensive than a Nested Loop, it used the Nested Loop. Versus on 17, we can see that because the Index Only Scan had a particular order already, we can reuse this order and therefore use a Merge Join.

On that same example, Tom Lane pointed out something important, which is, if we take out, in this example, the "ORDER BY unique 1", which is explicitly added into the CTE, then it doesn't work as expected:

explain with x as materialized (select unique1 from tenk1 b)
select count(*) from tenk1 a
  where unique1 in (select * from x);

On Postgres 16, we'll see the exact same plan, so this doesn't make a difference:

                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Aggregate  (cost=764.29..764.30 rows=1 width=8)
   CTE x
     ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..306.29 rows=10000 width=4)
   ->  Nested Loop  (cost=225.28..445.50 rows=5000 width=0)
         ->  HashAggregate  (cost=225.00..227.00 rows=200 width=4)
               Group Key: x.unique1
               ->  CTE Scan on x  (cost=0.00..200.00 rows=10000 width=4)
         ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..1.08 rows=1 width=4)
               Index Cond: (unique1 = x.unique1)
(9 rows)

Versus in 17, we can see that it does choose a Hash Join now instead of a Merge Join:

                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=1100.07..1100.08 rows=1 width=8)
   CTE x
     ->  Index Only Scan using tenk1_unique1 on tenk1 b  (cost=0.29..306.29 rows=10000 width=4)
   ->  Hash Semi Join  (cost=325.29..768.79 rows=10000 width=0)
         Hash Cond: (a.unique1 = x.unique1)
         ->  Index Only Scan using tenk1_unique1 on tenk1 a  (cost=0.29..306.29 rows=10000 width=4)
         ->  Hash  (cost=200.00..200.00 rows=10000 width=4)
               ->  CTE Scan on x  (cost=0.00..200.00 rows=10000 width=4)
(8 rows)

Presumably because of the other improvements on the CTEs surfacing columns statistics, it still doesn't choose the Nested Loop here, but it does choose a Hash Join instead of a Merge Join.

And the root cause of that issue is because the Index Only Scan doesn't know that the parent needs it in a particular order, and thus doesn't produce the right set of pathkeys for the parent, despite it being the same index being scanned. This may be revised in future releases, but for Postgres 17 this is a known limitation.

In conclusion

If you have queries involving explicit CTE scans, you will likely see better query plans in Postgres 17.

However, if you're trying to get the benefit of sorted output from a materialized CTE, and you're not seeing the expected plan on 17, try adding an explicit ORDER BY statement so that Postgres records the path keys that are necessary for this to work successfully behind the scenes.

I hope you learned something new from E118 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres


Enjoy blog posts like this?

Get them once a month to your inbox