Postgres Planner Quirks: How to fix bad JSONB selectivity estimates

In today’s E115 of “5mins of Postgres” we discuss a common challenge when using equality and contains operators for querying JSONB columns, and why the Postgres planner will often get the estimates wrong for such queries, causing bad query plans.





Transcript

We'll start with this blog post by Eric Fritz on the Render blog.

The problem: A slow query on a JSONB field

This is a blog post from earlier this week in which Eric talks about an issue they ran into in their deployment pipeline. They narrowed the problem down to this particular query:

SELECT e.id
FROM events e
JOIN postgres_dbs db ON (e.data ->> 'serviceId') = db.database_id
LIMIT 1;

The "e.data" column is a JSONB column, and they're extracting a value here, "serviceId", and then they're comparing it against a different table. And so what they were seeing is that this query here which was expected to be fast, took an extremely long time and they were kind of puzzled why that was happening.

I want to point out two things here:

  1. As they note, this query is bounded by a LIMIT 1 - this has a big influence on which query plans get chosen
  2. They do have a multi-column B-tree index that covers the e.data ->> 'serviceId' expression (see the sketch below)
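
Based on the index name that shows up in the query plan below (events_service_id_timestamp_idx), that index presumably looks roughly like this - a sketch, not the definition from the blog post, and the trailing "timestamp" column is an assumption:

-- Hypothetical reconstruction of the expression index implied by the plan;
-- the second column is assumed based on the index name
CREATE INDEX events_service_id_timestamp_idx
ON events ((data ->> 'serviceId'), "timestamp");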

Additionally, for context, there are two tables involved here: their "postgres_dbs" table and their "events" table. "postgres_dbs" has about 100,000 rows, "events" has about 90 million rows. The number of events that relate to the Postgres databases is quite small - just 0.01% of all events, or on the order of 9,000 rows.

Eric referred to this as a needle in a haystack. Only a small portion of the rows are actually necessary.

EXPLAIN shows an unexpected Merge Join

They ran EXPLAIN, and at first glance the result looks potentially reasonable - two index scans:

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.98..4.39 rows=1 width=25)
   ->  Merge Join  (cost=0.98..306608988.86 rows=90103848 width=25)
         Merge Cond: ((events.data ->> 'serviceId'::text) = postgres_dbs.database_id)
         ->  Index Scan using events_service_id_timestamp_idx on events  (cost=0.57..305056170.48 rows=90103848 width=192)
         ->  Index Only Scan using postgres_dbs_database_id_key on postgres_dbs  (cost=0.42..200755.06 rows=101119 width=26)
(5 rows)

But you'll note that these index scans don't have any index conditions. That means Postgres is using the indexes here only to return rows in sorted order; the actual filtering happens in the merge condition, which is obviously quite expensive. You can also see why the planner liked this plan: the Limit node's cost estimate (0.98..4.39) shows it expects to find a matching row almost immediately and stop.

Now, the other thing they note is that the index the merge join reads is sorted on serviceId, and a lot of entries sort ahead of the ones they're looking for. So even though the Merge Join can terminate early once the join condition matches, it first has to work through a lot of unrelated events - despite the LIMIT.

Adding additional WHERE conditions to limit the Index Scan

The first thing they put in place: if we're getting too many unnecessary rows, can we just skip them? So they added an explicit WHERE condition that only looks at "serviceId" values sorting at or after the 'dpg-' prefix used by their Postgres databases, which works in their particular situation:

SELECT e.id
FROM events e
JOIN postgres_dbs db ON (e.data ->> 'serviceId') = db.database_id
WHERE (e.data ->> 'serviceId') >= 'dpg-'
LIMIT 1;

Now, this was actually much faster (1.47ms), but what's interesting is that the query plan was still a merge join.

The only difference was that there was now an index condition on the events index scan, which essentially reduced the amount of data returned from that index.

They then did some testing without the LIMIT, just to understand how Postgres behaves when fetching all of the matching data. What they ended up doing was refining the WHERE condition further, to only search the prefix range of events related to Postgres. This gave quite reasonable performance for the full search, and then of course also with the LIMIT.
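
For illustration, such a bounded prefix range could look roughly like this - a sketch, not the exact query from the blog post, and the upper bound assumes an ordering where '.' sorts directly after '-' (true in the C locale):

SELECT e.id
FROM events e
JOIN postgres_dbs db ON (e.data ->> 'serviceId') = db.database_id
WHERE (e.data ->> 'serviceId') >= 'dpg-'
  AND (e.data ->> 'serviceId') < 'dpg.'  -- assumed upper bound for the 'dpg-' prefix
LIMIT 1;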

If you have more information about the problem domain, you can add additional WHERE conditions to help Postgres find a better query plan.

Now, Eric said, "we could have declared success here". If they had only wanted to fix a slow query, they would have been done.

The root cause: A bad selectivity estimate

But they actually dove deeper, because they were asking: shouldn't Postgres have been smarter here - why did we need to add this WHERE condition at all?

This essentially comes back to "pg_stats" and the information Postgres collects as planner statistics. If you have a JSONB column - in this case the data column - Postgres does collect samples from that column, including most common values and histogram bounds for the column as a whole:

SELECT * FROM pg_stats WHERE tablename = 'events' AND attname = 'data';
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------
tablename         | events
attname           | data
null_frac         | 0
n_distinct        | -0.22765513
most_common_vals  | {
                  |   {"reason": {"nonZeroExit": 1}, "serviceId": "srv-bpo3bedifurfbl64o7tg"},
                  |   {"reason": {"nonZeroExit": 1}, "serviceId": "srv-bpo329difurfbl64o720"},
                  |   ...
                  | }
most_common_freqs | { 0.002282, 0.002010, ... }
histogram_bounds  | {
                  |   {"serviceId": "crn-bhsokgffhqe91l7vg0n0", "cronJobRunId": "crn-bhsokgffhqe91l7vg0n0-1551537300"},
                  |   {"serviceId": "crn-bhsokgffhqe91l7vg0n0", "cronJobRunId": "crn-bhsokgffhqe91l7vg0n0-1552037820"},
                  |   ...
                  | }

However, these statistics are collected on the whole JSONB value, not on the extracted data ->> 'serviceId' expression the query actually filters on. And because most of their events are unrelated to the Postgres databases they're tracking, the values they care about barely register in the most common values and histogram bounds. In short, despite these statistics existing, Postgres wasn't able to use them effectively in its planning.

The solution: Extended Statistics on Expressions

Now, what they ended up doing here - and I think this is essentially the most universal fix for this type of problem - is using extended statistics on expressions. You'll remember that last week we talked about a small security issue in that code that was recently fixed.

Extended statistics on expressions are a feature that was added in Postgres 14. Here, for example, we can tell Postgres: don't just collect statistics on the data column itself, also collect statistics on data ->> 'serviceId':

CREATE STATISTICS IF NOT EXISTS events_service_id
ON (data ->> 'serviceId'::text)
FROM events;
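
Note that the statistics object starts out empty - it only gets populated the next time the table is analyzed, either manually or by autovacuum. For example:

ANALYZE events;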

And that way, after running ANALYZE, Postgres now actually remembers more information about that particular expression:

SELECT * FROM pg_stats_ext_exprs;
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------
tablename         | events
statistics_name   | events_service_id
expr              | (data ->> 'serviceId'::text)
null_frac         | 0.0069796667
n_distinct        | 76228
most_common_vals  | {
                  |   crn-c7u453rs13jp86lgjg3g, crn-cf0rnjdv55c4kpmol3dg, crn-cf0aihdv55c708f13a1g, crn-cf0oell2qhu025k7u4p0,
                  |   ...,
                  |   srv-cf06b5u2ofe582arfj5g, srv-cevkofq254ubij4mgelg, srv-cevlocdv55cfcfun2p8g, srv-bptpnraps24ffc87hf5g
                  | }
most_common_freqs | {
                  |   0.0222010000, 0.0119753330, 0.0119540000, 0.0119480000,
                  |   ....,
                  |   0.0009033333, 0.0009000000, 0.0008976667, 0.0008960000
                  | }
histogram_bounds  | {
                  |   crn-bhm5trt334m912f601i0, crn-biv8696bo897vckjtlpg, crn-buncspffjtv9715thc00, crn-c1od9kegv2c20j2l50b0,
                  |   ...,
                  |   srv-cnj2njac7pgfeu8rc7q0, srv-cnj40mf88oc3nmb3g88g, srv-cnj4mlv88oc74520ihgg, srv-cogpale28t4943ck9b7g
                  | }

This information now helps Postgres arrive at a good query plan, without having to add that extra WHERE condition.

What if you have a contains (@>) operator?

However, you don't always have equality comparisons. For example, in this 2020 blog post, Vsevolod Solovyov talks about pitfalls with JSONB indexes using a slightly different kind of query - one that asks whether a JSONB field contains a particular value:

SELECT id FROM articles WHERE origin_ids @> '{"pubmed": "123456"}';

It's very common that you would use the contains operator to do a search on your table, and have a GIN index with the jsonb_path_ops operator class. That's actually a good pattern if you have unstructured data.
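
For reference, such an index might be defined roughly like this (a sketch - the index name is hypothetical):

CREATE INDEX articles_origin_ids_idx
ON articles USING gin (origin_ids jsonb_path_ops);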

However, what Vsevolod was seeing is that once a query combined many of these containment checks with OR - something containing this publication ID, or that one, or another - Postgres at some point incorrectly chose a sequential scan:

EXPLAIN ANALYZE
 SELECT id, origin_ids
   FROM articles
  WHERE origin_ids @> '{"pubmed": "123456"}' OR
        origin_ids @> '{"pubmed": "654321"}' OR
        ....;   -- x200
                        QUERY PLAN
------------------------------------------------------------
 Seq Scan on articles  (rows=7805036) (actual rows=200)
   Filter: ((origin_ids @> '{"pubmed": "123456"}') OR
            (origin_ids @> '{"pubmed": "654321"}') OR   ...x200)

And here Postgres clearly mis-estimated: it thought it would get 7.8 million rows back, but it only got 200 rows back.

Now, Vsevolod did some digging and found the "contsel" function (Postgres source), which Postgres used for this selectivity estimate, and which returns a fixed selectivity of 0.1%:

/*
 *	contsel -- How likely is a box to contain (be contained by) a given box?
 *
 * This is a tighter constraint than "overlap", so produce a smaller
 * estimate than areasel does.
 */

Datum
contsel(PG_FUNCTION_ARGS)
{
	PG_RETURN_FLOAT8(0.001);
}

So that means for a table of 43 million rows, each individual contains operator is estimated to return 43,000 rows. Multiply that out across the 200 OR'd conditions, accounting for overlaps, and Postgres ultimately comes up with that 7.8 million row estimate.
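
To spell out that math (a back-of-the-envelope reconstruction, assuming a 43 million row table and 200 OR'd clauses): the planner combines OR'd selectivities as

1 - (1 - 0.001)^200 ≈ 0.181

and 18.1% of 43 million rows is roughly 7.8 million - matching the rows=7805036 estimate in the plan above.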

Now, this used to be a very common issue on older Postgres versions. For example, Adrien Nayrat wrote a "JSONB and statistics" blog post back in 2017, where he talked about this exact problem - and he, too, ultimately ended up finding the "contsel" function.

Postgres 13+: Less-silly selectivity for JSONB matching operators

A few years ago, in Postgres 13, Postgres did actually get a little bit smarter here.

Back in 2020, Tom Lane started a mailing list thread on "less silly selectivity estimates for JSONB matching operators", which resulted in a commit in Postgres 13 that improves selectivity estimation for certain assorted match-style operators - specifically including the contains operator, which, as Tom notes, previously used contsel, and that was oftentimes not good.

Now there are two important changes here:

  1. The contains operator now does take most common values and histograms into account. So if you have a high enough statistics target, the operator can actually give you reasonable estimates for JSONB columns now (see the example after this list).

  2. The fallback selectivity was made less selective, because, as Tom noted, 0.1% is actually kind of a silly estimate: these contains operators are oftentimes less selective than an equality comparison.
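
For example, to give the planner a larger most-common-values list to work with, you could raise the per-column statistics target - this reuses Vsevolod's articles table, and the value 1000 is just an illustration:

ALTER TABLE articles ALTER COLUMN origin_ids SET STATISTICS 1000;
ANALYZE articles;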

However, if you find yourself in a situation where the fallback value is used, issues such as the one Vsevolod ran into - a suddenly very high row estimate - can actually be even worse on newer Postgres versions, because the fallback selectivity is now 1% instead of 0.1%.

Rewriting JSONB queries to check equality, instead of containment

And so I want to mention one last thing, which is what Adrien ended up doing in his blog post back in 2017: he changed from a contains operator to an equality operator.

Note that Adrien used functions back then to allow collection of statistics, but that is no longer needed on Postgres 14+, where you can use extended statistics on expressions instead.

If you're querying the same field, you can see a significant performance improvement by changing a query like:

explain (analyze,buffers)  select * from json_stack
      where json @>  '{"age":27}'::jsonb;
                      QUERY PLAN
---------------------------------------------------------------------
 Bitmap Heap Scan on json_stack
        (cost=286.95..33866.98 rows=33283 width=1011)
        (actual time=748.968..5720.628 rows=804630 loops=1)
   Recheck Cond: (json @> '{"age": 27}'::jsonb)
   Rows Removed by Index Recheck: 14
   Heap Blocks: exact=735531
   Buffers: shared hit=123417 read=780542
   I/O Timings: read=1550.124
   ->  Bitmap Index Scan on json_stack_json_idx
        (cost=0.00..278.62 rows=33283 width=0)
        (actual time=545.553..545.553 rows=804644 loops=1)
         Index Cond: (json @> '{"age": 27}'::jsonb)
         Buffers: shared hit=9612 read=5140
         I/O Timings: read=11.265
 Planning time: 0.079 ms
 Execution time: 5796.219 ms
(12 rows)

To instead be something like:

explain (analyze,buffers)   select * from json_stack
      where json->>'age' = '27';
explain (analyze,buffers)   select * from json_stack
      where json_age(json) = '27';
                         QUERY PLAN
------------------------------------------------------------------------
 Index Scan using json_stack_json_age_idx on json_stack
  (cost=0.56..733177.05 rows=799908 width=1011)
  (actual time=0.042..2355.179 rows=804630 loops=1)
   Index Cond: ((json ->> 'age'::text) = '27'::text)
   Buffers: shared read=737720
   I/O Timings: read=1431.275
 Planning time: 0.087 ms
 Execution time: 2410.269 ms

In conclusion

If you find yourself doing a lot of lookups on JSONB columns and you're currently using the contains operator, try instead extracting the specific field, like Eric at Render is doing, and comparing it with equality.

Make sure you have a B-Tree index on that expression, and then also create extended statistics on the expression, so that the Postgres query planner has better information and can come up with a better plan.
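
Putting that together, the recipe looks roughly like this - a sketch using the events table from earlier, with hypothetical index and statistics names:

-- B-tree index on the extracted field, to support equality lookups
CREATE INDEX events_service_id_idx ON events ((data ->> 'serviceId'));

-- Extended statistics on the same expression (Postgres 14+),
-- so the planner gets accurate selectivity estimates
CREATE STATISTICS events_service_id_stats ON (data ->> 'serviceId') FROM events;

-- Populate the new statistics
ANALYZE events;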

I hope you learned something new from E115 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!
