Postgres in Production, Special Series: Deep Dive into pg_stat_statements (Part 2)

In Part 2 of this special "Postgres in Production" deep dive series, Ryan Booz walks through what actually makes an entry in pg_stat_statements "unique", why uniqueness is structural rather than textual, and how a few common patterns (ORMs, dynamic SQL, multi-database setups) can quietly fill up the in-memory hash table and cost you visibility into the queries that matter most.




Transcript

A quick recap

In the first episode, we looked at the in-memory hash table that backs pg_stat_statements and the columns it exposes — about 45 of them in Postgres 18, with more likely to come in future versions. In this episode, we dig into what actually makes an entry in that table "unique", and why getting that wrong can quietly cost you visibility into your most important queries.

What makes an entry unique

Each entry in pg_stat_statements is keyed on four things:

  • The role that ran the query
  • The database it ran in
  • The queryid, a hash derived from the structure of the query
  • Whether the query was top-level

That last one deserves a brief explanation, because it's controlled by a setting that's easy to miss.

A note on top-level queries

By default, pg_stat_statements.track is set to top. A top-level query is the one you initiated directly from a session or transaction. For example, running SELECT title FROM ... from your application is a top-level query.

If that query in turn triggers other queries, typically by calling a function or stored procedure, those internal queries are not top-level. They were initiated by Postgres on your behalf, not by you.

With the default setting, you'll see the function call itself in pg_stat_statements, but not the queries it runs internally. If you change the setting to all, those internal statements get tracked too. The catch: there is no column anywhere in pg_stat_statements that ties an internal statement back to the parent that called it. You can see the metrics, but you can't reconstruct the call chain. As of Postgres 18, that's a known limitation, and nothing on the horizon for 19 or beyond changes it.
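
If you want to experiment with this, the tracking level can be changed with a configuration reload (assuming pg_stat_statements is already loaded via shared_preload_libraries), and since Postgres 14 the view exposes a toplevel column you can filter on — a quick sketch:

-- Track statements inside functions and procedures as well
ALTER SYSTEM SET pg_stat_statements.track = 'all';
SELECT pg_reload_conf();

-- Nested statements now show up with toplevel = false
SELECT toplevel, calls, query
FROM pg_stat_statements
ORDER BY toplevel, calls DESC;

Keep in mind that even with track = 'all', the toplevel column only tells you that a statement was nested — not which parent statement called it.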

How quickly entries multiply

Even a simple workload can produce a surprising number of entries. Imagine a single query, SELECT title FROM film WHERE id = ?. Now imagine running it across:

  • Three databases on the same server (for example, one per customer, all with the same schema)
  • A handful of roles per database

You'll get one entry per (role, database, queryid) combination. With three databases and a mix of human and application roles, it only takes a handful of those combinations to land at five entries from a single logical query. Change the select list to SELECT title, rating FROM film WHERE id = ? and every one of those combinations gets a brand-new queryid, doubling the count to ten.
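
You can see this multiplication directly in the view, since it exposes the userid and dbid that form part of the key. A query along these lines (joining to pg_database for readable names) groups the duplicated entries back together:

SELECT s.userid::regrole AS role,
       d.datname         AS database,
       s.queryid,
       s.calls
FROM pg_stat_statements s
JOIN pg_database d ON d.oid = s.dbid
ORDER BY s.queryid, role, database;

Rows sharing a queryid but differing in role or database are the same logical query, counted separately.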

Now scale that up across an ORM that emits dozens of variants of every query, and you can see how the entry count climbs fast.

This matters because pg_stat_statements has a hard ceiling. The pg_stat_statements.max setting controls how many unique entries it will track at once, and the default is 5,000. You can raise or lower it, but at any moment in time, that's the upper limit. If you go over, something has to be evicted.
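
You can check both the ceiling and how close you are to it. Note that pg_stat_statements.max itself can only be changed with a server restart:

SHOW pg_stat_statements.max;             -- default: 5000
SELECT count(*) FROM pg_stat_statements; -- entries currently tracked

-- Raising the ceiling requires a restart, not just a reload:
ALTER SYSTEM SET pg_stat_statements.max = 10000;

If the count sits at or near the maximum for long stretches, evictions are almost certainly happening.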

Uniqueness is structural, not textual

This is the part most people get wrong, myself included. The instinctive assumption is that uniqueness is based on the query text, what you wrote or what your ORM generated. It's not. It's based on the query's structure.

Postgres parses the query, walks the parse tree, strips out things that don't affect the structure (like literal values), serializes what's left, and hashes the result. That hash is the queryid.

The internal name for this process is jumbling, handled by a function called JumbleQuery(). You'll hear the term in mailing list discussions and in the source code, so it's worth knowing.

The practical upshot: this query, run with three different parameter values:

SELECT title FROM film WHERE id = 1;
SELECT title FROM film WHERE id = 2;
SELECT title FROM film WHERE id = 3;

produces a single entry in pg_stat_statements. You could run it a thousand times with a thousand different IDs and you'd still get one entry.
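
You can verify this yourself: after running the three statements above, the view shows a single row, with the literal replaced by a placeholder in the stored text:

SELECT queryid, calls, query
FROM pg_stat_statements
WHERE query LIKE '%film%';

-- One row, calls = 3, with the query text normalized to:
--   SELECT title FROM film WHERE id = $1

The queryid value itself will differ between systems and Postgres versions, which is a theme we'll come back to.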

What gets normalized away

Anything that doesn't change the shape of the query gets normalized:

  • Literals and constants (numbers, strings, dates)
  • Parameter values
  • Whitespace and formatting
  • Comments

That last one trips people up. A common pattern is to inject comments into queries to track which application function generated them, something like /* user_service.fetch_profile */ SELECT .... Comments are stripped during jumbling, so two structurally identical queries with different comments collapse into one entry.

There's a subtle quirk here: when an entry is first created, pg_stat_statements stores the exact text of the first execution it saw. Every subsequent execution with the same queryid contributes its metrics to that entry, but the stored text doesn't change. So if your first execution happened to carry one comment and the next thousand carried different ones, the view will only ever show you that first comment. The metrics are accurate; the text is misleading.
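
A quick sketch that shows both behaviors at once — run the same statement with two different comments, then check the view:

/* user_service.fetch_profile */ SELECT title FROM film WHERE id = 1;
/* admin_service.audit */        SELECT title FROM film WHERE id = 2;

-- One entry with calls = 2, and the stored text keeps
-- whichever comment happened to arrive first:
SELECT calls, query FROM pg_stat_statements WHERE query LIKE '%film%';

So the comment you see in the view tells you about exactly one execution: the first.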

Why SQL comment tags can disappear

The same thing applies to SQLcommenter-style tags, which ORMs and observability tools often use to attach metadata (route, controller, feature flag) to a query. They're just comments, so jumbling treats them the same way.

I see this come up regularly: a customer is staring at an OpenTelemetry span with a SQLcommenter tag, trying to find the matching execution in pg_stat_statements, and they can't. The tag they're looking for isn't in the stored text, because some other execution (with a different tag, or no tag at all) got there first.

What doesn't get normalized

Anything that affects the structure of the query does affect the queryid:

  • The number of columns selected
  • The number of items in an IN list
  • ORDER BY clauses
  • JOIN structure
  • The number of WHERE conditions

Take a long, complex query with multiple CTEs. If the only thing that changes between executions is the set of WHERE conditions at the end, each variant gets its own entry.
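
For example, these two executions are the "same" query from the application's point of view, but they differ in structure, so each gets its own queryid and its own row:

SELECT title FROM film WHERE rating = 'PG';
SELECT title FROM film WHERE rating = 'PG' AND length > 90;
-- Two entries: the extra WHERE condition changes the parse tree.
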

The IN-clause explosion (and how Postgres 18 fixes it)

This one is worth singling out, because it's a common foot-gun and it explains a lot of "why is my pg_stat_statements full?" investigations.

In Postgres 17 and below, every distinct length of an IN list produces a unique queryid:

SELECT * FROM film WHERE id IN (1);
SELECT * FROM film WHERE id IN (1, 2);
SELECT * FROM film WHERE id IN (1, 2, 3);
-- ... and so on

If your ORM cheerfully sends IN lists of every possible length, you can blow through hundreds of entries from a single logical query.

Postgres 18 finally fixes this. During jumbling, a list of constants is now squashed down, so WHERE id IN (1, 2, 3) is structurally the same query regardless of how many values are in the list. The result: one entry in pg_stat_statements no matter how many values you pass. If you've ever stared at a pg_stat_statements view that was 90% IN-clause variants, this alone is a strong reason to look at upgrading.
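
On Postgres 17 and below, you can get a rough sense of how much of your entry budget IN-list variants are eating with a heuristic like this, matching on the normalized query text:

SELECT count(*) AS in_list_variants
FROM pg_stat_statements
WHERE query ~ 'IN \(\$1';

This is only an approximation (it matches any IN list that starts with a parameter placeholder), but a large number here is a good hint that list-length explosion is your problem.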

Structure vs. semantics

It's tempting to assume that semantically equivalent queries get the same queryid. They don't. pg_stat_statements cares about structure, not meaning.

These two queries produce identical query plans:

SELECT * FROM film WHERE film_id = 557;
SELECT * FROM film WHERE 557 = film_id;

But they're structurally different, so they're two entries. The same goes for:

SELECT * FROM film WHERE film_id IN (1, 2, 3);
SELECT * FROM film WHERE film_id = ANY('{1, 2, 3}');

Logically and operationally identical. Two entries.

What happens when pg_stat_statements fills up

This is the reason any of this matters. When the number of tracked entries hits pg_stat_statements.max, the next new query coming along has to displace something. Postgres deallocates existing entries — roughly the least-used ones — and the metrics for those queries are gone until they run again and fresh entries get created.

Raising pg_stat_statements.max helps, but it's not a fix. If your workload is generating thousands of unintentionally unique queries, you'll fill up whatever ceiling you set. The better move is to understand where the structural variation is coming from and reduce it at the source.
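
Since Postgres 14, the companion pg_stat_statements_info view makes this silent eviction visible: its dealloc counter records how many times entries have been kicked out since the last stats reset.

SELECT dealloc, stats_reset FROM pg_stat_statements_info;

A steadily climbing dealloc is your signal that queryid cardinality, not the ceiling, is the real problem.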

Key takeaways

  • High cardinality is the enemy. Every unique queryid eats a slot, and slots are finite.
  • ORMs are a major source of cardinality. Different argument shapes, optional clauses, and dynamic column lists all create new queryids.
  • Dynamic SQL has the same problem. If you build queries by string concatenation, every variation is its own entry.
  • Multi-database and multi-role setups multiply everything. pg_stat_statements is per-instance, but its key includes role and database.
  • Eviction is silent. You won't get an error or a warning. Entries just disappear, and so do their metrics.

What's coming next

Now that we've covered what makes an entry unique, the next episode looks at where the query text itself is actually stored. Quick spoiler: it's not in the in-memory hash table, and the way Postgres handles it has real consequences for how reliably you can identify queries after the fact. We'll dig into that in Part 3.

I hope this series helps you better understand one of the most essential tools in the Postgres ecosystem. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn to get updates about new episodes!
