Connect Claude Code, Codex & Cursor to your Postgres statistics: pganalyze MCP Server is now in Public Preview

Postgres in Production, Special Series: Deep Dive into pg_stat_statements (Part 4)

In Part 4 of this special "Postgres in Production" deep dive series, Ryan Booz takes you into the pg_stat_statements source code to show how it decides what to keep when the hash table fills up. There's a per-query counter you never see in the view, eviction sorts every entry on every deallocation, only completed executions get stored at all, and a higher pg_stat_statements.max is not free.



Share this episode: Click here to share this episode on LinkedIn. Feel free to sign up for our newsletter and subscribe to our YouTube channel.


Transcript

A quick recap

In Part 1 we looked at the in-memory hash table that backs pg_stat_statements and the metrics it tracks. In Part 2 we dug into what makes an entry "unique" — the queryid, derived from the structural shape of the query. And in Part 3 we walked through where the query text actually lives: a separate file on disk, sized at 2× the mean query length × pg_stat_statements.max.

This episode is honestly the one I'm most excited about. We're going to go look at the source code itself — the structs, the functions, the workflow — and answer the question that's been lurking under everything so far: when the table is full and a new query comes in, what gets kept and what gets dropped?

I'll warn you upfront that this episode is a little longer than the others. There's a lot to cover. The good news is the actual source is small: take out the comments and pg_stat_statements is only about 1,800 lines of code. Most of that is defensive — making sure the right things get stored at the right times. The story underneath is honestly pretty simple, once you can see it.

What the source code reveals

When you look at the C structs in the source, you start to see things we've already discussed. The hash table key is what you'd expect: user ID, database ID, queryid, and the top-level boolean that says whether the query was a top-level statement or executed inside a function.

The pgssEntry struct holds a counters array. That's where the metrics live — number of calls, total time, mean time, JIT counters, buffers read, all of it. From Part 3 you'll also see the offset and length pointing into the external query text file.

Then there's a shared state struct with a handful of variables that the workers need across processes. One of those, mean_query_length, is going to matter a lot in a minute.

The three values that drive allocation

As I've spent more time in this code over the last year, I've come up with three things that I think are really necessary to understand about how pg_stat_statements decides what to keep. There's more going on than just these three, but if you understand these, the rest of the workflow makes sense.

1. pg_stat_statements.max

We've talked about this throughout the series and it works exactly as you'd expect. When a brand-new query comes in and we've already reached max entries, pg_stat_statements has no other path — it has to deallocate something to make room, and may also have to truncate the external text file before it can store the new entry.

2. The hidden usage counter

This is the one most people don't know about. Every hash table entry has a per-query usage value stored alongside the regular metrics. It's not returned in the pg_stat_statements view. You never see it directly. It exists solely to help decide which entries get dropped when deallocation has to run.

It's incremented by one every time a query finishes executing. Interestingly, the function that does the incrementing is called USAGE_EXEC() and it takes total time as a parameter — but if you read the code, the value being added is statically typed to 1. The total time goes nowhere. There's actually a note in the source asking whether this is the right approach, or whether time or buffer usage would be a better proxy for "which queries do we want to keep around longest?" That discussion has been ongoing in the community for a while.

The mechanics during deallocation are:

  1. Decrease every entry's usage by 1%.
  2. Sort all entries by usage.
  3. Drop the bottom 5%.
  4. Increment the deallocation counter.

One interesting observation: while pg_stat_statements is filling up and no deallocation has happened yet, calls and usage will be the same number for every entry — they're both going up by one per execution. They only diverge after the first deallocation.

3. mean_query_length

This is the mean length of all the query texts currently in the hash table. It's only updated during deallocation, right before the bottom 5% gets dropped. Once it's recomputed, pg_stat_statements uses it to decide whether the external text file is now more than twice the expected length. If it is, the file gets rewritten (the "garbage collection" step we'll cover below).

Only completed executions get stored

Before we go further, one thing worth being absolutely clear on: pg_stat_statements only stores metrics for completed executions. It's called from the ExecutorEnd hook, after planning and after everything else has run.

That means no aborted or timed-out query metrics are stored. If you have a statement_timeout of 30 seconds and a query hits that timeout after reading hundreds of thousands of buffers, those buffers don't show up in pg_stat_statements at all. The query effectively didn't happen, as far as this extension is concerned.

I want to flag this directly because if you ask an AI about pg_stat_statements in mid-2026 it will often tell you the opposite. I queried this recently and got confidently wrong answers. The source is unambiguous: nothing in the code stores metrics for timed-out queries. This is also one of the problems the community is actively trying to solve — how do we get visibility into how often a query is timing out, and how much work it did before it died? Right now, we don't.

The three functions that do the work

When you actually read the code, there are three primary functions for storing data:

  • pgss_store() — every completed execution goes through this. It looks up the hash, and if it's there, just updates the counters. Done.
  • entry_alloc() — called when the query is new and the entry needs to be allocated in the hash table.
  • entry_dealloc() — called from inside entry_alloc when the table is full and we need to make room.

Two secondary functions handle the external query text file:

  • need_gc_qtexts() — checks whether the file is more than 2× expected size and flags that garbage collection is needed.
  • gc_qtexts() — does the actual garbage collection: rewrites the file with the texts of whatever's still in the (now smaller) hash table.

The fast path: when the entry already exists

This is the path you want most of your workload to be on. pgss_store() gets called at the end of execution, looks up the queryid in the hash table, finds it, takes a shared lock, updates the counters, and releases all locks. That's it. It's fast, it's cheap, and many backends can do it concurrently.

If you're in a steady-state workload where pg_stat_statements has captured your common queries and isn't churning, this is what's happening for the vast majority of executions.

The slow path: when the entry doesn't exist

This is where the work piles up. pgss_store() looks up the queryid, doesn't find it, and now we have to do real work under an exclusive lock:

  1. Switch from shared to exclusive lock.
  2. Check whether the external text file needs garbage collection later.
  3. Check whether we've hit pg_stat_statements.max. If yes, deallocate:
    • Decrease all usage values by 1% (or 50% for sticky entries).
    • Sort every entry by usage.
    • Drop the bottom 5%.
    • Increment the deallocation counter.
  4. Call entry_alloc(), which searches one more time before allocating — another backend might have inserted the same queryid in the gap between our locks.
  5. If garbage collection was flagged, load the entire query text file into memory, then overwrite it in place (not a copy + delete — pg_stat_statements actually overwrites the file and truncates the tail with a C truncate call).
  6. Allocate the new entry, initialize its counters, write the text into the file, release the exclusive lock.

The longer this path is and the more often you're on it, the more time your queries spend waiting for the exclusive lock.

Why a larger pg_stat_statements.max is not free

Look at the deallocation step again: every time we deallocate, we sort all entries by usage and drop the bottom 5%. That sort is linear in the number of entries.

If you've bumped max from the default 5,000 to 10,000, 50,000, or 100,000 — which is tempting when you're losing queries to eviction — then every deallocation has to touch and sort that many rows, all under an exclusive lock. You haven't eliminated the deallocation cost; you've made each one more expensive, while still potentially deallocating frequently if your workload's cardinality is high enough.

There's a dance here. A higher max reduces deallocation frequency, but each deallocation costs more. A larger external text file means garbage collection rewrites more data when it triggers. If you only allocate more space without addressing the underlying cardinality, you pay the cost somewhere else.

We'll get into the configuration choices that actually help in the next couple of episodes.

Key takeaways

  • There's a hidden per-query usage counter stored alongside the metrics. It's not in the pg_stat_statements view but it determines which queries get evicted.
  • Eviction sorts the whole table. Every deallocation decreases all usage values by 1%, sorts every entry, and drops the bottom 5%.
  • Only completed executions are stored. Aborted and timed-out queries leave no record in pg_stat_statements — regardless of what an AI tells you.
  • Three primary functions do all the work: pgss_store, entry_alloc, entry_dealloc. The fast path is shared-lock counter updates. The slow path runs deallocation, garbage collection of the external text file, and entry allocation under exclusive lock.
  • A higher pg_stat_statements.max is not free. The deallocation sort is linear in max, so a bigger table means each eviction touches more rows under exclusive lock.

What's coming next

Now that we've seen how pg_stat_statements decides what to keep, the last couple of episodes turn to what you can actually do about it. We'll look at the configuration that matters most, what high-cardinality workloads really look like under this allocation/deallocation pressure (and how Postgres 18's improvements to WHERE IN help), and concrete strategies for using pg_stat_statements safely at scale.

I hope this series helps you better understand one of the most essential tools in the Postgres ecosystem. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn to get updates about new episodes!

What we discussed in this episode


Enjoy blog posts like this?

Get them once a month to your inbox