EXPLAIN (ANALYZE, BUFFERS) and interpreting shared buffers hit in Nested Loops

In E83 of “5mins of Postgres” we're talking about using EXPLAIN (ANALYZE, BUFFERS) to optimize Postgres queries, and what to know when interpreting the shared hits counter for Nested Loops.

Share this episode: Click here to share this episode on twitter, or sign up for our newsletter and subscribe to our YouTube channel.


Let's have a look!


We'll start with a blog post by Will Jessop on his personal blog. If you've been using EXPLAIN (ANALYZE, BUFFERS), some of that might seem familiar, but I'll point out a few important things to know at the end that you might not know yet.

Generally speaking, the BUFFERS option for EXPLAIN ANALYZE lets you track what Postgres calls buffer usage. That's telling you what type of data Postgres is reading or writing and which of that data was in cache vs had to be fetched from the operating system, which might be actually from disk or might be from the operating system cache.

The most important thing to know is that there is this terminology of a shared block or shared buffers, which means the shared buffers in Postgres, which is the main memory area that Postgres pre-allocates that caches data from regular tables and indexes.

Local blocks, on the other hand, are temporary tables and indexes, versus temporary blocks is short term working data used in sorts, hashes, or Materialize plan nodes. Usually what you'll find is that shared blocks are the most important data point to look at.

Optimizing slow queries with the BUFFERS option

Let's look at an example that Will has shared here. This is a customer situation, a slow query that that client was experiencing. And so Will went in and used EXPLAIN (ANALYZE, BUFFERS) to optimize the query.

I encounter myself using EXPLAIN (ANALYZE, BUFFERS) when I have a slow query and I want to understand, why is the query slow, why is the query consuming so much I/O?

With the BUFFERS option, we get this additional data point for each of the plan nodes. You can use this to identify which part of the plan is doing the most work in terms of reading the most data. In the particular plan shared by Will, we can see that the Index Scan inside the Nested loop has the most shared buffers hit and the most shared buffers read.

How to interpret buffer counts

If we see, buffers: shared read=3100. What does it actually mean? These buffers are also called blocks in Postgres and they are essentially the fundamental working unit that Postgres uses when it reads data or writes data from your tables. By default, in most Postgres installations, you will find the block size to be 8 kB.

Whether you're going through the cache or the disk, BUFFERS can tell you both data points.With a count like buffers: shared hit=1166 read=3117, we have a portion of the data coming from that shared buffer cache and a portion of the data coming from the disk, or as mentioned previously, from the operating system cache.

In the particular example, Will was able to reduce the original query runtime from 2200 milliseconds down to 6 milliseconds. This is the cold cache runtime that Will was optimizing for. The idea is that when things are not in cache, then that's really where those buffer counts matter a lot because I/O is still the most expensive portion of your queries often. If you want to optimize the cold cache runtime, EXPLAIN (ANALYZE, BUFFERS) is one of the most important tools in your arsenal.

The challenge with shared buffer hits

I do want to point out one more thing that is not well known, but it's definitely a problem when you're looking at shared hit counters like this. Interpreting this number correctly is really important. I'll show you one thing where this number is not correct in the sense that it does not mean that this much data would be fetched when there was a cold cache situation.

Let's switch over to the terminal for that and I'm going to just start up a Postgres server. We'll create a reference table, which is a smaller table. And then we'll create a data table. These tables are pretty small, they're less than 100 megabytes each. The data table has 10 records for each reference record.

CREATE TABLE reference_table (id int PRIMARY KEY, padding text);
INSERT INTO reference_table SELECT i, 'Postgres 16 is awesome' FROM generate_series(1, 100_000) AS _(i);

CREATE TABLE data_table (id int PRIMARY KEY, reference_table_id int);
CREATE INDEX ON data_table(reference_table_id);
INSERT INTO data_table SELECT i, i % 100_000 FROM generate_series(1, 1_000_000) AS _(i);

ANALYZE data_table;
ANALYZE reference_table;

What I'm going to show you is mostly related to Nested Loop joins. I'm actually going to turn off Hash Joins and Merge Joins to illustrate that.

SET enable_hashjoin = off;
SET enable_mergejoin = off;

Now I'm going to do EXPLAIN (ANALYZE, BUFFERS) and I'm just joining those two tables.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM data_table d JOIN reference_table r ON (r.id = reference_table_id);
                                                                     	QUERY PLAN                                                                     	 
 Nested Loop  (cost=0.42..94932.00 rows=1000000 width=35) (actual time=0.031..282.826 rows=999990 loops=1)
   Buffers: shared hit=1300726
   ->  Seq Scan on reference_table r  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.014..4.914 rows=100000 loops=1)
     	Buffers: shared hit=736
   ->  Index Scan using data_table_reference_table_id_idx on data_table d  (cost=0.42..0.83 rows=10 width=8) (actual time=0.001..0.002 rows=10 loops=100000)
     	Index Cond: (reference_table_id = r.id)
     	Buffers: shared hit=1299990
   Buffers: shared hit=76 read=1
 Planning Time: 0.536 ms
 Execution Time: 311.046 ms
(11 rows)

This shared hit count is really large. Its 1.2 million shared hit blocks.

If we believe this number, it means that we had to read a lot of data. We can actually calculate what that number is in megabytes by multiplying it with the block size. You can see that this essentially says we had 10 gigabytes of data that was loaded from the cache.

SELECT pg_size_pretty(1299990 * current_setting('block_size')::bigint);
 10156 MB
(1 row)

I don't believe that, because here in the system, the cache is only 128 megabytes. Something doesn't add up here. If you add up the data table and the data table index, they only come to a little bit under 6000 pages. It makes no sense why we're getting this large a number here.

We can actually verify this another way as well, by loading something else into the cache to move out that data. We create a table that's bigger than that shared buffer cache. And then, we run the query again, we'll now get the cold cache behavior, so the data is no longer in the Postgres buffer cache.

CREATE TABLE bigger_than_cache (id int, col1 int);
INSERT INTO bigger_than_cache SELECT i, i % 10 FROM generate_series(1, 10_000_000) AS _(i);

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM data_table d JOIN reference_table r ON (r.id = reference_table_id);
                                                                     	QUERY PLAN                                                                     	 
 Nested Loop  (cost=0.42..94932.00 rows=1000000 width=35) (actual time=1.740..316.254 rows=999990 loops=1)
   Buffers: shared hit=1294192 read=6534 written=3669
   ->  Seq Scan on reference_table r  (cost=0.00..1736.00 rows=100000 width=27) (actual time=0.358..9.938 rows=100000 loops=1)
     	Buffers: shared read=736 written=414
   ->  Index Scan using data_table_reference_table_id_idx on data_table d  (cost=0.42..0.83 rows=10 width=8) (actual time=0.001..0.002 rows=10 loops=100000)
     	Index Cond: (reference_table_id = r.id)
     	Buffers: shared hit=1294192 read=5798 written=3255
   Buffers: shared hit=4 read=4
 Planning Time: 1.563 ms
 Execution Time: 344.762 ms
(11 rows)

It shows 5700 blocks read. That's exactly the size of the table plus the size of the index. That's the amount of data it had to read.

How Nested Loops count shared buffers hit

But then it had a lot of shared hit counts, which don't really make sense. So what's going on here? Really, this is a function of the Nested Loop. Postgres iterates over a table or an index again and again in a Nested Loop, in this case 100000 loops. And each time it looks at a page inside this Nested Loop, the hit count goes up. We might be looking at the exact same page, 10 times, 100 times, 1000 times, in that whole query execution, but that doesn't mean that this much data would have to be read when this is not in cache. It just means that Postgres had to look at this page again and again.

And unfortunately, we don't have a better way of tracking this in Postgres today. What we would want is how many distinct pages were hit. If the same page keeps being hit in cache a thousand times, that's not really interesting. But if it's a thousand different pages, that's interesting.

Run a cold cache test for Nested Loops

So, watch out if you have Nested Loops, this is a problem mainly with Nested Loops, then you should not trust the shared hit counts to be accurate. You can definitely trust the read counts to be accurate, but if you have a really high hit count, then try to do a cold cache test like we're doing here to illustrate how much data would actually have to be read from disk, or if it's just the same page being hit again and again from cache.

Thanks for joining us for E83 of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres

Enjoy blog posts like this?

Get them once a month to your inbox