Waiting for Postgres 17: Benchmarking with pg_buffercache_evict

In today’s E110 of “5mins of Postgres” we discuss the pg_buffercache_evict function recently committed to Postgres 17, that lets you remove individual pages from the Postgres shared buffer cache. We show how this can be used for testing query performance on a test system, and the impact of double buffering. We also discuss how to flush the OS page cache for a given table to run a benchmark on a query and avoid confusing EXPLAIN (ANALYZE, BUFFERS) output.



Share this episode: Click here to share this episode on LinkedIn or on X/Twitter. Feel free to sign up for our newsletter and subscribe to our YouTube channel.


Transcript

Whenever we talk about Postgres 17, it's important to remind you that these features may still change. So there's a chance that this API changes until the final release or sometimes features get reverted because there were unforeseen issues with it. Now, additionally for this particular feature, I want to say that this is intended for developer testing.

This functionality authored by Palak Chaturvedi and committed by Thomas Munro is intended for Postgres developers that are testing the shared buffer cache behavior. It is not intended for end users. However, I think there is a benefit for some of us doing benchmarking. The most important thing to understand is that this new functionality does not work well under systems that see a lot of concurrency. So this really only works when you're testing things in isolation, not in your production database, but potentially on a clone of your production database, when you're doing benchmarking.

How pg_buffercache_evict works

Earlier this week, Daniel Westermann also wrote a good blog post summarizing how this new pg_buffercache_evict function works. Let's take a quick look at the basic functionality, and then I'm going to show you how we can use this during benchmarking. This new function is part of the pg_buffercache extension.

And if you're not familiar with pg_buffercache, it essentially shows you the contents of the internal shared buffer cache that Postgres uses to cache the data in your tables directly in Postgres shared memory. This is in addition to the operating systems page cache, which also caches your tables in certain cases.

So this is why Postgres has this double caching, and this can sometimes be a little confusing. So to be clear, we're talking about the internal Postgres cache here.

Now, the new pg_buffercache_evict function is pretty straight forward. It takes a single argument, the buffer ID from the buffer cache, and then it removes that page from the buffer cache:

SELECT pg_buffercache_evict(123);

Now, obviously as mentioned in the commit message, there is a concurrency issue here. So the buffer cache might change, you query it first to get the buffer ID and then some other process in Postgres might choose to do something with that buffer and then you're essentially evicting the wrong thing. So again, only use this during testing.

But the benefit here is that now we can say, if our table is loaded in the shared buffer cache, we can now explicitly remove it from the shared buffer cache to be able to test caching behavior for example. I want to show you how this works.

A practical test run on a VM

So I have an EC2 instance running here with an EBS volume attached, I'm going to come back to why that matters in a second. And we're going to connect to this and we're going to make a table.

The test setup

So the table here is pretty straightforward. We're just making a test table with a million records:

CREATE TABLE test AS SELECT * FROM generate_series(1, 1000000);

And this table is pretty small, it's only 35 megabytes:

\dt+ test
                                  List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method | Size  | Description 
--------+------+-------+----------+-------------+---------------+-------+-------------
 public | test | table | postgres | permanent   | heap          | 35 MB | 
(1 row)

Now, this will fit in shared buffers. So if I'm looking at my "shared_buffers" here, it's 128 megabytes:

SHOW shared_buffers;
 shared_buffers 
----------------
 128MB
(1 row)

A simple sequential scan

So clearly enough space for this. Now if I just run a simple "EXPLAIN (ANALYZE, BUFFERS)" that selects all the data from that table, we'll see that right now, about half of that table is in shared buffers:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.023..51.179 rows=1000000 loops=1)
   Buffers: shared hit=2304 read=2176
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.059 ms
 Execution Time: 81.833 ms
(6 rows)

Using pg_prewarm to load the full table

Now, I can use the pg_prewarm extension to force it to load the full table into shared buffers:

CREATE EXTENSION pg_prewarm;
SELECT pg_prewarm('test');
 pg_prewarm 
------------
       4480
(1 row)

As you can see, this takes down the execution time, slightly to be around 80 milliseconds in time:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.007..48.450 rows=1000000 loops=1)
   Buffers: shared hit=4480
 Planning Time: 0.028 ms
 Execution Time: 79.131 ms
(4 rows)

Confirming buffer cache contents

Now, if I look at my shared buffer contents, you can see that in the shared buffer contents, most of it is that table:

SELECT n.nspname, c.relname, count(*) AS buffers
    FROM pg_buffercache b JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid) AND
    b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
    JOIN pg_namespace n ON n.oid = c.relnamespace
    GROUP BY n.nspname, c.relname
    ORDER BY 3 DESC
    LIMIT 10;
  nspname   |             relname             | buffers 
------------+---------------------------------+---------
 public     | test                            |    4485
 pg_catalog | pg_class                        |      14
 pg_catalog | pg_proc_proname_args_nsp_index  |       8
 pg_catalog | pg_proc                         |       6
 pg_catalog | pg_attribute                    |       5
 pg_catalog | pg_attribute_relid_attnum_index |       5
 pg_catalog | pg_statistic                    |       4
 pg_catalog | pg_type_oid_index               |       4
 pg_catalog | pg_class_oid_index              |       4
 pg_catalog | pg_depend_depender_index        |       3
(10 rows)

Using pg_buffercache_evict

And I'm going to use the new pg_buffercache_evict function to remove this table from shared buffers:

SELECT DISTINCT pg_buffercache_evict(bufferid)
  FROM pg_buffercache
 WHERE relfilenode = pg_relation_filenode('test');
 pg_buffercache_evict 
----------------------
 t

I can first check, it's no longer in shared buffers and now I can run EXPLAIN ANALYZE again, and you can see that now it does "shared read" instead of "shared hit":

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.037..53.124 rows=1000000 loops=1)
   Buffers: shared read=4480
 Planning Time: 0.028 ms
 Execution Time: 83.909 ms

So the ANALYZE, BUFFERS option will show you where the data is coming from. In this case, when it says read, that means it goes to the disk, or the OS page cache. In the case of hit, it goes to the shared buffer cache.

Now the performance here didn't really change and that might seem surprising, right? Because things are out of cache, why is it still fast?

The impact of the OS page cache

And so the reason for that is, that the table is still in the operating system's internal caches, which are separate from Postgres, which the buffer cache evict doesn't touch.

Now, we can confirm that by first getting the location of our table:

SELECT current_setting('data_directory') || '/' || pg_relation_filepath('test');
/usr/local/pgsql/data/base/16384/16422

So each table in Postgres has files, one or more files. The main file for this table is in this location.

And so on my virtual machine here, I'm going to use the fincore utility which is built into Linux, which counts how much of a particular file is in the operating system's page cache:

fincore /usr/local/pgsql/data/base/16384/16422
  RES PAGES SIZE FILE
  35M  8960  35M /usr/local/pgsql/data/base/16384/16422

So this file here has 35 MB, matches our table size. And out of that 35 MB file, all of it is currently in the page cache, which is the left most value.

Evicting data from the OS page cache with dd / POSIX_FADV_DONTNEED

Now, there are ways for us to evict something, not just from the shared buffer cache, but also from the OS page cache. Here, we've got to use Linux utilities. So this you can not do on a managed service provider unless they do it for you. But we can do it on our own virtual machine. And so I didn't know this before I was researching this episode, we can use the "dd" utility to evict things from the OS page cache for us, see also discussion on StackExchange.

The magic flag here is this oflag=nocache. The other flags are required to avoid dd from actually modifying the file, and to sync the change:

dd oflag=nocache conv=notrunc,fdatasync count=0 of=/usr/local/pgsql/data/base/16384/16422
0+0 records in
0+0 records out
0 bytes copied, 0.00455487 s, 0.0 kB/s

What this will do is it will tell the Linux kernel, "Hey, this file is no longer needed. You can drop it from the OS page cache". We can confirm this with the "fincore" utility:

fincore /usr/local/pgsql/data/base/16384/16422
RES PAGES SIZE FILE
 0B     0  35M /usr/local/pgsql/data/base/16384/16422

So the file is still 35 MB in size, but none of it is in the OS page cache.

Testing the cold cache case

We can run another test here and let's just evict the buffer cache one more time:

SELECT DISTINCT pg_buffercache_evict(bufferid)
  FROM pg_buffercache
 WHERE relfilenode = pg_relation_filenode('test');
 pg_buffercache_evict 
----------------------
 t

So right now, nothing in the buffer cache, nothing in the OS page cache. Let's see how long our EXPLAIN (ANALYZE, BUFFERS) takes:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test;
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=1.397..206.896 rows=1000000 loops=1)
   Buffers: shared read=4480
 Planning Time: 0.027 ms
 Execution Time: 237.682 ms
(4 rows)

237 milliseconds. This is a big difference from the 83 milliseconds we saw earlier, even though we got "shared read", which means Postgres didn't have a cache hit, that was still fast because the OS page cache had a cache hit.

Now that this has run, we can see that now it's again in the OS page cache:

fincore /usr/local/pgsql/data/base/16384/16422
  RES PAGES SIZE FILE
  35M  8960  35M /usr/local/pgsql/data/base/16384/16422

Sequential scans use ring buffers to avoid the cache being filled

And so if I run this again, what you will see is that Postgres actually did not load the whole file into the Postgres shared buffers:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.026..53.446 rows=1000000 loops=1)
   Buffers: shared hit=33 read=4447
 Planning Time: 0.027 ms
 Execution Time: 84.250 ms
(4 rows)

And that's really because we used a sequential scan here, and Postgres has special logic when you are doing sequential scans to not evict other contents from shared buffers to make space for that sequential scan. And so what you're seeing right now actually is that all of this data here is still coming from the OS page cache for the most part.

Testing shared buffer cache without OS page cache

Now let's do one more test, which is we'll force the file to be loaded here again with pg_prewarm:

SELECT pg_prewarm('test');
 pg_prewarm 
------------
       4480

Which means that now we get "shared hit" for the full table:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.006..48.463 rows=1000000 loops=1)
   Buffers: shared hit=4480
 Planning Time: 0.026 ms
 Execution Time: 79.160 ms
(4 rows)

And now we're going to evict the whole file from the OS page cache, but keep it in the buffer cache:

dd oflag=nocache conv=notrunc,fdatasync count=0 of=/usr/local/pgsql/data/base/16384/16422
0+0 records in
0+0 records out
0 bytes copied, 0.00316387 s, 0.0 kB/s

And so if we run this again:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test;
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.009..48.528 rows=1000000 loops=1)
   Buffers: shared hit=4480
 Planning Time: 0.028 ms
 Execution Time: 79.227 ms
(4 rows)

You see it's still takes 80 milliseconds. Because right now Postgres is doing all the caching for us, and it's fast again.

If I now were to look at what's cached on the OS side, still nothing:

fincore /usr/local/pgsql/data/base/16384/16422
RES PAGES SIZE FILE
 0B     0  35M /usr/local/pgsql/data/base/16384/16422

And so now it can drop it again from the Postgres side, and now it's slow again:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test;
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14480.00 rows=1000000 width=4) (actual time=0.621..223.764 rows=1000000 loops=1)
   Buffers: shared read=4480
 Planning Time: 0.027 ms
 Execution Time: 255.362 ms
(4 rows)

In conclusion

This is actually a very important utility when you're doing this level of testing and you want to be sure that your query that you're testing is not in cache, you want to test cold cache behavior.

Now, with pg_buffercache_evict, and if you have enough system access, you can actually do very fine grained testing on a system that doesn't have much concurrency. So this doesn't really intend to solve all the benchmarking use cases, but I think it's a very useful utility that we can now use to do more reliable testing of our Postgres queries.

Thank you Thomas and Palak for working on this and I'm looking forward to seeing this in Postgres 17.

I hope you learned something new from E110 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres


Enjoy blog posts like this?

Get them once a month to your inbox