Waiting for Postgres 17: Streaming I/O for sequential scans & ANALYZE

In today’s E112 of “5mins of Postgres” we discuss streaming and vectored I/O in Postgres 17. This is an important step towards implementing asynchronous I/O in Postgres. Postgres 17 does not yet have asynchronous I/O, but we do see a performance benefit from what is essentially refactoring work on the internal APIs.

In this episode, I give you a behind-the-scenes view of my understanding of the changes made in 17 and how they impact future asynchronous I/O work.





Transcript

We'll start with this commit by Thomas Munro, co-authored by Andres Freund, which introduces a vectored variant of an important internal API called ReadBuffer().

A better API for read I/O inside Postgres

The role of the internal ReadBuffer API

Now, a "buffer", also sometimes called a page, is essentially the fundamental unit Postgres works with when reading data. For example, when Postgres does a sequential read, it will read many, many buffers to go through your table sequentially.

This important internal API now has a way to start a read and then wait for the result for one or more buffers, instead of reading each one at a time. In the future this will allow actual asynchronous I/O to happen, but in the meantime, just the fact that the internal APIs are now refactored to work with multiple blocks at a time already allows some performance optimizations in certain cases.

How this fits into Asynchronous I/O (AIO)

If you're interested in how this all fits together with asynchronous I/O, Andres Freund, who is one of the principal authors behind the asynchronous I/O work, gave a talk at PGConf.EU (slides, see video recording below) in which he covered both the work that Thomas did in 17 to introduce this important building block, the new read stream API, and how it relates to all the other work that needs to happen to actually see asynchronous I/O in Postgres:



Changes in Postgres 17

New internal APIs for streaming reads

Now, there are two main things that the streaming and vectored I/O work in 17 does:

  • systematically issue POSIX_FADV_WILLNEED for random access, replacing patchy ad hoc advice
  • build larger vectored I/Os; eg one preadv() call can replace 16 pread() calls

Thomas Munro in the "Streaming I/O, vectored I/O (WIP)" mailing list thread

The first change here is to more systematically issue POSIX_FADV_WILLNEED calls for random page access. The kernel uses this information to decide which pages to load ahead of time into the operating system page cache, making it a lot faster when Postgres actually reads that data.

This is needed, for example, when you run ANALYZE and sample a random portion of your table.
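
To make that kernel hint concrete, here is a minimal standalone C sketch (not the actual Postgres code, just an illustration) of what issuing POSIX_FADV_WILLNEED for a single 8 kB block looks like at the OS level:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>      /* posix_fadvise(), POSIX_FADV_WILLNEED */
#include <unistd.h>

#define BLCKSZ 8192     /* Postgres' default block size */

/*
 * Tell the kernel we will soon read the given block, so it can start
 * loading it into the OS page cache in the background. The actual read
 * still happens later, via pread()/preadv().
 */
static void
hint_block_willneed(int fd, unsigned int blockno)
{
    (void) posix_fadvise(fd, (off_t) blockno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}

Inside Postgres this goes through its own wrappers (such as PrefetchBuffer), but the effect on the kernel is the same.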

And then the other thing is building larger vectored I/Os, i.e. making larger I/O requests, for example when doing a sequential scan on a table.
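
To illustrate what that means at the syscall level, here is a minimal sketch (again not Postgres code; the 16-block count is just an assumption matching the default 128 kB combined read) that reads 16 consecutive 8 kB blocks into 16 separate buffers with a single preadv() call, instead of 16 individual pread() calls:

#include <sys/uio.h>    /* preadv(), struct iovec */
#include <unistd.h>

#define BLCKSZ   8192
#define NBLOCKS  16     /* 16 x 8 kB = 128 kB */

/*
 * Read NBLOCKS consecutive blocks starting at start_block with one
 * vectored read. Each destination buffer can live at a different address
 * (think: different pages in shared buffers), but the file range that is
 * read is contiguous, which is exactly the sequential scan case.
 */
static ssize_t
read_blocks_vectored(int fd, unsigned int start_block, char *buffers[NBLOCKS])
{
    struct iovec iov[NBLOCKS];

    for (int i = 0; i < NBLOCKS; i++)
    {
        iov[i].iov_base = buffers[i];
        iov[i].iov_len = BLCKSZ;
    }

    /* One syscall instead of NBLOCKS pread() calls */
    return preadv(fd, iov, NBLOCKS, (off_t) start_block * BLCKSZ);
}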

In 17, only a small subset of the code uses these new APIs. The work in Postgres 17 is mainly intended to test the waters and show some initial benefits. There are three commits worth mentioning:

pg_prewarm, an example of how Postgres code needs to change

First of all, Thomas added a simple example by changing how "pg_prewarm" does I/O. pg_prewarm is a contrib extension which is used to load a full relation (table or index) into shared buffers, and it will likely see a small performance benefit from this change.

This uses the new streaming API directly, sidestepping ReadBuffer, since it doesn't actually do anything with the contents of the buffers it loads:

@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
    }
    else if (ptype == PREWARM_BUFFER)
    {
+       struct pg_prewarm_read_stream_private p;
+       ReadStream *stream;
+
        /*
         * In buffer mode, we actually pull the data into shared_buffers.
         */
+
+       /* Set up the private state for our streaming buffer read callback. */
+       p.blocknum = first_block;
+       p.last_block = last_block;
+
+       stream = read_stream_begin_relation(READ_STREAM_FULL,
+                                           NULL,
+                                           rel,
+                                           forkNumber,
+                                           pg_prewarm_read_stream_next_block,
+                                           &p,
+                                           0);
+
        for (block = first_block; block <= last_block; ++block)
        {
            Buffer      buf;
 
            CHECK_FOR_INTERRUPTS();
-           buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+           buf = read_stream_next_buffer(stream, NULL);
            ReleaseBuffer(buf);
            ++blocks_done;
        }
+       Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+       read_stream_end(stream);
    }
 
    /* Close relation, release lock. */
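
The diff above passes a callback, pg_prewarm_read_stream_next_block, that tells the stream which block to read next. That callback isn't shown in the hunk, but based on the private state it sets up (blocknum and last_block), it looks roughly like this (a simplified sketch; the real code in the Postgres source may differ in details):

/*
 * Sketch of the block number callback: the read stream calls this each
 * time it wants to know the next block to read. We walk from the first
 * block to the last block and return InvalidBlockNumber once we're done,
 * which tells the stream there is nothing left to read.
 */
static BlockNumber
pg_prewarm_read_stream_next_block(ReadStream *stream,
                                  void *callback_private_data,
                                  void *per_buffer_data)
{
    struct pg_prewarm_read_stream_private *p = callback_private_data;

    if (p->blocknum <= p->last_block)
        return p->blocknum++;

    return InvalidBlockNumber;
}

The stream uses this look-ahead to build combined reads and issue prefetch advice on its own, while the loop in pg_prewarm only ever consumes the next ready buffer via read_stream_next_buffer().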

Faster ANALYZE with streaming I/O

We also have this change authored by Nazir Bilal Yavuz, which added support for ANALYZE to prefetch and read sample blocks using the new API. This actually does improve ANALYZE performance: in my tests I sometimes saw about a 10% improvement, just from being smarter about issuing posix_fadvise calls.

The change also shows how the new API lets Postgres developers remove older code that was doing its own custom prefetching logic:

@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
    TableScanDesc scan;
    BlockNumber nblocks;
    BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-   int         prefetch_maximum = 0;   /* blocks to prefetch if enabled */
-   BlockSamplerData prefetch_bs;
-#endif
+   ReadStream *stream;
 
    Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
    randseed = pg_prng_uint32(&pg_global_prng_state);
    nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-   prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-   /* Create another BlockSampler, using the same seed, for prefetching */
-   if (prefetch_maximum)
-       (void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
    /* Report sampling block numbers */
    pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
                                 nblocks);
@@ -1187,60 +1191,19 @@ acquire_sample_rows(Relation onerel, int elevel,
    scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
    slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-   /*
-    * If we are doing prefetching, then go ahead and tell the kernel about
-    * the first set of pages we are going to want.  This also moves our
-    * iterator out ahead of the main one being used, where we will keep it so
-    * that we're always pre-fetching out prefetch_maximum number of blocks
-    * ahead.
-    */
-   if (prefetch_maximum)
-   {
-       for (int i = 0; i < prefetch_maximum; i++)
-       {
-           BlockNumber prefetch_block;
-
-           if (!BlockSampler_HasMore(&prefetch_bs))
-               break;
-
-           prefetch_block = BlockSampler_Next(&prefetch_bs);
-           PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-       }
-   }
-#endif
+   stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+                                       vac_strategy,
+                                       scan->rs_rd,
+                                       MAIN_FORKNUM,
+                                       block_sampling_read_stream_next,
+                                       &bs,
+                                       0);
 
    /* Outer loop over blocks to sample */
-   while (BlockSampler_HasMore(&bs))
+   while (heapam_scan_analyze_next_block(scan, stream))
    {
-       BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-       BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-       /*
-        * Make sure that every time the main BlockSampler is moved forward
-        * that our prefetch BlockSampler also gets moved forward, so that we
-        * always stay out ahead.
-        */
-       if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-           prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
        vacuum_delay_point();
 
-       heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-       /*
-        * When pre-fetching, after we get a block, tell the kernel about the
-        * next one we will want, if there's any left.
-        */
-       if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-           PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
        while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
        {
            /*
@@ -1290,6 +1253,8 @@ acquire_sample_rows(Relation onerel, int elevel,
                                     ++blksdone);
    }
 
+   read_stream_end(stream);
+
    ExecDropSingleTupleTableSlot(slot);
    heap_endscan(scan);
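
The callback passed to read_stream_begin_relation here, block_sampling_read_stream_next, simply pulls block numbers from the same BlockSampler that the removed code iterated manually, which is exactly why the separate prefetch sampler could go away. Roughly, as a sketch rather than the verbatim source, it amounts to:

/*
 * Sketch of the ANALYZE block number callback: hand the read stream the
 * next sampled block, or InvalidBlockNumber once the sampler is exhausted.
 * Prefetching and read combining are then handled by the stream itself.
 */
static BlockNumber
block_sampling_read_stream_next(ReadStream *stream,
                                void *callback_private_data,
                                void *per_buffer_data)
{
    BlockSamplerData *bs = callback_private_data;

    return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
}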

Faster Sequential Scan I/O through larger reads

And now, the change where some people will see a noticeable, direct performance improvement on their production workloads with 17 is the use of streaming I/O for sequential reads, authored by Melanie Plageman. This means sequential scans in Postgres will use the new API and will typically do larger reads, going from the old 8 kB read syscalls to 128 kB reads by default when reading enough data.
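
To put that in perspective for the test table used later in this episode (346 MB, about 44,000 8 kB blocks): reading the whole table takes roughly 44,000 read syscalls on 16, versus roughly 44,000 / 16 ≈ 2,800 preadv() calls on 17, assuming all reads get combined up to the full 128 kB.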

That change is slightly too large to show fully here, but the principle remains the same. Where previously there was a direct call to a variant of ReadBuffer(), there is now a streaming read:

@@ -528,25 +599,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
         */
        CHECK_FOR_INTERRUPTS();
 
-       if (unlikely(!scan->rs_inited))
+       /*
+        * If the scan direction is changing, reset the prefetch block to the
+        * current block. Otherwise, we will incorrectly prefetch the blocks
+        * between the prefetch block and the current block again before
+        * prefetching blocks in the new, correct scan direction.
+        */
+       if (unlikely(scan->rs_dir != dir))
        {
-               scan->rs_cblock = heapgettup_initial_block(scan, dir);
+               scan->rs_prefetch_block = scan->rs_cblock;
+               read_stream_reset(scan->rs_read_stream);
+       }
 
-               /* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-               Assert(scan->rs_cblock != InvalidBlockNumber ||
-                          !BufferIsValid(scan->rs_cbuf));
+       scan->rs_dir = dir;
 
-               scan->rs_inited = true;
-       }
-       else
-               scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock,
-                                                                                                  dir);
-
-       /* read block if valid */
-       if (BlockNumberIsValid(scan->rs_cblock))
-               scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-                                                                                  scan->rs_cblock, RBM_NORMAL,
-                                                                                  scan->rs_strategy);
+       scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+       if (BufferIsValid(scan->rs_cbuf))
+               scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }

Testing the performance benefits

The test setup

Let's take a look at this in an example. I have three virtual machines running here: one with 16, one with 17, and one with 17 without the streaming sequential scan patch just mentioned. These are running on Amazon EC2, c5.xlarge instance type, with 100 GB of io1 storage attached as their root filesystem.

Besides comparing 16 and 17, I also want to separate other performance improvements in 17 from just this individual addition of the streaming read API.

On my first virtual machine, I'm just going to create a table with 10 million records:

postgres16=# CREATE TABLE test AS SELECT * FROM generate_series(1,10000000);
SELECT 10000000
Time: 48281.089 ms (00:48.281)

I'm going to do the same on the second one. And the same on the third one, which is 17 without that patch.

This table is about 346 MB:

postgres16=# \dt+ test
                                  List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |  Size  | Description 
--------+------+-------+--------+-------------+---------------+--------+-------------
 public | test | table | ubuntu | permanent   | heap          | 346 MB | 
(1 row)

Now, this system has plenty of memory, so the table definitely fits in the OS page cache, but it won't fit into shared buffers. "shared_buffers" on this system is just the default of 128 megabytes:

postgres16=# SHOW shared_buffers;
 shared_buffers 
----------------
 128MB
(1 row)

Also, a sequential scan uses a ring buffer strategy, so it will not use all of shared buffers even when you read in the whole table.

And I'm going to turn off parallel query here because I don't want this to interfere with the timing information:

SET max_parallel_workers_per_gather = 0;

To keep things simple, in this test we will focus on testing the impact of Postgres reading from the OS page cache, not the underlying disk. That means any performance improvements we'll see are most likely due to reduced syscall overhead, or other internal optimizations between Postgres and the Linux kernel.

Let's look at the performance difference here of running a SELECT COUNT(*) on this table.

Baseline comparison between 16 and 17

Here on 16, on the first run, we get 476 milliseconds:

postgres16=# SELECT COUNT(*) FROM test;
  count   
----------
 10000000
(1 row)

Time: 476.639 ms

I'm going to do the same on 17:

postgres17=# SELECT COUNT(*) FROM test;
  count   
----------
 10000000
(1 row)

Time: 422.464 ms

The first run looks promising, but let's run this a few more times. Here we can see the best of three runs on 17 was 410 milliseconds:

postgres17=# SELECT COUNT(*) FROM test;
  count   
----------
 10000000
(1 row)

Time: 410.469 ms

Now let's run a few more on 16 as well. Comparing the two, we can clearly see there is about a 60-70 millisecond difference between the versions.

In this case, 17 is noticeably faster for the sequential scan.

The impact of the streaming I/O for sequential scans

Now let's isolate the change and see whether this was just because of this patch. I took the latest development branch, reverted Melanie's patch, and then rebuilt my local Postgres to look at what kind of difference that makes.

Here, the performance is about 20-30ms slower without the patch, compared to 410 ms earlier:

postgres17-without-streaming-seqscan=# SELECT COUNT(*) FROM test;
  count   
----------
 10000000
(1 row)

Time: 425.217 ms

Now, one other thing we can do is run an EXPLAIN ANALYZE. The reason I like this is that it shows the performance difference a bit more explicitly.

EXPLAIN ANALYZE always has a little bit of overhead, but you can see that in this particular example on 17, we take about 520 ms with the patch (the actual time=0.030..521.069 part of the Seq Scan):

postgres17=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM test;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169293.16..169293.17 rows=1 width=8) (actual time=934.996..934.998 rows=1 loops=1)
   Buffers: shared hit=2592 read=41696
   ->  Seq Scan on test  (cost=0.00..144292.13 rows=10000413 width=0) (actual time=0.030..521.069 rows=10000000 loops=1)
         Buffers: shared hit=2592 read=41696
 Planning Time: 0.033 ms
 Execution Time: 935.016 ms
(6 rows)

Time: 935.336 ms
postgres17=# 

Versus on 17 without the patch, where we take 575 ms (actual time=0.041..575.851):

postgres17-without-streaming-seqscan=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM test;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169308.96..169308.97 rows=1 width=8) (actual time=1007.480..1007.481 rows=1 loops=1)
   Buffers: shared hit=2464 read=41824
   ->  Seq Scan on test  (cost=0.00..144304.77 rows=10001677 width=0) (actual time=0.041..575.851 rows=10000000 loops=1)
         Buffers: shared hit=2464 read=41824
 Planning Time: 0.031 ms
 Execution Time: 1007.500 ms
(6 rows)

As we can see, in an EXPLAIN ANALYZE measurement there is about a 50 millisecond difference from just that one patch being applied. This stays consistent across multiple runs.

Small reads are a big problem for some filesystems

Additionally, sometimes there is actual performance overhead to doing small reads.

For example, last year, around the time Thomas proposed this new patch series, somebody reported on the Postgres performance mailing list that they were using a zstd-compressed btrfs filesystem that does not do any I/O merging. This can clearly cause issues with the 8 kB reads that Postgres does today in 16, since the filesystem has to do a lot of work for each individual I/O operation.

Now on 17, instead of 8 kB reads, by default you'll get 128 kB reads in this case:

postgres=# select count(*) from t;

preadv(46, ..., 8, 256237568) = 131072
preadv(46, ..., 5, 256368640) = 131072
preadv(46, ..., 8, 256499712) = 131072
preadv(46, ..., 5, 256630784) = 131072
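
Each preadv() call here returns 131072 bytes, i.e. 128 kB, so what used to be 16 separate 8 kB reads is now served by a single syscall.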

But you can go a step further and use the new "io_combine_limit" Postgres setting to change the size of the I/Os. If larger I/Os make a difference for your filesystem, you can, for example, say:

SET io_combine_limit = '256k';

And this will cause the read calls that Postgres issues to the OS (for users of the new streaming I/O API) to actually be up to 256 kB reads:

postgres=# set io_combine_limit = '256k';
postgres=# select count(*) from t;

preadv(47, ..., 5, 613728256) = 262144
preadv(47, ..., 5, 613990400) = 262144
preadv(47, ..., 5, 614252544) = 262144
preadv(47, ..., 5, 614514688) = 262144

In conclusion

As shown, this already provides a nice benefit in 17, and of course I'm really excited about Postgres 18 and all of the asynchronous I/O work that we'll hopefully see there.

Well done to Thomas, Melanie, Nazir, and everybody else who worked on this. I'm excited to see an important building block of the AIO work land in 17.

I hope you learned something new from E112 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!
