Waiting for Postgres 17: Streaming I/O for sequential scans & ANALYZE
In today’s E112 of “5mins of Postgres” we discuss streaming and vectored I/O in Postgres 17. This is an important step towards implementing asynchronous I/O in Postgres. In Postgres 17 we do not yet have asynchronous I/O, but we do see a performance benefit from what is essentially refactoring work around the internal APIs.
In this episode, I give you a little bit of a behind the scenes view on my understanding of the changes made in 17 and how they impact future asynchronous I/O work.
Transcript
We'll start with this commit by Thomas Munro, co-authored by Andres Freund, which introduces a vectored variant of an important internal API called ReadBuffer().
A better API for read I/O inside Postgres
The role of the internal ReadBuffer API
Now, a "buffer", also sometimes called a page, is essentially the fundamental unit Postgres works with. For example, when Postgres is doing a sequential read, it will read many, many buffers to read from your table sequentially.
This important internal API now has a way to start a read and then wait for the result for one or more buffers, instead of reading each one at a time. In the future this will allow actual asynchronous I/O to happen, but in the meantime, just the fact that the internal APIs are now refactored to work with multiple blocks at a time will already allow some performance optimization in certain cases.
How this fits into Asynchronous I/O (AIO)
If you're interested in how this all fits together with asynchronous I/O, Andres Freund, who is one of the principal authors behind the asynchronous I/O work, did a talk at PGConf.EU (slides, see video recording below) where he talked about both the work that Thomas did in 17, to essentially introduce this important building block of the new read stream API, but also how this relates to all the other work that needs to happen to actually see asynchronous I/O in Postgres:
Changes in Postgres 17
New internal APIs for streaming reads
Now, there are two main things that the streaming and vectored I/O work in 17 is doing:
- systematically issue POSIX_FADV_WILLNEED for random access, replacing patchy ad hoc advice
- build larger vectored I/Os; eg one preadv() call can replace 16 pread() calls
Thomas Munro in the "Streaming I/O, vectored I/O (WIP)" mailing list thread
The first change here is to more systematically issue POSIX_FADV_WILLNEED calls for random page access. This information is utilized by the kernel to decide which pages to load ahead of time into the operating system page cache, making it a lot faster when that data is actually read by Postgres.
This is needed, for example, when you're doing ANALYZE and sampling a random portion of your table.
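To get a feel for what this kernel hint does, here is a minimal sketch outside of Postgres using Python's os.posix_fadvise, which wraps the same posix_fadvise() call (the file path and block numbers are made up for illustration; POSIX_FADV_WILLNEED is available on Linux and most Unix systems):

```python
import os

BLCKSZ = 8192  # Postgres' default block size

# Create a small 512 kB test file to sample from.
path = "/tmp/fadvise_demo.dat"
with open(path, "wb") as f:
    f.write(os.urandom(BLCKSZ * 64))

fd = os.open(path, os.O_RDONLY)
try:
    # Tell the kernel we'll soon need two non-adjacent blocks, so it can
    # start loading them into the page cache before we actually read them.
    for block in (3, 42):
        os.posix_fadvise(fd, block * BLCKSZ, BLCKSZ, os.POSIX_FADV_WILLNEED)
    # The later read should then find the data already in the page cache.
    data = os.pread(fd, BLCKSZ, 42 * BLCKSZ)
finally:
    os.close(fd)

print(len(data))
```

The advice is purely a hint: the read succeeds either way, but when the kernel acts on it, the data is already cached by the time the actual read happens.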
And the other change is building larger vectored I/Os, meaning larger I/O requests, for example when doing a sequential scan on a table.
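The "one preadv() call can replace 16 pread() calls" idea from the quote above can be demonstrated directly with Python's os.pread and os.preadv, which wrap the same syscalls (the file path is made up for illustration):

```python
import os

BLCKSZ = 8192   # Postgres' default block size
NBLOCKS = 16    # 16 blocks = 128 kB, the default combined read size in 17

# Create a 128 kB test file with a recognizable byte pattern.
path = "/tmp/preadv_demo.dat"
with open(path, "wb") as f:
    f.write(bytes(range(256)) * (BLCKSZ * NBLOCKS // 256))

fd = os.open(path, os.O_RDONLY)
try:
    # Old style: one pread() syscall per 8 kB block, 16 syscalls total.
    blocks_pread = [os.pread(fd, BLCKSZ, i * BLCKSZ) for i in range(NBLOCKS)]

    # New style: one preadv() syscall filling 16 buffers (128 kB) at once.
    buffers = [bytearray(BLCKSZ) for _ in range(NBLOCKS)]
    nread = os.preadv(fd, buffers, 0)
finally:
    os.close(fd)

# Both approaches return identical data; the vectored version just needs
# a single syscall for the whole 128 kB.
assert all(bytes(buf) == blk for buf, blk in zip(buffers, blocks_pread))
print(nread)
```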
In 17, we have only a small subset of users of these new APIs. The work in Postgres 17 is mainly intended to test the waters and see some initial benefits. There are three commits worth mentioning:
pg_prewarm, an example of how Postgres code needs to change
First of all, Thomas added a simple example by changing how "pg_prewarm" does I/O. pg_prewarm is a contrib extension that is used to load a full relation (table or index) into shared buffers, and it will likely see a small performance benefit from this change.
This uses the new streaming API directly, sidestepping ReadBuffer(), since it doesn't actually use the buffers it loads:
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
}
else if (ptype == PREWARM_BUFFER)
{
+ struct pg_prewarm_read_stream_private p;
+ ReadStream *stream;
+
/*
* In buffer mode, we actually pull the data into shared_buffers.
*/
+
+ /* Set up the private state for our streaming buffer read callback. */
+ p.blocknum = first_block;
+ p.last_block = last_block;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ NULL,
+ rel,
+ forkNumber,
+ pg_prewarm_read_stream_next_block,
+ &p,
+ 0);
+
for (block = first_block; block <= last_block; ++block)
{
Buffer buf;
CHECK_FOR_INTERRUPTS();
- buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+ buf = read_stream_next_buffer(stream, NULL);
ReleaseBuffer(buf);
++blocks_done;
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
}
/* Close relation, release lock. */
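The control flow in this diff, a callback that hands out the next block number and a stream that hands back buffers, can be sketched in Python. This is a simplified conceptual analogue, not the real Postgres API; the class and names are made up for illustration:

```python
from collections import deque

class ReadStream:
    """Simplified analogue of Postgres' read stream: it pulls block
    numbers from a callback and groups them into larger batches, which
    Postgres would issue as one vectored read."""

    def __init__(self, next_block_cb, batch_size=16):
        self.next_block_cb = next_block_cb
        self.batch_size = batch_size
        self.pending = deque()

    def next_buffer(self):
        if not self.pending:
            # Collect up to batch_size block numbers from the callback.
            batch = []
            while len(batch) < self.batch_size:
                block = self.next_block_cb()
                if block is None:  # stand-in for InvalidBlockNumber
                    break
                batch.append(block)
            # Pretend we read all of them in one combined I/O.
            self.pending.extend(f"buffer-for-block-{b}" for b in batch)
        return self.pending.popleft() if self.pending else None

# Callback state, like pg_prewarm_read_stream_private: blocks 0 through 4.
state = {"blocknum": 0, "last_block": 4}

def next_block():
    if state["blocknum"] > state["last_block"]:
        return None
    block = state["blocknum"]
    state["blocknum"] += 1
    return block

stream = ReadStream(next_block)
buffers = []
while (buf := stream.next_buffer()) is not None:
    buffers.append(buf)
print(len(buffers))
```

The inversion of control is the key point: the caller no longer asks for block N directly, it only supplies a callback, which leaves the stream free to look ahead, batch, and (in the future) start reads asynchronously.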
Faster ANALYZE with streaming I/O
We also have this change, authored by Nazir Bilal Yavuz, which added support for ANALYZE to prefetch and read sample blocks using the new API. And this actually does improve ANALYZE performance: in my tests, sometimes about a 10% performance improvement, just by being smarter about issuing posix_fadvise calls.
The change also shows how the new API lets Postgres developers remove some older code that was doing its own custom prefetching logic:
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
TableScanDesc scan;
BlockNumber nblocks;
BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
- int prefetch_maximum = 0; /* blocks to prefetch if enabled */
- BlockSamplerData prefetch_bs;
-#endif
+ ReadStream *stream;
Assert(targrows > 0);
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
randseed = pg_prng_uint32(&pg_global_prng_state);
nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
-#ifdef USE_PREFETCH
- prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
- /* Create another BlockSampler, using the same seed, for prefetching */
- if (prefetch_maximum)
- (void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
/* Report sampling block numbers */
pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
nblocks);
@@ -1187,60 +1191,19 @@ acquire_sample_rows(Relation onerel, int elevel,
scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
slot = table_slot_create(onerel, NULL);
-#ifdef USE_PREFETCH
-
- /*
- * If we are doing prefetching, then go ahead and tell the kernel about
- * the first set of pages we are going to want. This also moves our
- * iterator out ahead of the main one being used, where we will keep it so
- * that we're always pre-fetching out prefetch_maximum number of blocks
- * ahead.
- */
- if (prefetch_maximum)
- {
- for (int i = 0; i < prefetch_maximum; i++)
- {
- BlockNumber prefetch_block;
-
- if (!BlockSampler_HasMore(&prefetch_bs))
- break;
-
- prefetch_block = BlockSampler_Next(&prefetch_bs);
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
- }
- }
-#endif
+ stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+ vac_strategy,
+ scan->rs_rd,
+ MAIN_FORKNUM,
+ block_sampling_read_stream_next,
+ &bs,
+ 0);
/* Outer loop over blocks to sample */
- while (BlockSampler_HasMore(&bs))
+ while (heapam_scan_analyze_next_block(scan, stream))
{
- BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
- BlockNumber prefetch_targblock = InvalidBlockNumber;
-
- /*
- * Make sure that every time the main BlockSampler is moved forward
- * that our prefetch BlockSampler also gets moved forward, so that we
- * always stay out ahead.
- */
- if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
- prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
vacuum_delay_point();
- heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
- /*
- * When pre-fetching, after we get a block, tell the kernel about the
- * next one we will want, if there's any left.
- */
- if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
- PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
{
/*
@@ -1290,6 +1253,8 @@ acquire_sample_rows(Relation onerel, int elevel,
++blksdone);
}
+ read_stream_end(stream);
+
ExecDropSingleTupleTableSlot(slot);
heap_endscan(scan);
Faster Sequential Scan I/O through larger reads
And now the change where some people will see a noticeable, direct performance improvement on their production workloads with 17: using streaming I/O for sequential reads, authored by Melanie Plageman. This means sequential scans in Postgres will use the new API and will typically do larger reads, going from the old 8 kB read syscalls to 128 kB reads by default when reading enough data.
That change is slightly too large to show fully here, but the principle remains the same. Where previously there was a direct call to a variant of ReadBuffer(), there is now a streaming read:
@@ -528,25 +599,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
*/
CHECK_FOR_INTERRUPTS();
- if (unlikely(!scan->rs_inited))
+ /*
+ * If the scan direction is changing, reset the prefetch block to the
+ * current block. Otherwise, we will incorrectly prefetch the blocks
+ * between the prefetch block and the current block again before
+ * prefetching blocks in the new, correct scan direction.
+ */
+ if (unlikely(scan->rs_dir != dir))
{
- scan->rs_cblock = heapgettup_initial_block(scan, dir);
+ scan->rs_prefetch_block = scan->rs_cblock;
+ read_stream_reset(scan->rs_read_stream);
+ }
- /* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
- Assert(scan->rs_cblock != InvalidBlockNumber ||
- !BufferIsValid(scan->rs_cbuf));
+ scan->rs_dir = dir;
- scan->rs_inited = true;
- }
- else
- scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock,
- dir);
-
- /* read block if valid */
- if (BlockNumberIsValid(scan->rs_cblock))
- scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
- scan->rs_cblock, RBM_NORMAL,
- scan->rs_strategy);
+ scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+ if (BufferIsValid(scan->rs_cbuf))
+ scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
}
Testing the performance benefits
The test setup
Let's take a look at that in an example. I have 3 virtual machines running here, one with 16, one with 17, and one with 17 without the streaming sequential scan patch just mentioned. These are running on Amazon EC2, c5.xlarge instance type, with 100 GB of io1 storage attached as their root filesystem.
Besides comparing 16 and 17, I also want to isolate that individual addition of the streaming reads API from the other performance improvements in 17.
On my first virtual machine, I'm just going to create a table with 10 million records:
postgres16=# CREATE TABLE test AS SELECT * FROM generate_series(1,10000000);
SELECT 10000000
Time: 48281.089 ms (00:48.281)
I'm going to do the same on the second one. And the same on the third one, which is 17 without that patch.
This table is about 346 MB:
postgres16=# \dt+ test
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------+-------+--------+-------------+---------------+--------+-------------
public | test | table | ubuntu | permanent | heap | 346 MB |
(1 row)
Now this system here has a bunch of memory, so this definitely fits in the OS page cache, but it won't fit into shared buffers. "shared_buffers" here on this system is just the default of 128 megabytes:
postgres16=# SHOW shared_buffers;
shared_buffers
----------------
128MB
(1 row)
Also, if you're doing a sequential scan, it will use the ring buffer, and so will not use all of shared buffers when you just read in the whole table.
And I'm going to turn off parallel query here because I don't want this to interfere with the timing information:
SET max_parallel_workers_per_gather = 0;
To keep things simple, in this test we will focus on testing the impact of Postgres reading from the OS page cache, not the underlying disk. That means any performance improvements we'll see are most likely due to reduced syscall overhead, or other internal optimizations between Postgres and the Linux kernel.
Let's look at the performance difference of running a SELECT COUNT(*) on this table.
Baseline comparison between 16 and 17
Here on 16, on the first run, we get 476 milliseconds:
postgres16=# SELECT COUNT(*) FROM test;
count
----------
10000000
(1 row)
Time: 476.639 ms
I'm going to do the same on 17:
postgres17=# SELECT COUNT(*) FROM test;
count
----------
10000000
(1 row)
Time: 422.464 ms
First run looks promising, but let's run this a few more times. Here we can see the best of 3 on 17 was 410 milliseconds:
postgres17=# SELECT COUNT(*) FROM test;
count
----------
10000000
(1 row)
Time: 410.469 ms
Now let's run a few more on 16 as well. We can clearly see there is about a 60 to 70 millisecond difference between the two versions: in this case, 17 is noticeably faster for the sequential scan.
The impact of the streaming I/O for sequential scans
Now let's isolate the change here and let's see if this was just because of this patch. So I used the latest development branch, reverted Melanie's patch, and then rebuilt my local Postgres, to look at what kind of difference that makes.
Here, the performance is about 20-30ms slower without the patch, compared to 410 ms earlier:
postgres17-without-streaming-seqscan=# SELECT COUNT(*) FROM test;
count
----------
10000000
(1 row)
Time: 425.217 ms
Now, one other thing we can do is run an EXPLAIN ANALYZE. The reason I like this is that it shows the performance difference a bit more explicitly. EXPLAIN ANALYZE always has a little bit of overhead, but you can see here that in this particular example on 17, we take about 520 ms with the patch (the actual time=0.030..521.069 part of the Seq Scan):
postgres17=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM test;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169293.16..169293.17 rows=1 width=8) (actual time=934.996..934.998 rows=1 loops=1)
Buffers: shared hit=2592 read=41696
-> Seq Scan on test (cost=0.00..144292.13 rows=10000413 width=0) (actual time=0.030..521.069 rows=10000000 loops=1)
Buffers: shared hit=2592 read=41696
Planning Time: 0.033 ms
Execution Time: 935.016 ms
(6 rows)
Time: 935.336 ms
postgres17=#
Versus on 17 without the patch, we take 575 ms (actual time=0.041..575.851):
postgres17-without-streaming-seqscan=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM test;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169308.96..169308.97 rows=1 width=8) (actual time=1007.480..1007.481 rows=1 loops=1)
Buffers: shared hit=2464 read=41824
-> Seq Scan on test (cost=0.00..144304.77 rows=10001677 width=0) (actual time=0.041..575.851 rows=10000000 loops=1)
Buffers: shared hit=2464 read=41824
Planning Time: 0.031 ms
Execution Time: 1007.500 ms
(6 rows)
As we can see, in an EXPLAIN ANALYZE measurement, there is about a 50 millisecond difference from just that patch being applied. This stays consistent across multiple runs.
Small reads are a big problem for some filesystems
Additionally, sometimes there is actual performance overhead to doing small reads.
For example, last year, around the time Thomas proposed this new patch series, there was a report from somebody on the Postgres performance mailing list who was using a zstd-compressed btrfs filesystem, which does not do any I/O merging. Clearly this can cause issues with the 8 kB reads that Postgres does today in 16, where the filesystem has to do a lot of work for each individual I/O operation.
Now on 17, instead of 8 kB reads, by default you'll get 128 kB reads in this case:
postgres=# select count(*) from t;
preadv(46, ..., 8, 256237568) = 131072
preadv(46, ..., 5, 256368640) = 131072
preadv(46, ..., 8, 256499712) = 131072
preadv(46, ..., 5, 256630784) = 131072
But you can go a step further and use the new "io_combine_limit" Postgres setting to change the size of the I/Os. If larger I/Os make a difference for your filesystem, you can, for example, say:
SET io_combine_limit = '256k';
And this will cause the read calls that Postgres issues to the OS (for users of the new streaming I/O API) to actually be up to 256 kB reads:
postgres=# set io_combine_limit = '256k';
postgres=# select count(*) from t;
preadv(47, ..., 5, 613728256) = 262144
preadv(47, ..., 5, 613990400) = 262144
preadv(47, ..., 5, 614252544) = 262144
preadv(47, ..., 5, 614514688) = 262144
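The sizes in the strace output line up with Postgres' block size. A quick sanity check, assuming the default BLCKSZ of 8192 bytes:

```python
BLCKSZ = 8192  # Postgres' default block size

# The default io_combine_limit (128 kB) merges 16 blocks per preadv() call,
# and raising it to 256 kB merges 32 blocks per call.
default_blocks = 131072 // BLCKSZ
raised_blocks = 262144 // BLCKSZ
print(default_blocks, raised_blocks)
```

So each 262144-byte preadv() in the trace above is one syscall doing the work of 32 of the old 8 kB reads.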
In conclusion
As shown, this already provides a nice benefit in 17, and of course I'm really excited about Postgres 18 and all of the asynchronous I/O work that we'll hopefully see there.
Well done, Thomas, Melanie, Nazir and everybody else who worked on this, I'm excited to see an important building block of the AIO work land in 17.
I hope you learned something new from E112 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!
What we have discussed in this episode of 5mins of Postgres
- Provide vectored variant of ReadBuffer() - Postgres commit by Thomas Munro
- Provide API for streaming relation data - Postgres commit by Thomas Munro
- "aio" directory in Postgres source
- The path to using AIO in Postgres - Talk by Andres Freund at PGConf.EU 2023 (talk recording)
- Streaming I/O, vectored I/O (WIP) - Mailing list thread on pgsql-hackers
- Use streaming I/O in pg_prewarm - Postgres commit by Thomas Munro
- Use streaming I/O in ANALYZE - Postgres commit by Thomas Munro, authored by Nazir Bilal Yavuz
- Use streaming I/O in sequential scans - Postgres commit by Thomas Munro, authored by Melanie Plageman
- Performance implications of 8K pread()s - Mailing list thread on pgsql-performance