5mins of Postgres E44: Reducing replication lag with maintenance_io_concurrency in Postgres 15
Today, we're talking about reducing replication lag by setting the maintenance_io_concurrency setting in Postgres 15. We're also talking about true asynchronous I/O in Postgres and how to use "recovery prefetch".
Let's have a look!
In this article by Thomas Munro on the Azure Database for Postgres blog, he talks about an improvement that he made to Postgres 15 that reduces replication lag by using what's called "recovery prefetch".
When Thomas references the concept of recovery, what he means is the replica applying the replication stream to the actual data directory by looking at each WAL record.
What Thomas shows in his blog post is how he ran a pgbench and used a scale factor that doesn't fit into RAM. This is important because if things fit into RAM, then you don't need additional I/O concurrency. Really, it's just a problem of working with the working set in memory.
But, once you go outside of the working set being in memory, then you care about I/O concurrency at recovery.
When does this matter?
- If you have very large databases
- If you have full-page writes turned off, because that causes a lot of random I/O
So, in these situations the new
maintenance_io_concurrency used for recovery is going to be useful.
Now, in this test setup, Thomas uses a
pgbench with a scale factor of 1000 and 32 clients on the primary. He also has a streaming replica with
maintenance_io_concurrency either set to 0 - to behave like Postgres 14 has behaved - or set to 10, which is the Postgres 15 default, or higher values to showcase how that behaves.
Thomas has a graph in his article that is really showing the main impact of this change. On the very left side of this graph, in the first 30 seconds, you can see when the
maintenance_io_concurrency setting is set to 0. At that point, the replication lag on the replica will increase quite drastically! The reason for that is that these 32 clients on the
pgbench do a lot of writes on the primary system. This means they produce a lot of WAL. The secondary just can't keep up with that WAL because the secondary looks at one WAL record at a time, fetches the page necessary, and then applies that.
Thomas goes and increments the
maintenance_io_concurrency. I encourage you to check out his graph: You can see where the blue line starts. The green dotted line is the
maintenance_io_concurrency setting. The
maintenance_io_concurrency is set to 10: that means that Postgres will go and will advise the Linux system which pages to look for that it will need momentarily. It still reads one page at a time, but it gives Linux a heads up and says, "Hey, I'm about to need these other pages".
It still reads one page at a time, but it gives Linux a heads up and says, "Hey, I'm about to need these other pages".
The I/O concurrency follows pretty closely to the
maintenance_io_concurrency setting as a maximum. When looking at the graph, you can see that replication lag decreases when it's set to 10. Then, when it's set to 30, the replication lag goes down first and then goes away completely.
After this, Thomas steps it down again to see what the ideal
maintenance_io_concurrency setting is in his scenario, and ultimately ends on a setting of 16, showing that this essentially keeps the replication lag at a minimum.
Now, it's important to note that this is an advice based scheme. The way this works is that Postgres gives Linux a heads up and says, "I'm about to need these pages, please read them into the OS page cache for me, so that when I actually need them, they're already cached".
Really, this tries to solve recovery stalling on I/O. If you don't have this prefetching, what happens is that the I/O will take time and so as recovery happens, the recovery process will wait for the I/O to complete and then do the next one. If you have a lot of clients that do writes, then that causes a lot of I/O stalls on the secondary causing high replication lag.
If you have a lot of clients that do writes, then that causes a lot of I/O stalls on the secondary causing high replication lag.
Personally, I think this is a pretty cool improvement and this uses the
posix_fadvise call, which is not true async I/O. This is just having the Linux kernel do async I/O for us because Postgres 15 does not yet have async I/O capabilities.
What's really exciting though is that Thomas and team are working on true asynchronous I/O. I think the next two Postgres releases will see a lot of exciting improvements around async I/O. Async I/O is what we need to get much more performance out of Postgres in a typical cloud environment, because you then can increase the I/O concurrency and still be assured that Postgres does not stall.
Pretty exciting stuff! When you upgrade to Postgres 15 you can tweak the
maintenance_io_concurrency setting to reduce replication lag and get fast recovery in Postgres.
Thanks so much for joining us for today's episode of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!