Waiting for Postgres 17: Incremental base backups
In E96 of “5mins of Postgres”, we talk about the incremental backup feature that was just committed to the Postgres 17 development branch. To be clear, this is very much off the presses and this may yet change or be removed completely.
I was very excited earlier this week when I saw that Robert Haas committed a way to have incremental backups directly in core Postgres.
When you run
pg_basebackup, you now have the option to do incremental backups. The idea is that you could essentially say: “I do a full backup every 24 hours or every 7 days, and then do incremental backups that reference that full backup” – and the incremental backups would only contain the parts of the data files that have changed.
Note that this is different from what you do with the Write Ahead Log or "WAL" in Postgres. Let's imagine we have a point in time restore, and we want to recover to a point in time, yesterday at 5:00 PM, because something bad happened right after. Of course, we want to make sure that we can recover to the known good time. I can do that today in Postgres, but what I have to do is to go back to my most recent base backup, restore it first, and then replay the WAL from my WAL archive until the point that I want to restore to.
When you're replaying WAL, that actually has to apply all the changes that have occurred on the database. Imagine you're updating the same row over and over again. You’ll essentially have a lot of duplicate information in the WAL stream.
Instead, with an incremental backup you actually have snapshots in time and you have the actual diff between the time you took the backup to the previous backup. It's much more efficient when you're doing a lot of updates because you have to do less work to get to a particular point in time, assuming you're doing more frequent backups thanks to the incremental backup feature.
Now, let's take a closer look at this.
An important component to making this work is the WAL summarizer process. This is a new process that was added, which aids incremental backups in determining which parts of the data directory have changed as compared to the last time a backup ran.
The idea is that you have a file which tells us which part of the table, or the index, have changed, and so then the incremental backup just needs to copy those parts of the files and it doesn't have to copy the full data file.
This is very efficient and whilst you could try to do something like this with other utilities, the benefit here is that this is already happening as Postgres is running, and so it's just saving additional metadata that then can be used by the incremental backup process. This is an optional feature, so we actually need to enable this flag to turn on the WAL summarizer.
The other thing I found interesting was to learn about how much of a difference this could make? Why do we need incremental backups in the first place?
In this mailing list thread on the Postgres-hackers mailing list, Jakub from EDB ran a test. This is a
pgbench test. The idea is that the data size doesn't really change much throughout this test. This is a 24 hour long test. At the start the database is 3.3GB. At the end, the database is 4.3GB. Then, as it's running, it's continuously running
pgbench workloads. In those 24 hours, if you looked at the WAL archive, there were 77 GB of WAL produced.
That's a lot of WAL to replay if you wanted to restore to a particular point in time within that timeframe!
Jakub ran one full backup in the beginning and then incremental backups every two hours. The full backup in the beginning is 3.4 GB, but then all the 11 other backups are 3.5 in total, they're essentially one 10th of a full backup size.
What Jakub tested then was the restore to a particular point in time. Previously, to restore to a particular point in time would take more than an hour to replay the WAL versus in this case because we have more frequent, incremental backups, it's going to be much, much faster to restore. In this particular test case 78 minutes compared to 4 minutes. This is a more than a 10 times improvement in recovery time. Of course you won't necessarily always see this amount of benefit, but I think this shows why you might want to do this. It is because you want to enable more frequent backups and incremental backups are the way to do that.
Let's try this out. I have, locally, a running Postgres 17 development version. Do not trust this for any relevant data, right? This is very much heavy in development, but let's take a look at how this works.
Before we get started, we need to turn on the WAL summarizer process. I'm going to do
ALTER SYSTEM SET summarize_wal = on, and then I'm going to do a
SELECT pg_reload_conf(). This luckily is a reloadable setting. If you want to turn this on after the fact, you don't have to restart your database server, which is nice.
Now WAL summaries are going to be written. The idea is that each time you're running a checkpoint, that you will get another summary file here.
We want to do a base backup first, so we'll do pg_basebackup and we'll point this to the "backup1" directory. And then to speed this up, we'll actually do a CHECKPOINT because otherwise it does a slow checkpoint. In this case for testing, we want to do that fast. Now we have a "backup1" directory, 350 MB in size.
Let's make an incremental backup and let's pass the backup manifest of that previous backup.
Our incremental backup in this case is only 48 MB.
If I do "backup3", I can pass my "backup2" backup manifest, the incremental backups can rely on each other, and then if I take a look at that, that's even smaller than the second one.
Now, what I need to do to be able to restore to the point of that "backup3", I now need to use
pg_combinebackup. I combine "backup1", "backup2", "backup3" to "backup3_full". Perfect. And so now if we take a look at "backup3_full", we should see that that one has the full size of a full backup.
And then let's start up another server. If we look at the log file, we'll see that that one came up successfully and we can see that this the data as of a few minutes ago. Now we've successfully taken incremental backups, we've combined them into one working backup, and then we restored that server. This was obviously a lot faster than if I had to replay the WAL.
Overall, this feature will save a lot of time and pain waiting for WAL replay because you can now do more frequent incremental backups and it's great to see this is part of Postgres core so that people can just do
pg_basebackup and then use the incremental mode to have incremental backups work.
Very exciting. I hope this stays in Postgres 17, maybe it gets even further improvements, and so then next year, September, October timeframe, we can actually use this in production. And before we close: A quick logistical note, we're going to take a break over the holidays, and we'll be back with 5mins of Postgres on January 11th.