Postgres 16: Logical decoding on standbys
In E66 of “5mins of Postgres” we're talking about logical decoding on a standby in Postgres 16, which allows us to keep following the logical replication stream whilst a failover happens. We’ll also touch upon the new pg_log_standby_snapshot function.
Let's jump right in!
This is a blog post by Bertrand from the AWS Postgres Contributors team. In this blog post, he describes a feature that he co-authored in Postgres 16.
Before we jump into this I'm going to point out that this feature has been worked on for a long time. Back in 2018, Andres Freund started a thread about minimal logical decoding on standby, which aimed to achieve what now in Postgres 16 is reality.
Now, shortly before the feature freeze for Postgres 16, this feature was pushed. "Pushed" means this is in Postgres 16 as of right now, but because Postgres 16 still will undergo multiple months of testing, there is of course a small risk that this won't make it, but I think chances are good.
What Bertrand describes in his blog post is how this new feature works and how you can benefit from it.
First of all, before this commit, when you tried to create a logical replication slot on a standby server in Postgres, Postgres would actually give you an error and say
logical decoding can not be used while in recovery.
Logical decoding, just to be clear, is what's used underneath the hood of logical replication and what also often powers other processes that extract data from Postgres, for example to use in the application, or when you're doing transfers to other systems.
With the change in Postgres 16, when you're on a standby, you can actually create a logical replication slot. There are a few mechanics around this that make this work slightly differently than if you were creating a logical replication slot on the primary.
Most importantly, you might be waiting a little bit longer for the slot to actually be created. To speed that up, you can run the new
pg_log_standby_snapshot function on the primary. What this does behind the scenes is that it produces a special WAL record that is required on the secondary. There's a whole other story around this if you really want to dive deep.
But the point is, if you're seeing yourself wait a long time on the logical replication slot to be created, call this function to speed it up on the primary.
If you're seeing yourself wait a long time on the logical replication slot to be created, call this function to speed it up on the primary.
Let's take a look at Betrand’s example: We have a table called
bdt that just has a single column. We can see the logical decoding message
INSERT, we can see the data that was being inserted. This is fairly standard, but what is special about this is that you're running all of this whilst you're on a standby.
Now, the really awesome thing about this is that you can use this whilst the standby gets promoted. Let's say you run the
pg_recvlogical command, which outputs the replication stream in a separate process. Now, let's say you're promoting the standby to be the new primary. And what's amazing is that you could actually observe that the
pg_recvlogical process has not been interrupted. It can keep following the logical replication stream whilst the failover happens.
Second, in the newly promoted primary, we can now insert data and you can see that the
pg_recvlogical process that was reading the replication stream seamlessly got that data.
You can imagine that if they're subscribers, they can keep having an open connection to the standby and if the standby gets promoted, they can seamlessly follow without having to worry about reset connections or making any other changes.
There's one more thing to note here: This doesn't just apply to decoding, but also to logical replication! You create a table, and then it's important to note that you would have to create the publication on the primary because publications in Postgres are catalog objects, you couldn't create those on the secondary. Then you create a subscription, but instead of getting an error, you will now simply get
created replication slot – and it will just work!
I think this really enables a lot of new things in Postgres because previously logical decoding could become a bottleneck. You can imagine everything needing to happen on the primary means that actually reading the WAL for logical decoding and all that work had to happen on the primary, puts a limit to how much decoding you can do. Also, just the fact that this can seamlessly follow failovers will enable more resilient architectures that rely on logical replication or logical decoding throughout how they're built.
There are a few things to understand in certain conditions. Most importantly, if you're running
hot_standby_feedback off, this will actually be a situation where there is a higher chance of certain conflict scenarios where ultimately the replication slots on the standby would be invalidated.
Just be aware that in some cases you might see slots being invalidated. It’s important to be able to check for that and to actually respond to that.
All in all, a really important building block for more sophisticated architectures that rely on Postgres, and I look forward to using it in Postgres 16.