Replication - High Lag
Check Frequency
Every 30 minutes
Default Configuration
Detects when the replication lag on a follower, averaged over the last hour, exceeds 100MB and creates an issue with severity "warning". Escalates to "critical" once replication lag exceeds 1024MB. Resolves once the lag falls below 100MB.
Configure this on the primary server in your replication setup.
This check is enabled by default. These parameters can be tuned in the Configure section of the Alerts & Check-Up page.
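The threshold logic described above can be sketched in a few lines of Python. This is an illustration only, not the actual check implementation: the function name `classify_lag` and its parameters are hypothetical, and it assumes that the "critical" threshold is applied to the same one-hour average as the "warning" threshold, which the description does not state explicitly.

```python
from statistics import mean

# Default thresholds taken from the description above, in megabytes.
WARNING_MB = 100
CRITICAL_MB = 1024


def classify_lag(samples_mb, warning_mb=WARNING_MB, critical_mb=CRITICAL_MB):
    """Classify replication lag samples (in MB) collected over the last hour.

    Returns "critical", "warning", or None when no issue would be open.
    Assumption: both thresholds are compared against the hourly average.
    """
    if not samples_mb:
        return None  # no data, nothing to report
    average = mean(samples_mb)
    if average > critical_mb:
        return "critical"
    if average > warning_mb:
        return "warning"
    return None
```

For example, `classify_lag([150, 250])` averages to 200MB and would map to "warning", while a follower that stays well under 100MB returns `None`.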
Guidance
Impact
Depending on your settings, this may cause disk space issues on the primary, or may eventually cause the replica to permanently fail replication (unless WAL archiving is also in use and the necessary WAL files are available on the replica).
Common Causes
Long-running transactions
If the Postgres setting hot_standby_feedback is turned on, long-running transactions on the standby may prevent replication progress. Check for any long-running transactions on the standby on the Connections page. You can also tune max_standby_streaming_delay to limit how long queries on the standby can delay the replay of replicated changes.
Underpowered standby
If the standby is running on a less powerful hardware configuration than the primary (especially in terms of I/O capabilities), it may not be able to keep up with replaying the replicated activity. You may need to upgrade the hardware on the standby.
Limited network bandwidth
If the primary is generating writes at a rate greater than the bandwidth of the streaming replication connection, the replica may not be able to keep up due to network limitations. You may need to increase the network bandwidth. Most cloud providers limit network bandwidth based on instance size (bigger instances have more network bandwidth).
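To judge whether bandwidth is the bottleneck, one rough approach is to compare the primary's WAL generation rate with the available link capacity. The sketch below assumes you have taken two WAL position samples on the primary (for example with `pg_current_wal_lsn()`) a known number of seconds apart. The helper names are hypothetical, and the comparison ignores protocol overhead and other traffic on the link, so treat the result as an estimate.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and
    the low 32 bits of a 64-bit WAL position, separated by a slash.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_rate_bytes_per_s(lsn_start: str, lsn_end: str, seconds: float) -> float:
    """Average WAL generation rate between two samples, in bytes per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (parse_lsn(lsn_end) - parse_lsn(lsn_start)) / seconds


def network_can_keep_up(rate_bytes_per_s: float, bandwidth_bits_per_s: float) -> bool:
    """Rough check: is the WAL rate below the link capacity?

    Assumption: the whole link is available to streaming replication.
    """
    return rate_bytes_per_s * 8 < bandwidth_bits_per_s
```

If the measured rate regularly approaches or exceeds the link capacity during write-heavy periods, the replica is likely to fall behind no matter how fast it can replay.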
Logical replication issues
If this is a logical replication slot, check Log Insights on both the publisher and the subscriber for error messages regarding logical replication. Typical causes include schema differences between the two sides, or wal_sender_timeout/wal_receiver_timeout settings that are too low, which cause the workers to quit after a while without making progress.