Replication - High Lag

Check Frequency

Every 30 minutes

Default Configuration

Detects when the replication lag on a follower, averaged over the last hour, exceeds 100MB and creates an issue with severity "warning". Escalates to "critical" once replication lag exceeds 1024MB. Resolves once the lag falls below 100MB.
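If you want to cross-check the measured value yourself, the per-follower lag in bytes can be approximated on the primary with a query like the following (this uses Postgres 10+ function names; on 9.x the equivalents are pg_current_xlog_location and pg_xlog_location_diff):

```sql
-- Run on the primary: current replication lag per follower, in bytes
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication
 ORDER BY replay_lag_bytes DESC;
```

Note this is a point-in-time value, whereas the check averages lag over the last hour.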

Configure this on the primary server in your replication setup.

This check is enabled by default. These parameters can be tuned in the Configure section of the Alerts & Check-Up page.



Impact

Depending on your settings, sustained high replication lag may cause disk space issues on the primary, or may eventually cause the replica to permanently fail replication (unless WAL archiving is also in use and the necessary WAL files are available to the replica).
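The disk space risk on the primary usually comes from WAL being retained for a replication slot that is not advancing. One way to sketch a check for this (Postgres 10+):

```sql
-- Run on the primary: WAL retained by each replication slot, in bytes
SELECT slot_name,
       slot_type,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
  FROM pg_replication_slots
 ORDER BY retained_wal_bytes DESC;
```

A large, growing retained_wal_bytes for an inactive slot is a common cause of pg_wal filling up.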

Common Causes

  • Long-running transactions

    If the Postgres setting hot_standby_feedback is turned on, long-running transactions on the standby may prevent replication progress. You can check for long-running transactions on the standby via the Connections page. You can also tune max_standby_streaming_delay to limit how long queries on the standby are allowed to delay WAL replay.

  • Underpowered standby

    If the standby is running on a less powerful hardware configuration than the primary (especially in terms of I/O capabilities), it may not be able to keep up with replaying the replicated activity. You may need to upgrade the hardware on the standby.

  • Limited network bandwidth

    If the primary is generating writes at a rate greater than the bandwidth of the streaming replication connection, the replica may not be able to keep up due to network limitations. You may need to increase the network bandwidth. Most cloud providers limit network bandwidth based on instance size (bigger instances have more network bandwidth).

  • Logical replication issues

    If this is a logical replication slot, you may want to check Log Insights on both the publisher and the subscriber for any error messages regarding logical replication, such as schema differences, or wal_sender_timeout/wal_receiver_timeout settings that are set too low, causing the workers to quit after a while without making progress.
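For the long-running transactions cause above, transactions on the standby can also be listed directly with a query against pg_stat_activity; the 5-minute threshold below is illustrative, not a recommendation:

```sql
-- Run on the standby: transactions open longer than 5 minutes
SELECT pid,
       usename,
       state,
       now() - xact_start AS xact_age,
       left(query, 60) AS current_query
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
   AND now() - xact_start > interval '5 minutes'
 ORDER BY xact_start;
```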
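To tell an underpowered standby apart from limited network bandwidth, it can help to compare how much WAL the standby has received with how much it has replayed, and (on Postgres 10+) to look at the per-phase lag columns in pg_stat_replication:

```sql
-- Run on the standby: WAL received but not yet replayed, in bytes.
-- A large and growing backlog suggests replay (often I/O) is the bottleneck.
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_backlog_bytes;

-- Run on the primary: if write_lag is large, WAL is slow to reach the
-- standby (network); if write_lag is small but replay_lag is large, the
-- standby receives WAL quickly but replays it slowly (hardware).
SELECT application_name, write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;
```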
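For the logical replication cause, besides Log Insights, the subscriber-side worker status and the publisher-side slot position can be inspected directly (Postgres 10+):

```sql
-- Run on the subscriber: status of the logical replication apply worker(s)
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
  FROM pg_stat_subscription;

-- Run on the publisher: how far the logical slot is behind current WAL
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
  FROM pg_replication_slots
 WHERE slot_type = 'logical';

-- The timeout setting mentioned above (check wal_receiver_timeout on the
-- subscriber likewise)
SHOW wal_sender_timeout;
```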
