Implement replication lag detection for automatic replica traffic management #235

Open: wants to merge 3 commits into base: main

@Copilot Copilot AI commented Jun 27, 2025

This PR implements replication lag detection for pgdog, enabling automatic traffic management based on replica lag status. When replicas fall behind, they are automatically excluded from traffic until they catch up.

Features

🔄 Automatic Lag Detection

  • Monitors replication lag by querying pg_stat_replication on the primary server
  • Compares replica flush LSN with primary's current WAL LSN using pg_current_wal_flush_lsn()
  • Calculates lag in bytes with configurable thresholds
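
A minimal sketch of the byte-lag calculation described above, not the PR's actual code. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit WAL position), so lag in bytes is the difference between two parsed positions. The function names `parse_lsn` and `lag_bytes` are hypothetical; the inputs are assumed to come from a query on the primary along the lines of `SELECT client_addr, flush_lsn, pg_current_wal_flush_lsn() FROM pg_stat_replication`.

```rust
/// Parse a PostgreSQL LSN such as "0/16B3748" into a 64-bit byte position.
/// The text form is "<high 32 bits in hex>/<low 32 bits in hex>".
fn parse_lsn(text: &str) -> Option<u64> {
    let (hi, lo) = text.trim().split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Lag in bytes between the primary's flush LSN and a replica's flush LSN.
/// Saturates at zero in case the replica's reported position is ahead of
/// the sampled primary position.
fn lag_bytes(primary_flush: u64, replica_flush: u64) -> u64 {
    primary_flush.saturating_sub(replica_flush)
}
```

For example, a primary at `0/200000` and a replica at `0/100000` are 0x100000 bytes apart, which is exactly the 1048576-byte default threshold.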

🚦 Traffic Management

  • Automatically bans replicas exceeding the lag threshold
  • Integrates seamlessly with existing ban/unban mechanism
  • Restores traffic to replicas once lag is reduced
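
The ban and restore behaviour above could be reduced to a small decision function. This is an illustrative sketch only: `LagAction`, `decide`, and the `banned_for_lag` flag are hypothetical names, and how the PR actually hooks into pgdog's ban/unban mechanism is not shown here. The flag is assumed to track lag-related bans only, so that a replica banned for another reason is not unbanned by the lag check.

```rust
/// What the lag monitor should do with a replica after one check.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum LagAction {
    Ban,
    Unban,
    Keep,
}

/// Decide based on the measured lag, the configured limit, and whether the
/// replica is currently banned because of lag.
fn decide(lag_bytes: u64, max_lag_bytes: u64, banned_for_lag: bool) -> LagAction {
    let over_limit = lag_bytes > max_lag_bytes;
    match (over_limit, banned_for_lag) {
        (true, false) => LagAction::Ban,   // newly lagging: take it out of rotation
        (false, true) => LagAction::Unban, // caught up: restore traffic
        _ => LagAction::Keep,              // no state change needed
    }
}
```

Note that a single threshold like this can flap when lag hovers around the limit; a separate, lower unban threshold would be one way to add hysteresis.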

⚙️ Configuration

[general]
# Check replication lag every 10 seconds (value in milliseconds; default)
replication_lag_check_interval = 10_000

# Ban replicas lagging by more than 1MB (value in bytes; default)
max_replication_lag_bytes = 1048576

🏗️ Implementation Details

  • Non-blocking monitoring: Runs as background task per shard
  • Robust error handling: Continues monitoring even if individual checks fail
  • PostgreSQL best practices: Uses standard replication monitoring views
  • Zero configuration: Works out-of-the-box with sensible defaults
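
To illustrate the "continues monitoring even if individual checks fail" point, here is a synchronous stand-in using only the standard library. It is a sketch under assumptions: the PR presumably runs an async background task per shard, and `monitor`, its `rounds` bound, and the closure-based `check` are hypothetical, chosen so the loop terminates and can be tested.

```rust
use std::thread;
use std::time::Duration;

/// Run `check` up to `rounds` times, sleeping `interval` between attempts.
/// A failed check is logged and counted, but never stops the loop.
/// Returns (successful checks, failed checks).
fn monitor<F>(mut check: F, interval: Duration, rounds: usize) -> (usize, usize)
where
    F: FnMut() -> Result<u64, String>,
{
    let mut ok = 0;
    let mut failed = 0;
    for round in 0..rounds {
        match check() {
            Ok(lag) => {
                ok += 1;
                println!("replication lag: {} bytes", lag);
            }
            Err(err) => {
                failed += 1;
                eprintln!("lag check failed, retrying next interval: {}", err);
            }
        }
        // Skip the final sleep so the function returns promptly.
        if round + 1 < rounds {
            thread::sleep(interval);
        }
    }
    (ok, failed)
}
```

A real monitor would loop until shutdown rather than for a fixed number of rounds, and would take its interval from `replication_lag_check_interval`.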

Example Usage

With a 3-replica setup, if replica-2 (127.0.0.1:5436 in the log below) starts lagging:

INFO  replica 127.0.0.1:5434 lag: 512KB (within limit: 1MB)
INFO  replica 127.0.0.1:5435 lag: 256KB (within limit: 1MB)  
WARN  replica 127.0.0.1:5436 is lagging by 1.5MB (max: 1MB), banning

Traffic automatically shifts to healthy replicas. Once replica-2 catches up, it's automatically re-enabled.

Testing

Added comprehensive unit tests covering:

  • Configuration validation
  • LSN parsing logic
  • Error handling scenarios
  • Integration with existing pool management

Fixes #215.



@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@Copilot Copilot AI changed the title from "[WIP] Replication lag detection" to "Implement replication lag detection for automatic replica traffic management" Jun 27, 2025
@Copilot Copilot AI requested a review from christoudias June 27, 2025 19:33
Copilot finished work on behalf of christoudias June 27, 2025 19:33
@christoudias christoudias marked this pull request as ready for review June 27, 2025 19:35
@vitabaks

@levkk please consider increasing the priority of this feature; it is extremely important for using read balancing in production.

@@ -12,6 +12,10 @@ read_write_strategy = "aggressive"
prepared_statements_limit = 500
# client_idle_timeout = 5_000

# Replication lag detection settings
# replication_lag_check_interval = 10_000 # Check every 10 seconds (default)
# max_replication_lag_bytes = 1048576 # Ban replicas lagging by more than 1MB (default)

Suggested change
# max_replication_lag_bytes = 1048576 # Ban replicas lagging by more than 1MB (default)
# max_replication_lag_bytes = 1048576 # Ban replicas lagging by more than 1MB (default)

This threshold is too low and can lead to frequent bans of the replica. I think the default should be at least 10MB, or even 100MB.

Development

Successfully merging this pull request may close these issues.

Replication lag detection
4 participants