Implement replication lag detection for automatic replica traffic management #235

Copilot · 2025-06-27T19:15:30Z

This PR implements replication lag detection for pgdog, enabling automatic traffic management based on replica lag status. When replicas fall behind, they are automatically excluded from traffic until they catch up.

Features

🔄 Automatic Lag Detection

Monitors replication lag by querying pg_stat_replication on the primary server
Compares replica flush LSN with primary's current WAL LSN using pg_current_wal_flush_lsn()
Calculates lag in bytes with configurable thresholds
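
The byte-lag calculation described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `parse_lsn` and `lag_bytes` are hypothetical names, and the sketch assumes both LSNs arrive in PostgreSQL's standard text form (two hexadecimal numbers separated by a slash).

```rust
/// Parse a PostgreSQL LSN in text form, e.g. "16/B374D848", into a 64-bit
/// byte position. The part before the slash holds the high 32 bits and the
/// part after it the low 32 bits, both hexadecimal.
/// (Hypothetical helper, not taken from the PR.)
fn parse_lsn(text: &str) -> Option<u64> {
    let (hi, lo) = text.trim().split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Lag in bytes between the primary's flush LSN and a replica's flush LSN.
/// Saturates at zero if the replica position is ahead of the sampled
/// primary position. Returns None if either LSN fails to parse.
fn lag_bytes(primary_lsn: &str, replica_flush_lsn: &str) -> Option<u64> {
    let primary = parse_lsn(primary_lsn)?;
    let replica = parse_lsn(replica_flush_lsn)?;
    Some(primary.saturating_sub(replica))
}
```

For example, `lag_bytes("0/3000000", "0/2F00000")` yields `Some(1048576)`, which is exactly the 1MB default threshold.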

🚦 Traffic Management

Automatically bans replicas exceeding the lag threshold
Integrates seamlessly with existing ban/unban mechanism
Restores traffic to replicas once lag is reduced
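
A rough sketch of the ban/unban decision, under the assumption that a replica is banned only when its lag is strictly greater than the threshold and is restored as soon as it drops back to or below it. `LagAction` and `decide` are hypothetical names for illustration; the existing ban mechanism in pgdog may track other ban reasons that this sketch ignores.

```rust
/// What the lag monitor would ask the pool to do for one replica.
/// (Hypothetical type, not taken from the PR.)
#[derive(Debug, PartialEq)]
enum LagAction {
    Ban,
    Unban,
    Keep,
}

/// Decide what to do with a replica given its current lag, the configured
/// maximum, and whether it is currently banned for lagging.
fn decide(lag_bytes: u64, max_lag_bytes: u64, currently_banned: bool) -> LagAction {
    let exceeds = lag_bytes > max_lag_bytes;
    match (exceeds, currently_banned) {
        (true, false) => LagAction::Ban,   // newly over the limit
        (false, true) => LagAction::Unban, // caught up again
        _ => LagAction::Keep,              // state already matches
    }
}
```

With the default 1MB limit, a replica lagging by 1.5MB that is still serving traffic gets `LagAction::Ban`, and a banned replica that has caught up to 512KB gets `LagAction::Unban`.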

⚙️ Configuration

[general]
# Check replication lag every 10 seconds (default)
replication_lag_check_interval = 10_000

# Ban replicas lagging by more than 1MB (default)  
max_replication_lag_bytes = 1048576
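
One way these two settings and their defaults might be represented in code is sketched below. The struct and field names are hypothetical, and the interval is assumed to be in milliseconds, since `10_000` is described as 10 seconds.

```rust
use std::time::Duration;

/// Replication lag settings with the defaults shown above.
/// (Hypothetical struct, not taken from the PR.)
#[derive(Debug, Clone, PartialEq)]
struct ReplicationLagConfig {
    /// How often to check lag; assumed to be configured in milliseconds.
    check_interval: Duration,
    /// Replicas lagging by more than this many bytes are banned.
    max_lag_bytes: u64,
}

impl Default for ReplicationLagConfig {
    fn default() -> Self {
        Self {
            check_interval: Duration::from_millis(10_000),
            max_lag_bytes: 1_048_576,
        }
    }
}
```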

🏗️ Implementation Details

Non-blocking monitoring: Runs as background task per shard
Robust error handling: Continues monitoring even if individual checks fail
PostgreSQL best practices: Uses standard replication monitoring views
Zero configuration: Works out-of-the-box with sensible defaults
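
The "continues monitoring even if individual checks fail" behavior can be illustrated with a small loop. This is a synchronous stand-in, not the PR's background task: `monitor` is a hypothetical function, the lag check is passed in as a closure, the number of rounds is bounded so the example terminates, and `std::thread::sleep` takes the place of whatever async timer the real task would await.

```rust
use std::thread;
use std::time::Duration;

/// Run `check` `rounds` times, sleeping `interval` between rounds.
/// A failed check is logged and counted, and the loop moves on to the
/// next round instead of stopping. Returns the number of failed checks.
fn monitor<F>(rounds: usize, interval: Duration, mut check: F) -> usize
where
    F: FnMut() -> Result<u64, String>,
{
    let mut failures = 0;
    for round in 0..rounds {
        match check() {
            Ok(lag) => println!("round {round}: lag {lag} bytes"),
            Err(err) => {
                failures += 1;
                eprintln!("round {round}: lag check failed: {err}");
            }
        }
        // No need to sleep after the final round.
        if round + 1 < rounds {
            thread::sleep(interval);
        }
    }
    failures
}
```

A check that fails once in the middle of a run does not prevent the later rounds from executing; the caller only sees the failure count.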

Example Usage

With a 3-replica setup, if one replica (127.0.0.1:5436 below) starts lagging:

INFO  replica 127.0.0.1:5434 lag: 512KB (within limit: 1MB)
INFO  replica 127.0.0.1:5435 lag: 256KB (within limit: 1MB)
WARN  replica 127.0.0.1:5436 is lagging by 1.5MB (max: 1MB), banning

Traffic automatically shifts to the healthy replicas. Once the lagging replica catches up, it is automatically re-enabled.

Testing

Added comprehensive unit tests covering:

Configuration validation
LSN parsing logic
Error handling scenarios
Integration with existing pool management

Fixes #215.


Co-authored-by: christoudias <[email protected]>

CLAassistant · 2025-06-27T19:31:20Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


vitabaks · 2025-07-10T18:12:16Z

@levkk please consider increasing the priority of this feature; it is extremely important for using read balancing in production.

vitabaks · 2025-07-10T18:22:55Z

pgdog.toml

@@ -12,6 +12,10 @@ read_write_strategy = "aggressive"
 prepared_statements_limit = 500
 # client_idle_timeout = 5_000

+# Replication lag detection settings
+# replication_lag_check_interval = 10_000  # Check every 10 seconds (default)
+# max_replication_lag_bytes = 1048576      # Ban replicas lagging by more than 1MB (default)


Suggested change

# max_replication_lag_bytes = 1048576 # Ban replicas lagging by more than 1MB (default)


This threshold is too low and can lead to frequent bans of the replica. I think the default should be at least 10MB, or even 100MB.

Initial plan

0e6c1b3

Copilot AI assigned Copilot and christoudias Jun 27, 2025

Copilot started work on behalf of christoudias June 27, 2025 19:15

Implement replication lag detection for pgdog

3d70e70

Co-authored-by: christoudias <[email protected]>

Add replication lag configuration example to pgdog.toml

823dafb

Co-authored-by: christoudias <[email protected]>

Copilot AI changed the title ~~[WIP] Replication lag detection~~ Implement replication lag detection for automatic replica traffic management Jun 27, 2025

Copilot AI requested a review from christoudias June 27, 2025 19:33

Copilot finished work on behalf of christoudias June 27, 2025 19:33

christoudias marked this pull request as ready for review June 27, 2025 19:35

vitabaks reviewed Jul 10, 2025
