@m3hm3t m3hm3t commented Apr 2, 2025


This PR fixes #7896, a robustness bug in Citus where canceling the coordinator-side call to citus_rebalance_wait() (via a short statement timeout, Ctrl+C, or closing the session) does not cancel the underlying background rebalance operation. As a result, the background job keeps running and leaves behind a “zombie” logical replication slot on a worker node. That active replication slot then prevents subsequent operations (for example, DROP DATABASE WITH (force)) from succeeding on the worker node.

Background

When a rebalance is initiated using citus_rebalance_start(..., shard_transfer_mode => 'force_logical'), the background job creates a temporary logical replication slot on a source worker. If the coordinator-side wait is canceled (for example, due to a statement timeout), the background job keeps running and its replication slot remains active. The database on that worker can then no longer be dropped, because PostgreSQL does not allow dropping a database that is in use by an active replication slot.
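
To make the failure mode concrete, here is a minimal psql-level sketch (not taken from the test suite). It assumes the cluster is already imbalanced so the rebalance actually moves a shard; the timeout value, port variables, and database name are illustrative:

-- Hedged reproduction sketch; assumes shards actually need to move.
SELECT citus_rebalance_start(shard_transfer_mode => 'force_logical');

SET statement_timeout = '2s';
SELECT citus_rebalance_wait();   -- ERROR: canceling statement due to statement timeout
RESET statement_timeout;

-- On the source worker, the shard-move slot can still be active ...
\c - - - :worker_1_port
SELECT slot_name, active FROM pg_replication_slots;

-- ... and an active slot blocks dropping that database, even with FORCE
-- (run from another database on the same worker; shown here for illustration).
\c postgres - - :worker_1_port
DROP DATABASE regression WITH (FORCE);   -- fails while the slot is active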

Description of the Fix

This PR modifies the internal wait function (citus_job_wait_internal) as follows:

  • PG_TRY/PG_CATCH Wrapping:
    The polling loop inside citus_job_wait_internal is now wrapped in a PG_TRY block. If the wait is canceled (e.g., by a timeout or Ctrl+C), the exception is caught in a PG_CATCH block.

  • Cancelling the Background Job:
    In the PG_CATCH block, we now call the existing cancellation function (i.e., citus_job_cancel) to mark the background rebalance job as canceled. This ensures the background worker notices the canceled state and cleans up its temporary replication slot (a short sketch of the observable effect follows this list).

  • Re-Throwing the Exception:
    The original error (e.g., "canceling statement due to statement timeout") is rethrown via PG_RE_THROW(), so that users still receive the expected error message while ensuring that the underlying job is properly canceled.
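
As an observable consequence of this cancellation path, the scheduled job should no longer stay in a running state once the wait has been canceled. A hedged way to check this from the coordinator (assuming the pg_dist_background_job catalog and its state column, as in recent Citus releases):

-- After a canceled citus_rebalance_wait():
--   without the fix, the most recent job can remain in a running state;
--   with the fix, citus_job_cancel() moves it toward a cancelled state and
--   the background worker drops its temporary replication slot.
SELECT job_id, state
FROM pg_dist_background_job
ORDER BY job_id DESC
LIMIT 1;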

Impact and Verification

  • Correctness:
    With this fix applied, if citus_rebalance_wait() is canceled, the underlying background job is also canceled. This prevents zombie replication slots from remaining on the worker nodes, thereby allowing operations like DROP DATABASE WITH (force) to succeed.

  • Verification:
    A new regression test has been added (see below) that:

    • Creates a distributed table and loads sufficient data to trigger a nontrivial rebalance.
    • Forces an imbalance by removing and then re-adding a worker node.
    • Schedules a rebalance job and waits on it with a short statement timeout, which triggers a cancellation.
    • Finally, the test reconnects to the coordinator and the worker nodes and queries pg_replication_slots; with the fix in place, no active replication slot should remain (sketched below).
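
The sketch below follows that flow; names, row counts, and ordering are illustrative rather than copied from the test. In particular, one simple way to force the imbalance is to remove a worker before any shards are created (so they all land on the remaining worker) and add it back afterwards:

-- Hedged sketch of the regression-test flow (illustrative names and sizes).
SELECT citus_remove_node('localhost', :worker_2_port);

CREATE TABLE rebalance_test (id bigint, value text);
SELECT create_distributed_table('rebalance_test', 'id');
INSERT INTO rebalance_test
SELECT i, md5(i::text) FROM generate_series(1, 200000) i;

SELECT 1 FROM citus_add_node('localhost', :worker_2_port);

-- Schedule the rebalance, then cancel the wait with a short timeout.
SELECT citus_rebalance_start(shard_transfer_mode => 'force_logical');
SET statement_timeout = '1s';
SELECT citus_rebalance_wait();   -- canceled by the statement timeout
RESET statement_timeout;

-- With the fix (after giving the background worker a moment to clean up),
-- no active shard-move slot should remain on any node.
\c - - - :worker_1_port
SELECT count(*) FROM pg_replication_slots WHERE active;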

Main Branch Behavior

On the current main branch the outcome is flaky: an active replication slot sometimes remains on a worker even after the wait is canceled. For example:

---------------------------------------------------------------------
-- Main branch: Traverse nodes and check for active replication slots.
-- 
-- Connect to the coordinator and worker nodes, then query for replication slots.
-- Expected Outcome (with the fix applied): No active replication slots.
---------------------------------------------------------------------
\c - - - :master_port
SELECT * FROM pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | inactive_since | conflicting | invalidation_reason | failover | synced
---------------------------------------------------------------------
(0 rows)

\c - - - :worker_1_port
SELECT * FROM pg_replication_slots;
           slot_name           |  plugin  | slot_type | datoid |  database  | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | inactive_since | conflicting | invalidation_reason | failover | synced
---------------------------------------------------------------------
 citus_shard_move_slot_xxxxxxx_xxxxxxx_xxxxxxx | pgoutput | logical   | 16384  | regression | f         | t      |      36896 |      |          815 | 0/678F470   | 0/678F4A8           | reserved   |               | f         |                | f         |                     | f        | f
(1 row)

\c - - - :worker_2_port
SELECT * FROM pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | inactive_since | conflicting | invalidation_reason | failover | synced
---------------------------------------------------------------------
(0 rows)

In the main branch you may still see an active replication slot on one of the workers (e.g., on worker_1), which then blocks DROP DATABASE on that node. With this fix, the cancellation should remove that zombie slot consistently.


@m3hm3t m3hm3t self-assigned this Apr 2, 2025

codecov bot commented Apr 2, 2025

Codecov Report

Attention: Patch coverage is 51.72414% with 14 lines in your changes missing coverage. Please review.

Project coverage is 89.16%. Comparing base (a7e686c) to head (1211422).
Report is 3 commits behind head on main.

❌ Your patch check has failed because the patch coverage (51.72%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7945      +/-   ##
==========================================
- Coverage   89.18%   89.16%   -0.03%     
==========================================
  Files         283      283              
  Lines       61023    61055      +32     
  Branches     7618     7626       +8     
==========================================
+ Hits        54422    54437      +15     
- Misses       4416     4436      +20     
+ Partials     2185     2182       -3     

@m3hm3t m3hm3t changed the title Implement job cancellation mechanism in background job processing #7896 Fix zombie replication slot issue by canceling the underlying rebalance job on wait cancellation Apr 8, 2025

Successfully merging this pull request may close these issues.

A bug about zombie logical replication slot that cannot be terminated after citus_rebalance_wait() with statement_timeout on version 13.0.1