@m3hm3t m3hm3t commented Apr 2, 2025


This PR fixes #7896, a robustness bug in Citus where canceling the coordinator-side call to citus_rebalance_wait() (via a short statement timeout, Ctrl+C, or closing the session) does not cancel the underlying background rebalance operation. As a result, the background job keeps running and leaves behind a “zombie” logical replication slot on a worker node. That active replication slot then prevents subsequent operations (for example, DROP DATABASE WITH (force)) from succeeding on the worker node.

Background

When a rebalance is initiated using citus_rebalance_start(..., shard_transfer_mode => 'force_logical'), the background job creates a temporary logical replication slot on a source worker. If the coordinator-side wait is canceled (for example, due to a statement timeout), the background job keeps running and its replication slot remains active. The database on that worker can then no longer be dropped, because PostgreSQL does not allow dropping a database that is in use by an active replication slot.
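
To make the failure mode concrete, here is a minimal psql-level sketch (not taken from the test suite). It assumes the cluster is already imbalanced so the rebalance actually moves a shard; the timeout value, port variables, and database name are illustrative:

-- Hedged reproduction sketch; assumes shards actually need to move.
SELECT citus_rebalance_start(shard_transfer_mode => 'force_logical');

SET statement_timeout = '2s';
SELECT citus_rebalance_wait();   -- ERROR: canceling statement due to statement timeout
RESET statement_timeout;

-- On the source worker, the shard-move slot can still be active ...
\c - - - :worker_1_port
SELECT slot_name, active FROM pg_replication_slots;

-- ... and an active slot blocks dropping that database, even with FORCE
-- (run from another database on the same worker; shown here for illustration).
\c postgres - - :worker_1_port
DROP DATABASE regression WITH (FORCE);   -- fails while the slot is active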

Description of the Fix

This PR modifies the internal wait function (citus_job_wait_internal) as follows:

  • PG_TRY/PG_CATCH Wrapping:
    The polling loop inside citus_job_wait_internal is now wrapped in a PG_TRY block. If the wait is canceled (e.g., by a timeout or Ctrl+C), the exception is caught in a PG_CATCH block.

  • Cancelling the Background Job:
    In the PG_CATCH block, we now call the existing cancellation function (i.e., citus_job_cancel) to mark the background rebalance job as canceled. This ensures the background worker notices the canceled state and cleans up its temporary replication slot (a short sketch of the observable effect follows this list).

  • Re-Throwing the Exception:
    The original error (e.g., "canceling statement due to statement timeout") is rethrown via PG_RE_THROW(), so that users still receive the expected error message while ensuring that the underlying job is properly canceled.
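
As an observable consequence of this cancellation path, the scheduled job should no longer stay in a running state once the wait has been canceled. A hedged way to check this from the coordinator (assuming the pg_dist_background_job catalog and its state column, as in recent Citus releases):

-- After a canceled citus_rebalance_wait():
--   without the fix, the most recent job can remain in a running state;
--   with the fix, citus_job_cancel() moves it toward a cancelled state and
--   the background worker drops its temporary replication slot.
SELECT job_id, state
FROM pg_dist_background_job
ORDER BY job_id DESC
LIMIT 1;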

Impact and Verification

  • Correctness:
    With this fix applied, if citus_rebalance_wait() is canceled, the underlying background job is also canceled. This prevents zombie replication slots from remaining on the worker nodes, thereby allowing operations like DROP DATABASE WITH (force) to succeed.

  • Verification:
    A new regression test has been added (see below) that:

    • Creates a distributed table and loads sufficient data to trigger a nontrivial rebalance.
    • Forces an imbalance by removing and then re-adding a worker node.
    • Schedules a rebalance job and waits on it with a short statement timeout, which triggers a cancellation.
    • Finally, the test reconnects to the coordinator and the worker nodes and queries pg_replication_slots; with the fix in place, no active replication slot should remain (sketched below).
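
The sketch below follows that flow; names, row counts, and ordering are illustrative rather than copied from the test. In particular, one simple way to force the imbalance is to remove a worker before any shards are created (so they all land on the remaining worker) and add it back afterwards:

-- Hedged sketch of the regression-test flow (illustrative names and sizes).
SELECT citus_remove_node('localhost', :worker_2_port);

CREATE TABLE rebalance_test (id bigint, value text);
SELECT create_distributed_table('rebalance_test', 'id');
INSERT INTO rebalance_test
SELECT i, md5(i::text) FROM generate_series(1, 200000) i;

SELECT 1 FROM citus_add_node('localhost', :worker_2_port);

-- Schedule the rebalance, then cancel the wait with a short timeout.
SELECT citus_rebalance_start(shard_transfer_mode => 'force_logical');
SET statement_timeout = '1s';
SELECT citus_rebalance_wait();   -- canceled by the statement timeout
RESET statement_timeout;

-- With the fix (after giving the background worker a moment to clean up),
-- no active shard-move slot should remain on any node.
\c - - - :worker_1_port
SELECT count(*) FROM pg_replication_slots WHERE active;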

Main Branch Behavior

On the current main branch the outcome is flaky: an active replication slot sometimes remains on a worker even after the wait is canceled. For example:

---------------------------------------------------------------------
-- Main branch: Traverse nodes and check for active replication slots.
-- 
-- Connect to the coordinator and worker nodes, then query for replication slots.
-- Expected Outcome (with the fix applied): No active replication slots.
---------------------------------------------------------------------
\c - - - :master_port
SELECT * FROM pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | inactive_since | conflicting | invalidation_reason | failover | synced
---------------------------------------------------------------------
(0 rows)

\c - - - :worker_1_port
SELECT * FROM pg_replication_slots;
           slot_name           |  plugin  | slot_type | datoid |  database  | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | inactive_since | conflicting | invalidation_reason | failover | synced
---------------------------------------------------------------------
 citus_shard_move_slot_xxxxxxx_xxxxxxx_xxxxxxx | pgoutput | logical   | 16384  | regression | f         | t      |      36896 |      |          815 | 0/678F470   | 0/678F4A8           | reserved   |               | f         |                | f         |                     | f        | f
(1 row)

\c - - - :worker_2_port
SELECT * FROM pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | inactive_since | conflicting | invalidation_reason | failover | synced
---------------------------------------------------------------------
(0 rows)

In the main branch you may still see an active replication slot on one of the workers (e.g., on worker_1), which then blocks DROP DATABASE on that node. With this fix, the cancellation should remove that zombie slot consistently.


@m3hm3t m3hm3t self-assigned this Apr 2, 2025

codecov bot commented Apr 2, 2025

Codecov Report

Attention: Patch coverage is 51.72414% with 14 lines in your changes missing coverage. Please review.

Project coverage is 89.16%. Comparing base (a7e686c) to head (1211422).
Report is 3 commits behind head on main.

❌ Your patch check has failed because the patch coverage (51.72%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7945      +/-   ##
==========================================
- Coverage   89.18%   89.16%   -0.03%     
==========================================
  Files         283      283              
  Lines       61023    61055      +32     
  Branches     7618     7626       +8     
==========================================
+ Hits        54422    54437      +15     
- Misses       4416     4436      +20     
+ Partials     2185     2182       -3     

@m3hm3t m3hm3t changed the title Implement job cancellation mechanism in background job processing #7896 Fix zombie replication slot issue by canceling the underlying rebalance job on wait cancellation Apr 8, 2025

Successfully merging this pull request may close these issues.

A bug about zombie logical replication slot that cannot be terminated after citus_rebalance_wait() with statement_timeout on version 13.0.1