
Fix GC race on freshly created dynamic pools via RAII guard #255

Merged
vadv merged 1 commit into master from fix/dynamic-pool-init-guard
May 16, 2026


@vadv vadv commented May 16, 2026

Closes #209.

Summary

A dynamic auth_query passthrough pool was inserted into POOLS before
its first server connection existed. Between insertion and the
get_server_parameters call that establishes that connection, a GC
sweep could observe pool_state().size == 0 and remove the pool. The
next client looking it up saw No pool configured until a fresh login
rebuilt it.

The previous mitigation was a 2-second created_at grace period in
gc_idle_dynamic_pools. It worked locally but on slow CI runners — and
in prefer TLS mode where the SSLRequest round-trip widens the
initialization window — the race still triggered.

What changes for the operator

  • A pool that has just been created stays in POOLS until its first
    server connection is established. GC keeps its hands off until
    initialization is done.
  • An auth flow that fails after pool creation (PG rejected a startup
    parameter, backend unreachable, panic in the middle) leaves no
    stale pool entry behind. The next login starts from a clean slate.
  • The created_at field and its 2-second grace period are gone.

What changes internally

  • ConnectionPool carries an init_complete: Arc<AtomicBool>. Static
    pools start with true. Dynamic pools start with false.
  • create_dynamic_pool returns (ConnectionPool, PoolInitGuard). The
    guard owns that flag and the pool identifier. commit flips the
    flag to true; Drop without commit calls drop_dynamic_pool.
  • The auth flow calls init_guard.commit() right after the first
    successful get_server_parameters. Every error path returns before
    commit, so Drop cleans up.
  • gc_idle_dynamic_pools now checks init_complete instead of an
    elapsed-time window. The decision logic is extracted into a small
    pure function (should_gc_idle_pool) covered by unit tests.
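
The guard lifecycle described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Registry` type stands in for POOLS, the registry entry holds only the flag rather than a full ConnectionPool, `first_login` and its `probe` argument stand in for the auth flow and get_server_parameters, and the removal inside `Drop` stands in for drop_dynamic_pool. All of those names and signatures are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the POOLS registry: pool id -> init_complete flag.
type Registry = Arc<Mutex<HashMap<String, Arc<AtomicBool>>>>;

/// Guard handed out together with a freshly created dynamic pool.
struct PoolInitGuard {
    registry: Registry,
    pool_id: String,
    init_complete: Arc<AtomicBool>,
    committed: bool,
}

impl PoolInitGuard {
    /// Marks initialization as finished so GC may treat the pool normally.
    fn commit(mut self) {
        self.init_complete.store(true, Ordering::Relaxed);
        self.committed = true;
        // `self` is dropped here; Drop sees `committed == true` and does nothing.
    }
}

impl Drop for PoolInitGuard {
    fn drop(&mut self) {
        if !self.committed {
            // Stand-in for drop_dynamic_pool: remove the half-initialized entry.
            self.registry.lock().unwrap().remove(&self.pool_id);
        }
    }
}

/// Inserts the entry first (flag = false), then returns the guard.
fn create_dynamic_pool(registry: &Registry, pool_id: &str) -> PoolInitGuard {
    let flag = Arc::new(AtomicBool::new(false));
    registry
        .lock()
        .unwrap()
        .insert(pool_id.to_string(), Arc::clone(&flag));
    PoolInitGuard {
        registry: Arc::clone(registry),
        pool_id: pool_id.to_string(),
        init_complete: flag,
        committed: false,
    }
}

/// Hypothetical auth flow; `probe` stands in for get_server_parameters.
fn first_login(registry: &Registry, pool_id: &str, probe: Result<(), String>) -> Result<(), String> {
    let guard = create_dynamic_pool(registry, pool_id);
    probe?; // early return drops `guard`, which removes the entry
    guard.commit();
    Ok(())
}
```

Because `commit` takes the guard by value, a committed guard cannot be committed twice or cleaned up later by accident, and every `?` between creation and `commit` triggers the cleanup path without explicit error handling.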

The earlier structural attempt (089fc22, reverted in 760ac60)
moved the POOLS insertion to after the first connection. Two
concurrent first logins for the same user could then each build a
separate ConnectionPool, leaving one client wired to a bb8::Pool
that was about to be overwritten — the hang in 760ac60 traced back
to that race. The guard approach keeps the insertion order (POOLS
first, then connection) so the single-pool invariant survives, and
relies on the flag to close the GC window.
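
To illustrate why inserting into the registry before connecting preserves the single-pool invariant, here is a small sketch. It assumes the lookup and the insertion happen under one lock, which this PR does not state; `Pool`, `Pools`, and `get_or_create` are hypothetical names and do not reflect the project's actual data structures.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for ConnectionPool.
struct Pool {
    user: String,
}

// Hypothetical stand-in for POOLS.
type Pools = Mutex<HashMap<String, Arc<Pool>>>;

/// Returns the pool for `user` and whether this call created it.
/// Lookup and insert share one critical section, so two concurrent
/// first logins cannot both observe "no pool" and build separate pools.
fn get_or_create(pools: &Pools, user: &str) -> (Arc<Pool>, bool) {
    let mut map = pools.lock().unwrap();
    if let Some(existing) = map.get(user) {
        return (Arc::clone(existing), false);
    }
    let pool = Arc::new(Pool {
        user: user.to_string(),
    });
    map.insert(user.to_string(), Arc::clone(&pool));
    (pool, true)
}
```

With this ordering the slow step (the first server connection) happens after the entry is visible, so a second login finds and reuses it. The cost is that the entry exists while still unconnected, which is the window the init_complete flag covers.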

Test plan

  • cargo test --lib pool::gc::tests — 4 unit tests on the pure
    decision function. Two of them encode the regression directly:
    idle_pool_still_initializing_is_skipped and
    flipping_init_complete_makes_pool_eligible. Without this fix
    they have nothing to assert against, because the init_complete
    field does not exist.
  • cargo test --lib pool::init_guard::tests — 2 unit tests on the
    guard (already_committed_leaves_flag_true_and_does_nothing_on_drop,
    commit_flips_flag_and_blocks_drop_cleanup).
  • make test-bdd TAGS=@auth-query — 10 features, 42 scenarios,
    360 steps, all green.
  • make test-bdd TAGS=@auth-query-init-race — new scenario opens
    two concurrent first logins for two different dynamic users with
    retain_connections_time = 200ms and short idle/server
    lifetimes, then asserts that both succeed and that GC reaps the
    pools the normal way once connections drain. On CI runners where
    the historical 2 s grace period was not enough, this scenario
    would have flaked without the fix.
  • cargo fmt, cargo clippy --lib --tests -- --deny warnings,
    and the full unit-test suite are clean.

Performance

The new flag is a single Arc<AtomicBool>:

  • BASELINE: gc_idle_dynamic_pools reads each dynamic pool once per
    sweep; create_dynamic_pool runs once per first login per user.
  • BOTTLENECK: none.
  • EVIDENCE: the only new operations are one relaxed atomic load per
    pool per sweep and one relaxed atomic store at first-login success.
  • IMPACT: indistinguishable from the previous Instant::elapsed
    branch in any production workload; both are bounded by the O(N)
    GC sweep over dynamic pools.

Before: a dynamic auth_query passthrough pool was inserted into POOLS
before its first server connection existed. During the get_server_parameters
call that establishes that connection — under the SSLRequest handshake in
prefer mode this is several round-trips — a GC sweep could observe
pool_state().size == 0 and remove the pool. The next client looking it up
got "No pool configured" until a fresh login rebuilt it. The 2-second
created_at grace period mitigated the race but did not close it.

Now: ConnectionPool carries an init_complete: Arc<AtomicBool> flag, false
on dynamic pool creation and true on static pools (which never race with
GC). create_dynamic_pool returns (pool, PoolInitGuard); the guard owns
that flag. The auth flow calls guard.commit() right after the first
get_server_parameters succeeds — at that point the flag flips to true and
GC treats the pool like any other. If the auth flow fails or the guard
goes out of scope without commit, Drop calls drop_dynamic_pool so the
next login starts from a clean slate. GC checks the flag instead of an
elapsed-time grace period, so the window is closed regardless of how
long the first probe takes.
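
A flag-based GC decision of this kind might look like the sketch below. The PR names should_gc_idle_pool but does not show its signature, so the parameters here (`is_dynamic`, `size`, `init_complete`) are assumptions chosen for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Pure decision function: should an idle sweep remove this pool?
/// Hypothetical signature; only the function name comes from the PR.
fn should_gc_idle_pool(is_dynamic: bool, size: u32, init_complete: &AtomicBool) -> bool {
    // Static pools are never collected; dynamic pools are collected only
    // when they hold no connections AND initialization has been committed.
    is_dynamic && size == 0 && init_complete.load(Ordering::Relaxed)
}
```

Keeping the decision free of clocks and registry access is what makes it unit-testable: a test can build an `AtomicBool`, flip it, and check that the same empty pool changes from skipped to eligible.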

The earlier structural attempt (commit 089fc22, reverted in 760ac60)
tried to delay POOLS insertion until after the first connection. Two
concurrent first logins for the same user would each get None from
get_pool and build a separate ConnectionPool, leaving the second
client wired to a bb8 Pool that was about to be overwritten. The guard
approach keeps the insertion order (POOLS first, then connection) and
relies on the flag — no race, no double-pool.
@vadv vadv merged commit 9ca3eed into master May 16, 2026
50 checks passed
@vadv vadv deleted the fix/dynamic-pool-init-guard branch May 16, 2026 18:37

Development

Successfully merging this pull request may close these issues.

GC race: dynamic pool can be removed before first connection

1 participant