Skip to content

fix(proxy): bound and pool peer route-sync so one slow replica can't stall the lifecycle path#414

Open
PRAteek-singHWY wants to merge 2 commits into
openkruise:masterfrom
PRAteek-singHWY:fix/issue-413-route-sync-pooling
Open

fix(proxy): bound and pool peer route-sync so one slow replica can't stall the lifecycle path#414
PRAteek-singHWY wants to merge 2 commits into
openkruise:masterfrom
PRAteek-singHWY:fix/issue-413-route-sync-pooling

Conversation

@PRAteek-singHWY
Copy link
Copy Markdown
Contributor

@PRAteek-singHWY PRAteek-singHWY commented May 17, 2026

Fixes #413

Why

SyncRouteWithPeers starts one goroutine per peer on every route change, uses the default HTTP transport (no connection pooling), and has a retry budget larger than its own per-request timeout, all while the caller blocks on wg.Wait(). Under load this becomes a goroutine and outbound-connection storm, and a single slow replica can add roughly a second of latency to every sandbox lifecycle operation across the fleet. Full analysis in #413.

What changed

  • Connection pooling: the peer client now uses a shared, tuned http.Transport (bounded idle/total conns per host, keep-alives) so syncs reuse connections instead of dialing fresh ones.
  • Bounded fan-out: peer requests run under a concurrency semaphore instead of an unbounded goroutine per peer.
  • Retry: the inter-attempt sleep budget is now strictly smaller than the per-request timeout, the backoff actually grows, and only transient failures are retried (network errors, 429, 5xx). 4xx and other permanent errors fail fast.
  • A context deadline caps the whole per-peer attempt sequence so one hung replica can't stall the lifecycle path. The peer-reconcile loop is still the correctness backstop.

Behaviour is unchanged when all peers are healthy.

Testing

New unit tests in pkg/proxy/peer_sync_test.go:

  • TestIsRetryablePeerError: rejects context errors and 4xx (including 409), retries 429/5xx and transport errors.
  • TestPeerSyncBackoffBudgetUnderRequestTimeout: total inter-attempt sleep budget is strictly less than consts.RequestPeerTimeout.
  • TestSyncRouteWithPeers_ConcurrencyBounded: with 200 peers, peak in-flight requests never exceed peerSyncMaxConcurrency.

go test ./pkg/proxy/... ./pkg/sandbox-manager/... passes (existing SyncRouteWithPeers and memberlist tests still green); go vet and gofmt clean.

Not automated: the 3-replica scenario with one replica slow on POST /refresh was reasoned through but not turned into a test.

Scope

Only pkg/proxy route-sync transport, retry, and fan-out change. What gets synced is unchanged, so there is no overlap with #313 or #323. No new framework, tooling, or CI.

…stall the lifecycle path

SyncRouteWithPeers fanned out one goroutine per peer per route change with
no HTTP connection pooling and a retry budget larger than its own per-request
timeout, while the caller blocked on wg.Wait(). At scale this was an
outbound-connection / goroutine storm, and one slow replica added ~1s of
head-of-line latency to every sandbox lifecycle operation fleet-wide.

- Share a tuned http.Transport (pooled idle conns, keep-alives) instead of
  the default transport's 2-idle-conns-per-host.
- Bound the fan-out with a concurrency semaphore instead of an unbounded
  goroutine-per-peer.
- Replace the flat, oversized backoff with a real growth factor and an
  inter-attempt sleep budget strictly below the per-request timeout, and a
  predicate that only retries transient errors (not 4xx / permanent failures).
- Cap the whole per-peer attempt+retry sequence with a context deadline so a
  hung replica cannot block the synchronous lifecycle path; the peer-reconcile
  loop remains the correctness backstop.
@kruise-bot kruise-bot requested review from furykerry and zmberg May 17, 2026 18:58
@kruise-bot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign furykerry for approval by writing /assign @furykerry in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link
Copy Markdown

codecov Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 90.32258% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.60%. Comparing base (524f78d) to head (bf3768f).

Files with missing lines Patch % Lines
pkg/proxy/utils.go 85.71% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #414      +/-   ##
==========================================
+ Coverage   76.56%   76.60%   +0.04%     
==========================================
  Files         150      150              
  Lines       10783    10802      +19     
==========================================
+ Hits         8256     8275      +19     
- Misses       2181     2182       +1     
+ Partials      346      345       -1     
Flag Coverage Δ
unittests 76.60% <90.32%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@PRAteek-singHWY
Copy link
Copy Markdown
Contributor Author

Hi @furykerry @AiRanthem @zmberg, opened this to fix the peer route-sync path. It only touches the route-sync mechanics: a shared pooled HTTP transport, a bounded fan-out instead of one goroutine per peer, and a retry budget smaller than the per-request timeout so one slow replica can't stall the lifecycle path. What gets synced is unchanged, so it doesn't overlap #313 or #323, and there are new unit tests in pkg/proxy with the existing suites passing.

…ning_sandbox, unrelated to pkg/proxy change)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Peer route sync does an unbounded per-peer fan-out with no connection pooling and a retry budget larger than its own timeout

2 participants