fix(proxy): bound and pool peer route-sync so one slow replica can't stall the lifecycle path #414
…stall the lifecycle path

SyncRouteWithPeers fanned out one goroutine per peer per route change with no HTTP connection pooling and a retry budget larger than its own per-request timeout, while the caller blocked on wg.Wait(). At scale this was an outbound-connection / goroutine storm, and one slow replica added ~1s of head-of-line latency to every sandbox lifecycle operation fleet-wide.

- Share a tuned http.Transport (pooled idle conns, keep-alives) instead of the default transport's 2-idle-conns-per-host.
- Bound the fan-out with a concurrency semaphore instead of an unbounded goroutine-per-peer.
- Replace the flat, oversized backoff with a real growth factor and an inter-attempt sleep budget strictly below the per-request timeout, and a predicate that only retries transient errors (not 4xx / permanent failures).
- Cap the whole per-peer attempt+retry sequence with a context deadline so a hung replica cannot block the synchronous lifecycle path; the peer-reconcile loop remains the correctness backstop.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #414      +/-   ##
==========================================
+ Coverage   76.56%   76.60%   +0.04%
==========================================
  Files         150      150
  Lines       10783    10802      +19
==========================================
+ Hits         8256     8275      +19
- Misses       2181     2182       +1
+ Partials      346      345       -1
Hi @furykerry @AiRanthem @zmberg, opened this to fix the peer route-sync path. It only touches the route-sync mechanics: a shared pooled HTTP transport, a bounded fan-out instead of one goroutine per peer, and a retry budget smaller than the per-request timeout so one slow replica can't stall the lifecycle path. What gets synced is unchanged, so it doesn't overlap #313 or #323, and there are new unit tests in pkg/proxy with the existing suites passing. |
…ning_sandbox, unrelated to pkg/proxy change)
Fixes #413
Why
SyncRouteWithPeers starts one goroutine per peer on every route change, uses the default HTTP transport (no connection pooling), and has a retry budget larger than its own per-request timeout, all while the caller blocks on wg.Wait(). Under load this becomes a goroutine and outbound-connection storm, and a single slow replica can add roughly a second of latency to every sandbox lifecycle operation across the fleet. Full analysis in #413.
What changed
http.Transport (bounded idle/total conns per host, keep-alives) so syncs reuse connections instead of dialing fresh ones.
Behaviour is unchanged when all peers are healthy.
Testing
New unit tests in pkg/proxy/peer_sync_test.go:
- TestIsRetryablePeerError: does not retry context errors or 4xx responses (including 409), but retries 429, 5xx, and transport errors.
- TestPeerSyncBackoffBudgetUnderRequestTimeout: total inter-attempt sleep budget is strictly less than consts.RequestPeerTimeout.
- TestSyncRouteWithPeers_ConcurrencyBounded: with 200 peers, peak in-flight requests never exceed peerSyncMaxConcurrency.

go test ./pkg/proxy/... ./pkg/sandbox-manager/... passes (existing SyncRouteWithPeers and memberlist tests still green); go vet and gofmt are clean.

Not automated: the 3-replica scenario with one replica slow on POST /refresh was reasoned through but not turned into a test.
Scope
Only the pkg/proxy route-sync transport, retry, and fan-out change. What gets synced is unchanged, so there is no overlap with #313 or #323. No new framework, tooling, or CI.