## Feature Description
LLM API backends (Claude, OpenAI ChatCompletions, Response API) have no retry or circuit breaker for transient failures (500, 502, 503, 529). A single API error terminates the agent's turn with no recovery attempt. Gemini is the only backend with retry logic (`BackoffConfig`, 5 attempts with exponential backoff in `_with_backoff_and_retry`).

429 (rate limit) requires special handling -- it is not a single failure mode but three distinct cases depending on the presence and value of the `Retry-After` header. See Proposed Solution below.
## Motivation
When a provider returns a transient error, the error propagates through `chat_agent.py` as a `StreamChunk(type="error")` and the agent's turn ends. In multi-agent orchestration this is especially costly -- one transient API hiccup can derail a coordinated round that has already consumed significant tokens.

The MCP layer already has `MCPCircuitBreaker` (`mcp_tools/circuit_breaker.py`) with failure counting and exponential backoff. The LLM API layer has no equivalent.
## Proposed Solution
Add `LLMBackendCircuitBreaker` following the same pattern as `MCPCircuitBreaker`:
- States: CLOSED (normal), OPEN (blocking after repeated failures), HALF_OPEN (probe after backoff)
- Scope: per-provider, shared across agents via the same pattern as `GlobalRateLimiter`
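A minimal sketch of the proposed state machine, assuming the config knobs listed below (`max_failures`, `reset_time_seconds`, `backoff_multiplier`, `max_backoff_seconds`); method names mirror `MCPCircuitBreaker`, but the internals here are illustrative, not the actual implementation:

```python
import time
from enum import Enum


class CBState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # fast-fail, no provider calls
    HALF_OPEN = "half_open"  # one probe allowed after backoff


class LLMBackendCircuitBreaker:
    """Per-provider circuit breaker sketch (hypothetical internals)."""

    def __init__(self, max_failures=5, reset_time_seconds=60,
                 backoff_multiplier=2.0, max_backoff_seconds=300):
        self.max_failures = max_failures
        self.reset_time = reset_time_seconds
        self.backoff_multiplier = backoff_multiplier
        self.max_backoff = max_backoff_seconds
        self.failures = 0       # consecutive failures while CLOSED
        self.open_count = 0     # consecutive trips to OPEN, drives backoff growth
        self.opened_at = 0.0
        self.state = CBState.CLOSED

    def _current_backoff(self):
        # exponential backoff per consecutive OPEN trip, capped
        return min(self.reset_time * self.backoff_multiplier ** self.open_count,
                   self.max_backoff)

    def should_block(self):
        if self.state is CBState.OPEN:
            if time.monotonic() - self.opened_at >= self._current_backoff():
                self.state = CBState.HALF_OPEN  # allow one probe request
                return False
            return True
        return False

    def record_failure(self):
        self.failures += 1
        # a failed probe, or too many consecutive failures, trips the breaker
        if self.state is CBState.HALF_OPEN or self.failures >= self.max_failures:
            self.state = CBState.OPEN
            self.opened_at = time.monotonic()
            self.open_count += 1
            self.failures = 0

    def record_success(self):
        self.failures = 0
        self.open_count = 0
        self.state = CBState.CLOSED
```

A successful probe in HALF_OPEN fully closes the breaker; a failed probe re-opens it with a longer backoff window.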
### 429 Classification
429 responses are classified into three patterns based on the `Retry-After` header (credit: @SirBrenton):
| Pattern | Condition | Behavior |
|---|---|---|
| WAIT | `Retry-After` present and <= threshold (default 60s) | Honor the header and retry after the specified delay. Counts as a soft failure (does not increment the CB failure counter). |
| STOP | `Retry-After` present and > threshold, or response indicates quota exhaustion | CB transitions to OPEN immediately. No retry -- fast-fail is the correct behavior for quota exhaustion. |
| CAP | No `Retry-After` header | Reduce concurrency (lower the parallel request count) and retry with reduced parallelism. |
The threshold between WAIT and STOP is configurable via `retry_after_threshold_seconds`.
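The three-way split above can be sketched as a small classifier. This is a hypothetical shape (`Pattern429`, tuple return, plain-dict headers are assumptions, not MassGen's API); the handling of the HTTP-date form of `Retry-After` as STOP is likewise an assumption, chosen conservatively:

```python
from enum import Enum


class Pattern429(Enum):
    WAIT = "wait"  # honor Retry-After, soft failure
    STOP = "stop"  # open the circuit immediately
    CAP = "cap"    # no header: reduce parallelism and retry


def classify_429(headers, retry_after_threshold_seconds=60):
    """Classify a 429 response; returns (pattern, delay_seconds_or_None)."""
    retry_after = headers.get("Retry-After") or headers.get("retry-after")
    if retry_after is None:
        return Pattern429.CAP, None
    try:
        delay = float(retry_after)
    except ValueError:
        # Retry-After may also be an HTTP-date; treating that as STOP
        # is an assumption made for this sketch
        return Pattern429.STOP, None
    if delay <= retry_after_threshold_seconds:
        return Pattern429.WAIT, delay
    return Pattern429.STOP, delay
```

With the default threshold, `Retry-After: 2` maps to WAIT and `Retry-After: 120` maps to STOP, matching the Use Cases below.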
## Config

```yaml
llm_circuit_breaker:
  enabled: false                      # opt-in, no behavior change by default
  max_failures: 5
  reset_time_seconds: 60
  backoff_multiplier: 2.0
  max_backoff_seconds: 300
  retry_after_threshold_seconds: 60   # 429 with Retry-After above this -> STOP
  retryable_status_codes: [500, 502, 503, 529]  # 429 handled separately via classification
```

Integration point: wrap the API call in `base.py` or `chat_agent.py` with a CB check plus retry on retryable errors. 429 responses bypass `retryable_status_codes` and route through the WAIT/STOP/CAP classifier instead.
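The integration point could look roughly like the wrapper below. Everything here is a sketch: `call_with_cb`, `TransientAPIError`, and the parameter names are placeholders, and the CB is only assumed to expose the `should_block()` / `record_failure()` / `record_success()` interface; the real wiring into the streaming generators would differ.

```python
import asyncio
import random

RETRYABLE = {500, 502, 503, 529}  # mirrors retryable_status_codes


class TransientAPIError(Exception):
    """Placeholder for a provider error carrying an HTTP status code."""
    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


async def call_with_cb(cb, api_call, max_attempts=5,
                       base_delay=1.0, multiplier=2.0, max_delay=300.0):
    """Wrap one backend API call with a CB check and backoff retry (sketch)."""
    delay = base_delay
    for attempt in range(max_attempts):
        if cb.should_block():
            # OPEN circuit: fast-fail without touching the provider
            raise RuntimeError("circuit open: fast-fail")
        try:
            result = await api_call()
        except TransientAPIError as e:
            if e.status_code not in RETRYABLE:
                raise  # non-retryable (429 would route to the classifier instead)
            cb.record_failure()
            if attempt == max_attempts - 1:
                raise
            # exponential backoff with jitter to avoid synchronized retries
            await asyncio.sleep(min(delay, max_delay) * (0.5 + random.random()))
            delay *= multiplier
        else:
            cb.record_success()
            return result
```

Note that 429 is deliberately absent from `RETRYABLE`: in this sketch it would be caught separately and routed through the WAIT/STOP/CAP classifier before any retry decision.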
## Alternatives Considered
- Per-backend retry only (no CB state machine): simpler, but does not prevent retry storms when a provider is down for an extended period. The CB's OPEN state provides fast-fail.
- Rely on provider SDK retries: the OpenAI and Anthropic SDKs have built-in retries, but MassGen wraps calls in streaming generators that may not surface SDK-level retries correctly. An explicit CB gives MassGen control over the policy.
- Extend Gemini's `BackoffConfig` to all backends: possible, but Gemini's implementation is tightly coupled to its error types. A shared CB layer is cleaner.
## Use Cases
- Claude returns 529 (overloaded) during a 3-agent orchestration round. The CB retries twice with backoff, and the round completes without manual intervention.
- OpenAI returns 429 with `Retry-After: 2` during high-concurrency runs (WAIT). The CB honors the 2-second delay, retries, and the request succeeds. No failure is counted against the CB threshold.
- Claude returns 429 with `Retry-After: 120`, indicating quota exhaustion (STOP). The CB transitions to OPEN immediately -- no retry; agents get fast-fail responses. Pairs with `RoundBudgetGuardHook` (MAS-237) to cap cost during degraded operation.
- OpenAI returns 429 with no `Retry-After` header (CAP). The CB reduces concurrency from e.g. 4 parallel requests to 2 and retries with lower parallelism.
- A provider outage lasts minutes. The CB enters the OPEN state after 5 failures, and agents get fast-fail responses instead of waiting for timeouts on every call.
## Implementation Suggestions
- New file: `massgen/backend/llm_circuit_breaker.py` (~200 lines)
- Modify: `base.py` (CB initialization, ~30 lines), `chat_agent.py` (retry loop in stream processing, ~50 lines)
- Mirror the `MCPCircuitBreaker` interface: `should_block()`, `record_failure()`, `record_success()`
- Add `classify_429(status_code, headers) -> WAIT | STOP | CAP` to route 429 responses
- Gemini's existing `BackoffConfig` should be reconciled -- either delegate to the shared CB or explicitly bypass it with a config flag
- `enable_llm_circuit_breaker: false` default -- no change to existing behavior unless opted in
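For the CAP path, the "reduce concurrency" behavior could be as simple as an adaptive parallelism cap. The class name, the halving policy, and the one-slot-per-success recovery below are all assumptions for illustration, not a prescribed design:

```python
class AdaptiveConcurrency:
    """Adaptive parallel-request cap for CAP-classified 429s (sketch)."""

    def __init__(self, max_parallel=4, floor=1):
        self.max_parallel = max_parallel
        self.floor = floor
        self.limit = max_parallel  # current allowed parallel requests

    def on_cap(self):
        # 429 with no Retry-After: halve parallelism, never below the floor
        self.limit = max(self.floor, self.limit // 2)

    def on_success(self):
        # restore one slot per successful round, up to the configured max
        self.limit = min(self.max_parallel, self.limit + 1)
```

The orchestrator would consult `limit` when fanning out parallel requests; a resizable semaphore would serve the same role in an async implementation.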
### Phased rollout (per @ncrispino)
- Phase 1: Implement CB + 429 classifier for Claude backend. Write tests, validate with 429/529 scenarios.
- Phase 2: Apply to ChatCompletions and Response API backends.
- Phase 3: Reconcile Gemini's `BackoffConfig` with the shared CB layer.
## Additional Context
- Related: MAS-349 (evaluator cost overrun -- uncontrolled retries may contribute)
- Related: MAS-347 (graceful timeout -- CB's fast-fail complements timeout handling)
- Related: MAS-237 / PR #1013 (budget enforcement -- CB + budget guard form two complementary defense layers)
- `MCPCircuitBreaker` in `mcp_tools/circuit_breaker.py` is the closest existing precedent (229 lines, same state machine pattern)
Current retry support by backend:
| Backend | 429 detection | 429 classification | Retry | Retry-After | CB |
|---|---|---|---|---|---|
| Gemini | yes | no | yes (5x) | yes | no |
| Claude | no | no | no | no | no |
| ChatCompletions | no | no | no | no | no |
| Response API | no | no | no | no | no |