
[FEATURE] LLM API circuit breaker for transient failure recovery #1014

@amabito

Feature Description

LLM API backends (Claude, OpenAI ChatCompletions, Response API) have no retry or circuit breaker on transient failures (500, 502, 503, 529). A single API error terminates the agent's turn with no recovery attempt. Gemini is the only backend with retry logic (BackoffConfig, 5 attempts with exponential backoff in _with_backoff_and_retry).

429 (rate limit) requires special handling -- it is not a single failure mode but three distinct cases depending on Retry-After header presence and value. See Proposed Solution below.

Motivation

When a provider returns a transient error, the error propagates through chat_agent.py as a StreamChunk(type="error") and the agent's turn ends. In multi-agent orchestration this is especially costly -- one transient API hiccup can derail a coordinated round that has already consumed significant tokens.

The MCP layer already has MCPCircuitBreaker (mcp_tools/circuit_breaker.py) with failure counting and exponential backoff. The LLM API layer has no equivalent.

Proposed Solution

Add LLMBackendCircuitBreaker following the same pattern as MCPCircuitBreaker:

  • States: CLOSED (normal), OPEN (blocking after repeated failures), HALF_OPEN (probe after backoff)
  • Scope: Per-provider, shared across agents via the same pattern as GlobalRateLimiter
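The state machine above could be sketched roughly as follows. This is a hypothetical illustration, not the final implementation: the class and method names mirror MCPCircuitBreaker's interface (`should_block()`, `record_failure()`, `record_success()`) as proposed below, and the thresholds match the config sketch.

```python
# Illustrative sketch of the proposed LLMBackendCircuitBreaker state machine.
# All names and defaults here are assumptions based on the issue text.
import enum
import time


class CBState(enum.Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # fast-fail after repeated failures
    HALF_OPEN = "half_open"  # allow a single probe after backoff


class LLMBackendCircuitBreaker:
    def __init__(self, max_failures=5, reset_time=60.0,
                 backoff_multiplier=2.0, max_backoff=300.0):
        self.max_failures = max_failures
        self.reset_time = reset_time
        self.backoff_multiplier = backoff_multiplier
        self.max_backoff = max_backoff
        self.failures = 0
        self.opened_at = 0.0
        self.open_count = 0  # consecutive OPEN trips; drives backoff growth
        self.state = CBState.CLOSED

    def _current_backoff(self):
        return min(self.reset_time * self.backoff_multiplier ** self.open_count,
                   self.max_backoff)

    def should_block(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state is CBState.OPEN:
            if now - self.opened_at >= self._current_backoff():
                self.state = CBState.HALF_OPEN  # let one probe through
                return False
            return True
        return False

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.state is CBState.HALF_OPEN or self.failures >= self.max_failures:
            if self.state is CBState.HALF_OPEN:
                self.open_count += 1  # probe failed: grow the backoff window
            self.state = CBState.OPEN
            self.opened_at = now

    def record_success(self):
        self.failures = 0
        self.open_count = 0
        self.state = CBState.CLOSED
```

For per-provider scope, one instance would be held per backend and shared across agents, analogous to GlobalRateLimiter.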

429 Classification

429 responses are classified into three patterns based on Retry-After header (credit: @SirBrenton):

| Pattern | Condition | Behavior |
|---------|-----------|----------|
| WAIT | Retry-After present and <= threshold (default 60s) | Honor the header, retry after the specified delay. Counted as a soft failure (does not increment the CB failure counter). |
| STOP | Retry-After present and > threshold, or the response indicates quota exhaustion | CB transitions to OPEN immediately. No retry -- fast-fail is the correct behavior for quota exhaustion. |
| CAP | No Retry-After header | Reduce concurrency (lower the parallel request count) and retry with reduced parallelism. |

The threshold between WAIT and STOP is configurable via retry_after_threshold_seconds.
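A minimal classifier for the three patterns might look like this. The function name `classify_429` comes from the implementation suggestions below; the header parsing and return shape are assumptions.

```python
# Hypothetical sketch of the WAIT / STOP / CAP routing for 429 responses.
import enum


class Action429(enum.Enum):
    WAIT = "wait"  # honor Retry-After; soft failure, no CB increment
    STOP = "stop"  # trip CB to OPEN; fast-fail (quota exhaustion)
    CAP = "cap"    # no Retry-After: reduce concurrency and retry


def classify_429(headers, retry_after_threshold_seconds=60.0):
    """Route a 429 response to one of three recovery strategies.

    Returns (action, delay_seconds); delay is None unless parseable.
    """
    raw = {k.lower(): v for k, v in headers.items()}.get("retry-after")
    if raw is None:
        return Action429.CAP, None
    try:
        delay = float(raw)
    except ValueError:
        # Retry-After may also be an HTTP-date; treating unparseable
        # values conservatively as STOP is an assumption here.
        return Action429.STOP, None
    if delay <= retry_after_threshold_seconds:
        return Action429.WAIT, delay
    return Action429.STOP, delay
```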

Config

```yaml
llm_circuit_breaker:
  enabled: false          # opt-in, no behavior change by default
  max_failures: 5
  reset_time_seconds: 60
  backoff_multiplier: 2.0
  max_backoff_seconds: 300
  retry_after_threshold_seconds: 60  # 429 with Retry-After above this -> STOP
  retryable_status_codes: [500, 502, 503, 529]  # 429 handled separately via classification
```
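Loading this config could be as simple as a dataclass whose fields mirror the YAML keys. The class name and `from_dict` helper below are hypothetical; defaults match the sketch above.

```python
# Hypothetical config holder mirroring the llm_circuit_breaker YAML keys.
from dataclasses import dataclass


@dataclass
class LLMCircuitBreakerConfig:
    enabled: bool = False  # opt-in, no behavior change by default
    max_failures: int = 5
    reset_time_seconds: float = 60.0
    backoff_multiplier: float = 2.0
    max_backoff_seconds: float = 300.0
    retry_after_threshold_seconds: float = 60.0
    retryable_status_codes: tuple = (500, 502, 503, 529)

    @classmethod
    def from_dict(cls, raw):
        # Ignore unknown keys so config evolution stays backward compatible.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in raw.items() if k in known})
```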

Integration point: wrap the API call in base.py or chat_agent.py with CB check + retry on retryable errors. 429 responses bypass retryable_status_codes and route through the WAIT/STOP/CAP classifier instead.
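The wrapping could look roughly like the sketch below. `call_api` and `breaker` are hypothetical stand-ins for the real base.py / chat_agent.py integration, and 429 handling is omitted since it routes through the WAIT/STOP/CAP classifier described above.

```python
# Illustrative CB check + retry wrapper around a backend call; names and
# the (status, body) return convention are assumptions for this sketch.
import time

RETRYABLE = {500, 502, 503, 529}


def call_with_retries(call_api, breaker, max_attempts=3, base_delay=1.0):
    """Retry transient failures, recording outcomes on the shared breaker."""
    for attempt in range(max_attempts):
        if breaker.should_block():
            # Fast-fail; in MassGen this would surface as a
            # StreamChunk(type="error") without hitting the provider.
            raise RuntimeError("circuit open: fast-fail")
        status, body = call_api()  # hypothetical: returns (http_status, response)
        if status < 400:
            breaker.record_success()
            return body
        if status not in RETRYABLE:
            # 429 bypasses this path and goes through the WAIT/STOP/CAP
            # classifier; other 4xx errors are not retried.
            raise RuntimeError(f"non-retryable status {status}")
        breaker.record_failure()
        time.sleep(base_delay * (2 ** attempt))  # simple exponential backoff
    raise RuntimeError(f"gave up after {max_attempts} transient failures")
```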

Alternatives Considered

  1. Per-backend retry only (no CB state machine): Simpler, but doesn't prevent retry storms when a provider is down for an extended period. CB's OPEN state provides fast-fail.
  2. Rely on provider SDK retries: OpenAI and Anthropic SDKs have built-in retries, but MassGen wraps calls in streaming generators that may not surface SDK-level retries correctly. Explicit CB gives MassGen control over the policy.
  3. Extend Gemini's BackoffConfig to all backends: Possible, but Gemini's implementation is tightly coupled to its error types. A shared CB layer is cleaner.

Use Cases

  1. Claude returns 529 (overloaded) during a 3-agent orchestration round. CB retries twice with backoff, round completes without manual intervention.
  2. OpenAI returns 429 with Retry-After: 2 during high-concurrency runs (WAIT). CB honors the 2-second delay, retries, and the request succeeds. No failure counted against the CB threshold.
  3. Claude returns 429 with Retry-After: 120 indicating quota exhaustion (STOP). CB transitions to OPEN immediately -- no retry, agents get fast-fail responses. Pairs with RoundBudgetGuardHook (MAS-237) to cap cost during degraded operation.
  4. OpenAI returns 429 with no Retry-After header (CAP). CB reduces concurrency from e.g. 4 parallel requests to 2, retries with lower parallelism.
  5. Provider outage lasting minutes. CB enters OPEN state after 5 failures, agents get fast-fail responses instead of waiting for timeouts on every call.

Implementation Suggestions

  • New file: massgen/backend/llm_circuit_breaker.py (~200 lines)
  • Modify: base.py (CB initialization, ~30 lines), chat_agent.py (retry loop in stream processing, ~50 lines)
  • Mirror MCPCircuitBreaker interface: should_block(), record_failure(), record_success()
  • Add classify_429(status_code, headers) -> WAIT | STOP | CAP to route 429 responses
  • Gemini's existing BackoffConfig should be reconciled -- either delegate to the shared CB or explicitly bypass it with a config flag
  • enable_llm_circuit_breaker: false default -- no change to existing behavior unless opted in

Phased rollout (per @ncrispino)

  1. Phase 1: Implement CB + 429 classifier for Claude backend. Write tests, validate with 429/529 scenarios.
  2. Phase 2: Apply to ChatCompletions and Response API backends.
  3. Phase 3: Reconcile Gemini's BackoffConfig with the shared CB layer.

Additional Context

  • Related: MAS-349 (evaluator cost overrun -- uncontrolled retries may contribute)
  • Related: MAS-347 (graceful timeout -- CB's fast-fail complements timeout handling)
  • Related: MAS-237 / PR #1013 (budget enforcement -- CB + budget guard form two complementary defense layers)
  • MCPCircuitBreaker in mcp_tools/circuit_breaker.py is the closest existing precedent (229 lines, same state machine pattern)

Current retry support by backend:

| Backend | 429 detection | 429 classification | Retry | Retry-After | CB |
|---------|---------------|--------------------|-------|-------------|----|
| Gemini | yes | no | yes (5x) | yes | no |
| Claude | no | no | no | no | no |
| ChatCompletions | no | no | no | no | no |
| Response API | no | no | no | no | no |
