
[FEATURE] LLM API circuit breaker for transient failure recovery #1014

@amabito

Feature Description

LLM API backends (Claude, OpenAI ChatCompletions, Response API) have no retry or circuit breaker on transient failures (500, 502, 503, 529). A single API error terminates the agent's turn with no recovery attempt. Gemini is the only backend with retry logic (BackoffConfig, 5 attempts with exponential backoff in _with_backoff_and_retry).

429 (rate limit) requires special handling -- it is not a single failure mode but three distinct cases depending on Retry-After header presence and value. See Proposed Solution below.

Motivation

When a provider returns a transient error, the error propagates through chat_agent.py as a StreamChunk(type="error") and the agent's turn ends. In multi-agent orchestration this is especially costly -- one transient API hiccup can derail a coordinated round that has already consumed significant tokens.

The MCP layer already has MCPCircuitBreaker (mcp_tools/circuit_breaker.py) with failure counting and exponential backoff. The LLM API layer has no equivalent.

Proposed Solution

Add LLMBackendCircuitBreaker following the same pattern as MCPCircuitBreaker:

  • States: CLOSED (normal), OPEN (blocking after repeated failures), HALF_OPEN (probe after backoff)
  • Scope: Per-provider, shared across agents via the same pattern as GlobalRateLimiter
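The state machine above could be sketched roughly as follows. This is a hypothetical illustration, not the final implementation: the class and method names mirror MCPCircuitBreaker's interface (`should_block()`, `record_failure()`, `record_success()`) as proposed below, and the thresholds match the config sketch.

```python
# Illustrative sketch of the proposed LLMBackendCircuitBreaker state machine.
# All names and defaults here are assumptions based on the issue text.
import enum
import time


class CBState(enum.Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # fast-fail after repeated failures
    HALF_OPEN = "half_open"  # allow a single probe after backoff


class LLMBackendCircuitBreaker:
    def __init__(self, max_failures=5, reset_time=60.0,
                 backoff_multiplier=2.0, max_backoff=300.0):
        self.max_failures = max_failures
        self.reset_time = reset_time
        self.backoff_multiplier = backoff_multiplier
        self.max_backoff = max_backoff
        self.failures = 0
        self.opened_at = 0.0
        self.open_count = 0  # consecutive OPEN trips; drives backoff growth
        self.state = CBState.CLOSED

    def _current_backoff(self):
        return min(self.reset_time * self.backoff_multiplier ** self.open_count,
                   self.max_backoff)

    def should_block(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state is CBState.OPEN:
            if now - self.opened_at >= self._current_backoff():
                self.state = CBState.HALF_OPEN  # let one probe through
                return False
            return True
        return False

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.state is CBState.HALF_OPEN or self.failures >= self.max_failures:
            if self.state is CBState.HALF_OPEN:
                self.open_count += 1  # probe failed: grow the backoff window
            self.state = CBState.OPEN
            self.opened_at = now

    def record_success(self):
        self.failures = 0
        self.open_count = 0
        self.state = CBState.CLOSED
```

For per-provider scope, one instance would be held per backend and shared across agents, analogous to GlobalRateLimiter.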

429 Classification

429 responses are classified into three patterns based on Retry-After header (credit: @SirBrenton):

| Pattern | Condition | Behavior |
|---------|-----------|----------|
| WAIT | Retry-After present and <= threshold (default 60s) | Honor the header, retry after the specified delay. Counted as a soft failure (does not increment the CB failure counter). |
| STOP | Retry-After present and > threshold, or the response indicates quota exhaustion | CB transitions to OPEN immediately. No retry -- fast-fail is the correct behavior for quota exhaustion. |
| CAP | No Retry-After header | Reduce concurrency (lower the parallel request count) and retry with reduced parallelism. |

The threshold between WAIT and STOP is configurable via retry_after_threshold_seconds.
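A minimal classifier for the three patterns might look like this. The function name `classify_429` comes from the implementation suggestions below; the header parsing and return shape are assumptions.

```python
# Hypothetical sketch of the WAIT / STOP / CAP routing for 429 responses.
import enum


class Action429(enum.Enum):
    WAIT = "wait"  # honor Retry-After; soft failure, no CB increment
    STOP = "stop"  # trip CB to OPEN; fast-fail (quota exhaustion)
    CAP = "cap"    # no Retry-After: reduce concurrency and retry


def classify_429(headers, retry_after_threshold_seconds=60.0):
    """Route a 429 response to one of three recovery strategies.

    Returns (action, delay_seconds); delay is None unless parseable.
    """
    raw = {k.lower(): v for k, v in headers.items()}.get("retry-after")
    if raw is None:
        return Action429.CAP, None
    try:
        delay = float(raw)
    except ValueError:
        # Retry-After may also be an HTTP-date; treating unparseable
        # values conservatively as STOP is an assumption here.
        return Action429.STOP, None
    if delay <= retry_after_threshold_seconds:
        return Action429.WAIT, delay
    return Action429.STOP, delay
```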

Config

```yaml
llm_circuit_breaker:
  enabled: false          # opt-in, no behavior change by default
  max_failures: 5
  reset_time_seconds: 60
  backoff_multiplier: 2.0
  max_backoff_seconds: 300
  retry_after_threshold_seconds: 60  # 429 with Retry-After above this -> STOP
  retryable_status_codes: [500, 502, 503, 529]  # 429 handled separately via classification
```
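Loading this config could be as simple as a dataclass whose fields mirror the YAML keys. The class name and `from_dict` helper below are hypothetical; defaults match the sketch above.

```python
# Hypothetical config holder mirroring the llm_circuit_breaker YAML keys.
from dataclasses import dataclass


@dataclass
class LLMCircuitBreakerConfig:
    enabled: bool = False  # opt-in, no behavior change by default
    max_failures: int = 5
    reset_time_seconds: float = 60.0
    backoff_multiplier: float = 2.0
    max_backoff_seconds: float = 300.0
    retry_after_threshold_seconds: float = 60.0
    retryable_status_codes: tuple = (500, 502, 503, 529)

    @classmethod
    def from_dict(cls, raw):
        # Ignore unknown keys so config evolution stays backward compatible.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in raw.items() if k in known})
```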

Integration point: wrap the API call in base.py or chat_agent.py with CB check + retry on retryable errors. 429 responses bypass retryable_status_codes and route through the WAIT/STOP/CAP classifier instead.
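The wrapping could look roughly like the sketch below. `call_api` and `breaker` are hypothetical stand-ins for the real base.py / chat_agent.py integration, and 429 handling is omitted since it routes through the WAIT/STOP/CAP classifier described above.

```python
# Illustrative CB check + retry wrapper around a backend call; names and
# the (status, body) return convention are assumptions for this sketch.
import time

RETRYABLE = {500, 502, 503, 529}


def call_with_retries(call_api, breaker, max_attempts=3, base_delay=1.0):
    """Retry transient failures, recording outcomes on the shared breaker."""
    for attempt in range(max_attempts):
        if breaker.should_block():
            # Fast-fail; in MassGen this would surface as a
            # StreamChunk(type="error") without hitting the provider.
            raise RuntimeError("circuit open: fast-fail")
        status, body = call_api()  # hypothetical: returns (http_status, response)
        if status < 400:
            breaker.record_success()
            return body
        if status not in RETRYABLE:
            # 429 bypasses this path and goes through the WAIT/STOP/CAP
            # classifier; other 4xx errors are not retried.
            raise RuntimeError(f"non-retryable status {status}")
        breaker.record_failure()
        time.sleep(base_delay * (2 ** attempt))  # simple exponential backoff
    raise RuntimeError(f"gave up after {max_attempts} transient failures")
```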

Alternatives Considered

  1. Per-backend retry only (no CB state machine): Simpler, but doesn't prevent retry storms when a provider is down for an extended period. CB's OPEN state provides fast-fail.
  2. Rely on provider SDK retries: OpenAI and Anthropic SDKs have built-in retries, but MassGen wraps calls in streaming generators that may not surface SDK-level retries correctly. Explicit CB gives MassGen control over the policy.
  3. Extend Gemini's BackoffConfig to all backends: Possible, but Gemini's implementation is tightly coupled to its error types. A shared CB layer is cleaner.

Use Cases

  1. Claude returns 529 (overloaded) during a 3-agent orchestration round. CB retries twice with backoff, round completes without manual intervention.
  2. OpenAI returns 429 with Retry-After: 2 during high-concurrency runs (WAIT). CB honors the 2-second delay, retries, and the request succeeds. No failure counted against the CB threshold.
  3. Claude returns 429 with Retry-After: 120 indicating quota exhaustion (STOP). CB transitions to OPEN immediately -- no retry, agents get fast-fail responses. Pairs with RoundBudgetGuardHook (MAS-237) to cap cost during degraded operation.
  4. OpenAI returns 429 with no Retry-After header (CAP). CB reduces concurrency from e.g. 4 parallel requests to 2, retries with lower parallelism.
  5. Provider outage lasting minutes. CB enters OPEN state after 5 failures, agents get fast-fail responses instead of waiting for timeouts on every call.

Implementation Suggestions

  • New file: massgen/backend/llm_circuit_breaker.py (~200 lines)
  • Modify: base.py (CB initialization, ~30 lines), chat_agent.py (retry loop in stream processing, ~50 lines)
  • Mirror MCPCircuitBreaker interface: should_block(), record_failure(), record_success()
  • Add classify_429(status_code, headers) -> WAIT | STOP | CAP to route 429 responses
  • Gemini's existing BackoffConfig should be reconciled -- either delegate to the shared CB or explicitly bypass it with a config flag
  • enable_llm_circuit_breaker: false default -- no change to existing behavior unless opted in

Phased rollout (per @ncrispino)

  1. Phase 1: Implement CB + 429 classifier for Claude backend. Write tests, validate with 429/529 scenarios.
  2. Phase 2: Apply to ChatCompletions and Response API backends.
  3. Phase 3: Reconcile Gemini's BackoffConfig with the shared CB layer.

Additional Context

  • Related: MAS-349 (evaluator cost overrun -- uncontrolled retries may contribute)
  • Related: MAS-347 (graceful timeout -- CB's fast-fail complements timeout handling)
  • Related: MAS-237 / PR #1013 (budget enforcement -- CB + budget guard form two complementary defense layers)
  • MCPCircuitBreaker in mcp_tools/circuit_breaker.py is the closest existing precedent (229 lines, same state machine pattern)

Current retry support by backend:

| Backend | 429 detection | 429 classification | Retry | Retry-After | CB |
|---------|---------------|--------------------|-------|-------------|----|
| Gemini | yes | no | yes (5x) | yes | no |
| Claude | no | no | no | no | no |
| ChatCompletions | no | no | no | no | no |
| Response API | no | no | no | no | no |
