Add LLM API retry engine for transient failures #211
nhicks00 wants to merge 2 commits into mpfaffenberger:main
Conversation
Add llm_run_with_retry() that wraps pydantic_agent.run() calls with retry logic for transient API failures (429, 529, 5xx, network errors, streaming failures).

Key features:
- Exponential backoff with 25% jitter, 500ms base, 32s cap
- Retry-After header takes absolute priority
- 529 consecutive overload detection (short-circuits after 3)
- Abort-aware sleep via asyncio.Event for user cancellation
- Streaming error detection (handles pydantic_ai stream failures)
- Full __cause__ chain walker for wrapped network errors
- api_retry_start / api_retry_end callback hooks
- PUPPY_MAX_LLM_RETRIES env var override (default 10)

Walmart fork sync notes:
- callbacks.py: additive merge — keep run_shell_command_output, add api_retry_start/api_retry_end
- base_agent.py: delete _run_with_streaming_retry() entirely, replace with llm_run_with_retry(). Rename event_stream_handler -> stream_handler to match fork naming.
- llm_retry.py and tests/test_llm_retry.py are clean file adds

Resolves: mpfaffenberger#210

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

Adds a new LLM API retry engine and integrates it into agent execution.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller as Agent Caller
    participant RetryWrapper as llm_run_with_retry
    participant AgentRun as pydantic_agent.run
    participant API as LLM API
    participant Callbacks as Callbacks

    Caller->>RetryWrapper: llm_run_with_retry(coro_factory, config)
    loop Retry Loop (attempts ≤ max_retries)
        RetryWrapper->>AgentRun: Invoke coro_factory() (Attempt N)
        AgentRun->>API: LLM request
        alt Success
            API-->>AgentRun: Response
            AgentRun-->>RetryWrapper: Result
            RetryWrapper->>Callbacks: api_retry_end(total_attempts)
            RetryWrapper-->>Caller: Return result
        else Retryable Error
            API-->>AgentRun: Transient error (429/5xx/timeout/overload)
            AgentRun-->>RetryWrapper: Exception
            RetryWrapper->>Callbacks: api_retry_start(error, attempt, delay_ms, max_retries)
            RetryWrapper->>RetryWrapper: Compute delay (Retry-After or exp+jit)
            alt Cancel Event Set
                RetryWrapper-->>Caller: Propagate CancelledError
            else Sleep then retry
                RetryWrapper->>RetryWrapper: Sleep(delay_ms) (cancellable)
            end
        else Non-retryable / Exhausted
            API-->>AgentRun: Fatal error or retries exhausted
            AgentRun-->>RetryWrapper: Exception
            RetryWrapper-->>Caller: Raise RetryExhaustedError (with original)
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (2)
code_puppy/agents/base_agent.py (1)
1909-1910: Remove duplicate `_toolsets` assignment. Line 1910 repeats the same assignment from Line 1909.
✂️ Tiny cleanup

```diff
 original_toolsets = pydantic_agent._toolsets
 pydantic_agent._toolsets = original_toolsets + self._mcp_servers
-pydantic_agent._toolsets = original_toolsets + self._mcp_servers
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@code_puppy/agents/base_agent.py` around lines 1909-1910: Remove the duplicate assignment to pydantic_agent._toolsets: there are two identical lines setting pydantic_agent._toolsets = original_toolsets + self._mcp_servers; keep a single assignment and delete the redundant one so only one update to pydantic_agent._toolsets remains (refer to pydantic_agent, _toolsets, original_toolsets, and self._mcp_servers to locate the lines).

tests/test_llm_retry.py (1)
385-399: `test_retry_after_header_respected` does not currently prove header precedence. Using retry-after: "0" plus no `_compute_backoff` argument assertion allows this test to pass even if header extraction regresses.

✅ Make the test assert the actual contract
```diff
 async def test_retry_after_header_respected(self):
     """Retry-After header value flows into backoff calculation."""
     call_count = 0

     async def factory():
         nonlocal call_count
         call_count += 1
         if call_count == 1:
-            raise _make_api_error(429, headers={"retry-after": "0"})
+            raise _make_api_error(429, headers={"retry-after": "1.5"})
         return "ok"

-    # We want to verify the header was extracted, not that we waited
-    result = await llm_run_with_retry(factory, config=LLMRetryConfig(max_retries=3))
+    with patch("code_puppy.llm_retry._compute_backoff", return_value=0.001) as backoff:
+        result = await llm_run_with_retry(
+            factory, config=LLMRetryConfig(max_retries=3)
+        )
+    backoff.assert_called_once_with(1, 1.5)
     assert result == "ok"
     assert call_count == 2
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_llm_retry.py` around lines 385 - 399, Update test_retry_after_header_respected to assert header precedence by injecting a custom _compute_backoff that would return a different backoff than the Retry-After header so the test fails if header extraction regresses: have the factory raise a 429 with a non-zero "retry-after" value, call llm_run_with_retry(factory, config=LLMRetryConfig(max_retries=3), _compute_backoff=<custom_callable>) where <custom_callable> returns a sentinel backoff, and assert the retry used the header-derived value (e.g., by observing call_count==2 and/or by verifying that the computed backoff came from the header rather than the injected _compute_backoff). Ensure you reference test_retry_after_header_respected, llm_run_with_retry, _compute_backoff, and LLMRetryConfig when making the change.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- code_puppy/agents/base_agent.py
- code_puppy/callbacks.py
- code_puppy/llm_retry.py
- tests/test_llm_retry.py
```python
@dataclass
class LLMRetryConfig:
    """Configuration for the LLM retry engine."""

    max_retries: int = field(default_factory=_resolve_max_retries)
    cancel_event: Optional[asyncio.Event] = None
```
Consecutive-529 short-circuit cannot trigger fallback behavior.
Line 312 always raises RetryExhaustedError, and LLMRetryConfig (Line 55-Line 56) has no fallback configuration, so overload-threshold fallback cannot be enabled.
🧩 Suggested API extension

```diff
 @dataclass
 class LLMRetryConfig:
     """Configuration for the LLM retry engine."""

     max_retries: int = field(default_factory=_resolve_max_retries)
     cancel_event: Optional[asyncio.Event] = None
+    fallback_model: Optional[str] = None
+    max_consecutive_overloads: int = _MAX_CONSECUTIVE_OVERLOADS
+
+
+class FallbackTriggeredError(Exception):
+    def __init__(self, fallback_model: str, original_error: Exception):
+        super().__init__(f"Switch to fallback model: {fallback_model}")
+        self.fallback_model = fallback_model
+        self.original_error = original_error
@@
-        if overload_hits >= _MAX_CONSECUTIVE_OVERLOADS:
+        if overload_hits >= config.max_consecutive_overloads:
+            if config.fallback_model:
+                raise FallbackTriggeredError(
+                    config.fallback_model, error
+                ) from error
             raise RetryExhaustedError(
-                f"API returned {_MAX_CONSECUTIVE_OVERLOADS} consecutive "
+                f"API returned {config.max_consecutive_overloads} consecutive "
                 f"overloaded errors",
                 error,
             ) from error
```

Also applies to: 307-317
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@code_puppy/llm_retry.py` around lines 51 - 57, LLMRetryConfig currently only
exposes max_retries and cancel_event, so the retry loop always raises
RetryExhaustedError (e.g., in the function that currently forces that at lines
~307-317) and there's no way to enable overload-threshold fallback; add a
fallback configuration to LLMRetryConfig (for example a fallback_strategy enum
or a boolean like enable_overload_fallback plus optional fallback parameters
such as overload_threshold and fallback_backoff) and update the retry logic (the
routine that raises RetryExhaustedError) to consult this new config: when
overload conditions are met and fallback is enabled, execute the configured
fallback behavior instead of unconditionally raising RetryExhaustedError. Ensure
references to LLMRetryConfig, max_retries, cancel_event, and the retry
loop/RetryExhaustedError raise site are updated to use the new fields.
Intentionally not addressed — YAGNI. We originally implemented FallbackTriggeredError and fallback_model config, then deliberately removed them during code review. No fallback model infrastructure exists in Code Puppy today, and adding speculative plumbing for a feature that doesn't exist yet violates YAGNI. When model fallback is needed, it can be added as a focused follow-up with its own tests and config surface.
@nhicks00, understood! Thank you for the context. Applying YAGNI to avoid speculative plumbing for non-existent infrastructure is the right call—better to add the fallback config surface when the actual fallback model logic lands. I've noted this decision for future reference.
✏️ Learnings added
Learnt from: nhicks00
Repo: mpfaffenberger/code_puppy PR: 211
File: code_puppy/llm_retry.py:59-65
Timestamp: 2026-02-27T03:54:07.263Z
Learning: In code_puppy/llm_retry.py, FallbackTriggeredError and fallback_model configuration were intentionally removed during review. No fallback model infrastructure exists in Code Puppy today, and adding speculative plumbing violates YAGNI. Fallback features should be added later as a focused follow-up when needed.
```python
status = _get_status_code(error)
if status is None:
    # No status code and not a recognized transient pattern — don't retry
    # blindly. Only specifically-identified patterns above are retried.
    return False

if status == 408:
    return True  # Request Timeout
if status == 409:
    return True  # Conflict
if status == 429:
    return True  # Rate Limit
if status == 401:
    return True  # Unauthorized (token may need refresh)
if status >= 500:
    return True  # Server errors

# 400 (non-overflow), 402, 403, 404, 422, etc. — fatal
return False
```
Missing context-overflow recovery path for retryable 400 overflow errors.
Line 193 treats 400 as fatal, and Line 328 immediately exits. The required context-overflow handling (parse inputTokens/contextLimit, compute available tokens, set max_tokens=max(3000, available), retry immediately without sleep) is not implemented.
Also applies to: 327-333
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@code_puppy/llm_retry.py` around lines 176 - 194, The retry logic currently
treats status 400 as fatal; update the function that uses _get_status_code (the
is_retryable / should_retry path around _get_status_code(error)) to detect the
specific context-overflow 400 response and implement the context-overflow
recovery: when the error payload indicates inputTokens/contextLimit overflow,
parse inputTokens and contextLimit from the error, compute available =
contextLimit - inputTokens, set the retry request's max_tokens to max(3000,
available) and flag this retry to happen immediately (no sleep/backoff), then
requeue the request; leave other 400s unchanged (still fatal). Ensure this same
handling is added where the code currently exits on 400 in the other retry
branch referenced (the block around the immediate-exit at the other use of
_get_status_code).
Intentionally not addressed — out of scope. Context-overflow recovery (detecting 400s caused by exceeding context window, then auto-reducing max_tokens or truncating prompts) is a new feature, not a bug in the retry engine. It would require token counting logic, model-specific context limits, and prompt truncation strategies — none of which belong in a retry PR. This would be a good candidate for a separate issue and focused follow-up PR.
@nhicks00, that's a reasonable architectural decision. You're right that context-overflow recovery involves significant complexity (token counting, model-specific context windows, prompt truncation strategies) that would expand the scope of this retry engine PR considerably.
While issue #210 did reference context-overflow handling, your core retry engine implementation successfully addresses the primary objective: robust exponential backoff and transient failure handling for the identified error classes. Separating the context-overflow feature into a focused follow-up PR will allow for proper design and testing of that specific capability.
Would you like me to open a follow-up issue to track the context-overflow recovery feature? I can draft it with the specific requirements (detect context-window 400s, parse token limits, compute available tokens, adjust max_tokens, immediate retry) and link it back to this PR and #210 for context.
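If that follow-up ever lands, the parsing step the review describes might look like this sketch. The `inputTokens`/`contextLimit` field names, the message format, and the `max(3000, ...)` floor all come from the review comment above; real provider payloads encode this differently, and `overflow_budget` is a hypothetical helper, not anything in the PR.

```python
import re
from typing import Optional

def overflow_budget(message: str) -> Optional[int]:
    """Parse inputTokens/contextLimit out of a 400 error message.

    Returns a reduced max_tokens budget when the error looks like a
    context-window overflow, or None for any other 400 (still fatal).
    """
    match = re.search(r"inputTokens[^0-9]*(\d+).*?contextLimit[^0-9]*(\d+)", message)
    if not match:
        return None
    input_tokens, context_limit = int(match.group(1)), int(match.group(2))
    available = context_limit - input_tokens
    # Floor at 3000 so a nearly-full context still leaves a usable budget.
    return max(3000, available)
```

The retry loop would then requeue immediately (no backoff sleep) with `max_tokens` set to the returned budget, per the review's recipe.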
- Clamp negative PUPPY_MAX_LLM_RETRIES values to default with warning
- Tighten test_retry_after_header_respected to assert _compute_backoff receives the parsed header value (1.5), not just that retry occurred
- Add test_config_negative_env test

Addresses CodeRabbit review findings mpfaffenberger#1 and mpfaffenberger#5 on PR mpfaffenberger#211.

CodeRabbit findings not addressed (by design):
- mpfaffenberger#2 (FallbackTriggeredError): Intentionally removed during code review as YAGNI. No fallback model infrastructure exists today — adding speculative code for a feature that doesn't exist yet violates YAGNI.
- mpfaffenberger#3 (context-overflow 400 auto-reduce max_tokens): Out of scope. This is a new feature requiring token counting, model-specific context limits, and prompt truncation — none of which belong in a retry engine PR.
- mpfaffenberger#4 (duplicate _toolsets line in base_agent.py): Pre-existing issue not introduced by this PR. Fixing unrelated code here violates minimal-diff principles.
♻️ Duplicate comments (2)
code_puppy/llm_retry.py (2)
281-287: ⚠️ Potential issue | 🟠 Major. Clamp/validate `config.max_retries` before the loop. A negative value can still be provided directly via `LLMRetryConfig(max_retries=...)`, which can skip all attempts and hit the tail `RetryExhaustedError` path with `last_error=None`.

Proposed hardening
```diff
 async def llm_run_with_retry(
     coro_factory: Callable[[], Any],
     config: Optional[LLMRetryConfig] = None,
 ) -> Any:
@@
     if config is None:
         config = LLMRetryConfig()
-    max_retries = config.max_retries
+    max_retries = max(0, config.max_retries)
     overload_hits = 0
     last_error: Optional[Exception] = None
```

Also applies to: 362-365
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@code_puppy/llm_retry.py` around lines 281 - 287, Clamp and validate config.max_retries before using it in the retry loop: ensure the value read from config.max_retries is converted to an int and clamped to a non-negative (e.g., max(0, int(...))) and optionally bounded by a sensible upper limit, then assign that sanitized value to the local max_retries used by the loop (the variable set just above the "for attempt in range(1, max_retries + 2):" line); do the same sanitization wherever you read config.max_retries (also at the later usage around lines 362-365) so the loop cannot be skipped due to a negative value and last_error cannot remain None when RetryExhaustedError is raised.
120-125: ⚠️ Potential issue | 🟡 Minor. Normalize `x-should-retry` value parsing. The check is case-sensitive and whitespace-sensitive, so authoritative header hints like `"True"` or `" false "` are ignored.

Proposed fix
```diff
 def _get_x_should_retry(error: Exception) -> Optional[bool]:
@@
-    val = headers.get("x-should-retry")
-    if val == "true":
+    val = headers.get("x-should-retry")
+    if val is None:
+        return None
+    normalized = str(val).strip().lower()
+    if normalized == "true":
         return True
-    if val == "false":
+    if normalized == "false":
         return False
     return None
```
Verify each finding against the current code and only fix it if needed. In `@code_puppy/llm_retry.py` around lines 120 - 125, The header parsing for "x-should-retry" is currently case- and whitespace-sensitive; normalize the retrieved value (the variable val from headers.get("x-should-retry")) by calling strip() and lower() before comparison so values like " True " or "FALSE" are handled correctly, then return True if normalized == "true", False if normalized == "false", otherwise return None.
🧹 Nitpick comments (1)
code_puppy/llm_retry.py (1)
90-109: Support HTTP-date `Retry-After` values for full RFC compatibility. Right now only numeric seconds are parsed. If the server sends a date-form `Retry-After`, this path falls back to computed backoff instead of honoring the header.

Proposed refactor
```diff
+from datetime import datetime, timezone
+from email.utils import parsedate_to_datetime
@@
 def _get_retry_after(error: Exception) -> Optional[float]:
@@
     try:
         return float(val)
     except (ValueError, TypeError):
-        return None
+        try:
+            dt = parsedate_to_datetime(str(val))
+            if dt.tzinfo is None:
+                dt = dt.replace(tzinfo=timezone.utc)
+            delta = (dt - datetime.now(timezone.utc)).total_seconds()
+            return delta if delta > 0 else None
+        except Exception:
+            return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@code_puppy/llm_retry.py` around lines 90 - 109, The _get_retry_after function currently only parses numeric seconds; update it to also accept HTTP-date strings by, after the numeric float attempt, trying to parse the header value with a RFC-compatible parser (e.g., email.utils.parsedate_to_datetime) to produce a timezone-aware datetime, compute seconds = (parsed_datetime - datetime.now(timezone.utc)).total_seconds() and return that seconds value if positive (or 0), otherwise None; ensure you handle parsing exceptions and non-aware datetimes by normalizing to UTC and return None on failure. Reference: function name _get_retry_after, local variable val and headers retrieval logic.
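For reference, a standalone, runnable version of the two-form Retry-After parsing discussed in this thread (delta-seconds or HTTP-date, per RFC 9110). The helper name `parse_retry_after` and the injectable `now` parameter are illustrative; the PR's actual function is `_get_retry_after` and reads the value from error headers.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Seconds to wait per a Retry-After value: delta-seconds or HTTP-date."""
    try:
        seconds = float(value)  # delta-seconds form, e.g. "120"
        return seconds if seconds >= 0 else None
    except ValueError:
        pass
    try:
        dt = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # neither form parsed; caller falls back to backoff
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    delta = (dt - (now or datetime.now(timezone.utc))).total_seconds()
    return delta if delta > 0 else None
```

Dates already in the past yield None rather than a zero or negative delay, so the caller's computed backoff still applies.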
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- code_puppy/llm_retry.py
- tests/test_llm_retry.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_llm_retry.py
Response to second CodeRabbit review (b4a5c17):
1. Clamp `config.max_retries` before the retry loop
2. Normalize `x-should-retry` header parsing
3. Support HTTP-date `Retry-After` values
Summary
Adds `llm_run_with_retry()` — a retry wrapper around all three `pydantic_agent.run()` call sites in `base_agent.py`. Handles 429 rate limits, 529 overloads, 5xx server errors, network failures, and streaming errors with exponential backoff.

- `code_puppy/llm_retry.py` (357 lines)
- `base_agent.py` (3 call sites wrapped), `callbacks.py` (+2 phase types)
- tests for `llm_retry.py`

What it does
- Full `__cause__` chain walking (through pydantic_ai → SDK → httpx → OSError)
- Streaming error detection ("Streamed response ended" from pydantic_ai)
- `Retry-After` header support
- `x-should-retry` header support

Configurable via `PUPPY_MAX_LLM_RETRIES` env var (default 10).

Walmart Fork Sync Notes
This PR modifies core code that diverges from the Walmart fork. Here's what
the sync merge looks like:
- `callbacks.py` — Additive merge, no conflicts expected. Walmart fork has `run_shell_command_output` (not in OSS); this PR adds `api_retry_start`/`api_retry_end`. Keep all entries from both sides.
- `base_agent.py` — Merge conflict expected at the `pydantic_agent.run()` call sites. The Walmart fork wraps these with `_run_with_streaming_retry()`. Resolution:
  - Delete `_run_with_streaming_retry()`, `MAX_STREAMING_RETRIES`, and `STREAMING_RETRY_DELAYS` entirely
  - Use the `llm_run_with_retry()` wrapper from this PR — it handles the same streaming error patterns (`"streamed response ended"`) plus HTTP-level retries that `_run_with_streaming_retry` never covered
  - Rename `event_stream_handler` → `stream_handler` to match Walmart fork naming
  - Wire the `except* RetryExhaustedError` handler into the fork's exception chain
- `llm_retry.py` and `tests/test_llm_retry.py` — New files, land cleanly.

Design Note: Why core, not a plugin
We considered making retry a plugin via a `wrap_llm_call` callback, which would let the Walmart fork register a gateway-aware retry strategy without merge conflicts. We opted against it because the existing callback system is fire-and-forget — it doesn't support middleware-style coroutine wrapping without significant plumbing changes. A utility function in core is the simplest approach that works.

For Walmart-specific gateway intelligence (e.g., detecting Vertex AI per-minute quota windows, per-model quota tracking), a Walmart fork plugin can register an `api_retry_start` callback to layer on top of the core retry engine. The hooks are already in place for this — no core changes needed.
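The `__cause__` chain walk that powers transient-error detection can be sketched as follows. The helper names `walk_causes`/`is_transient` and the exact set of exception types treated as network-level are assumptions for illustration, not the module's actual API.

```python
from typing import Iterator

def walk_causes(error: BaseException) -> Iterator[BaseException]:
    """Yield error and every exception in its __cause__/__context__ chain.

    Wrapped transports (pydantic_ai -> SDK -> httpx -> OSError) bury the
    retryable error several layers deep; walking the chain surfaces it.
    The `seen` set guards against accidental cycles.
    """
    seen: set[int] = set()
    current: BaseException | None = error
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        yield current
        current = current.__cause__ or current.__context__

def is_transient(error: BaseException) -> bool:
    """True when any link in the chain is a plain network-level failure."""
    return any(
        isinstance(e, (OSError, ConnectionError, TimeoutError))
        for e in walk_causes(error)
    )
```

A real classifier would also consult status codes and streaming-error message patterns before the network-level fallback, as described above.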
Test plan
- Simulated transient failure (`ModelHTTPError(status_code=429)`) — retries and recovers
- `ruff format` / `ruff check` clean

Resolves #210