
SYM-33: Rate limit errors cause tight retry spin loop — 54 wasted worker spawns in 10 minutes#76

Merged
tmustier merged 2 commits into main from fix/sym-33-rate-limit-spin-loop on Mar 22, 2026

Conversation

@tmustier (Owner)

Automated Symphony PR for SYM-33.

Tracker issue: https://linear.app/tmustier/issue/SYM-33/rate-limit-errors-cause-tight-retry-spin-loop-54-wasted-worker-spawns
Source branch: fix/sym-33-rate-limit-spin-loop
Base branch: main
Rollout mode: mutate


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e418cc14d1


Comment on lines +1306 to +1310
if Retry.short_lived_exit?(running_entry) do
  runtime_ms = Retry.running_entry_runtime_ms(running_entry)

  updated_state =
    state
    |> complete_issue(issue_id)
    |> reconcile_completed_issue_lifecycle(issue_id, running_entry)
    |> maybe_schedule_continuation_retry(issue_id, running_entry)

  Logger.warning(
    "Short-lived normal exit for issue_id=#{issue_id} session_id=#{session_id}" <>


P1: Keep max_turns-limited runs on the continuation path

When a worker exits normally in under short_run_threshold_ms, this branch now treats it as a transient failure unconditionally. That breaks the supported agent.max_turns continuation flow in AgentRunner.maybe_continue_after_turn/5: with agent.max_turns: 1 (or any fast single-turn workflow), every healthy turn that finishes in under a minute gets delayed by min_retry_interval_ms instead of taking the usual continuation retry, effectively throttling issue progress to one turn per minute.

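One possible shape for the exemption the reviewer asks for, as a minimal sketch — `transient_failure?/1`, `continuation_expected?/1`, and the `turns_used`/`max_turns` map fields are hypothetical names, not code from this PR:

```elixir
defmodule RetrySketch do
  @short_run_threshold_ms 60_000

  # Treat a short normal exit as transient only when no continuation is
  # expected; a healthy max_turns-limited run keeps the continuation path.
  def transient_failure?(%{exit_status: :normal, runtime_ms: runtime_ms} = entry) do
    runtime_ms < @short_run_threshold_ms and not continuation_expected?(entry)
  end

  def transient_failure?(_entry), do: false

  defp continuation_expected?(%{turns_used: used, max_turns: max})
       when is_integer(max) and used >= max,
       do: true

  defp continuation_expected?(_entry), do: false
end
```

Under this sketch, a 30s run that exhausted `max_turns: 1` is not classified as a transient failure, while a 30s exit with no turn limit still is.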

Comment on lines +133 to +137
if metadata[:delay_type] == :continuation do
  delay_ms
else
  failure_retry_delay(attempt)
  config_floor = Config.settings!().agent.min_retry_interval_ms
  rate_limit_floor = Map.get(metadata, :retry_after_ms, 0)


P1: Preserve delay metadata when retries are re-queued

This new floor logic only works while `metadata[:delay_type]` and `metadata[:retry_after_ms]` survive in the queued retry entry, but `schedule_issue_retry/4` and `pop_retry_attempt_state/3` still persist only the older identifier/error/path fields. After a continuation retry wakes up and `handle_active_retry/4` re-queues it because slots are full (or a retry poll fails), the next attempt is reclassified as a generic failure and picks up the 60s floor; any rate-limit-derived floor is also dropped on subsequent requeues.


Comment on lines +1685 to +1687
entry_retry_after = Retry.extract_retry_after_ms(Map.get(running_entry, :last_rate_limits))
state_retry_after = Retry.extract_retry_after_ms(state.worker_rate_limits)
max(entry_retry_after, state_retry_after)


P2: Store an observed-at timestamp with rate-limit snapshots

reset_in_seconds is a relative countdown from when the worker update was emitted, but compute_retry_after_ms/2 reuses the raw snapshot with no observation time. If a run receives reset_in_seconds: 60, keeps working for another 45s, and then exits short-lived, we still schedule a fresh 60s delay instead of the ~15s remaining. Because state.worker_rate_limits is also reused globally, that stale floor can leak into later unrelated retries as well.

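The adjustment the reviewer describes could look like this sketch; the `observed_at_ms` field is hypothetical (the PR stores only the raw bucket data, with no observation time):

```elixir
defmodule RateLimitSketch do
  # Remaining wait = reset window minus time elapsed since the snapshot
  # was observed, clamped at zero. `observed_at_ms` is an assumed field
  # captured when the worker update event arrived.
  def remaining_retry_after_ms(%{reset_in_seconds: secs, observed_at_ms: observed_at}, now_ms) do
    max(secs * 1_000 - (now_ms - observed_at), 0)
  end
end
```

With this, a bucket that reported `reset_in_seconds: 60` and was observed 45s ago yields the ~15s actually remaining rather than a fresh 60s delay.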

…(SYM-33)

When the Anthropic API returns a rate limit (429), workers exit after ~30s
with :normal status. The orchestrator treated this as normal completion and
immediately scheduled a continuation retry, creating a tight spin loop
(54 wasted spawns in 10 minutes).

- Add `short_run_threshold_ms` (default 60s) and `min_retry_interval_ms`
  (default 60s) to the agent config schema
- Workers that exit normally but ran below the threshold are classified
  as transient failures and get exponential backoff instead of immediate
  continuation retries
- The `min_retry_interval_ms` floor prevents retrying faster than once
  per minute regardless of backoff calculation

- Store per-running-entry rate limit info from worker update events
  (`last_rate_limits` field on the running entry)
- Extract `retry_after_ms` from rate limit bucket data (`reset_in_seconds`
  on exhausted buckets where `remaining <= 0`)
- Use the maximum of config floor, exponential backoff, and rate-limit
  retry-after as the actual delay — so a 120s API reset window overrides
  the 60s default floor
- Fall back to state-level `worker_rate_limits` when the exiting worker
  has no per-entry rate limit data
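The delay selection described above can be sketched as follows; the module, function names, and backoff curve are illustrative, not the PR's actual code, and the real floor comes from the agent config rather than a module attribute:

```elixir
defmodule DelaySketch do
  @min_retry_interval_ms 60_000

  # Illustrative exponential backoff: 2s, 4s, 8s, ...
  defp failure_retry_delay(attempt), do: trunc(:math.pow(2, attempt)) * 1_000

  # Actual delay = max of the config floor, the backoff value, and any
  # rate-limit-derived retry-after.
  def retry_delay_ms(attempt, retry_after_ms) do
    Enum.max([@min_retry_interval_ms, failure_retry_delay(attempt), retry_after_ms])
  end
end
```

Under these assumptions, `retry_delay_ms(1, 120_000)` returns 120_000 (a 120s API reset window overrides the 60s floor), while `retry_delay_ms(1, 0)` returns 60_000 (the floor wins over a small backoff).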

- 6 new tests covering rate-limit-aware backoff, state-level fallback,
  per-entry rate limit storage, and extract_retry_after_ms edge cases
- Fix pre-existing stall-retry test broken by min_retry_interval_ms floor

Closes SYM-33
@tmustier tmustier force-pushed the fix/sym-33-rate-limit-spin-loop branch from e418cc1 to 0a15e43 Compare March 22, 2026 08:26
@tmustier tmustier merged commit 0d053f3 into main Mar 22, 2026
3 checks passed