
SYM-33: Rate limit errors cause tight retry spin loop — 54 wasted worker spawns in 10 minutes#76

Merged
tmustier merged 2 commits into main from fix/sym-33-rate-limit-spin-loop on Mar 22, 2026

Conversation

@tmustier (Owner)

Automated Symphony PR for SYM-33.

Tracker issue: https://linear.app/tmustier/issue/SYM-33/rate-limit-errors-cause-tight-retry-spin-loop-54-wasted-worker-spawns
Source branch: fix/sym-33-rate-limit-spin-loop
Base branch: main
Rollout mode: mutate


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e418cc14d1


Comment on lines +1306 to +1310
if Retry.short_lived_exit?(running_entry) do
  runtime_ms = Retry.running_entry_runtime_ms(running_entry)

  updated_state =
    state
    |> complete_issue(issue_id)
    |> reconcile_completed_issue_lifecycle(issue_id, running_entry)
    |> maybe_schedule_continuation_retry(issue_id, running_entry)

  Logger.warning(
    "Short-lived normal exit for issue_id=#{issue_id} session_id=#{session_id}" <>


P1: Keep max_turns-limited runs on the continuation path

When a worker exits normally in under short_run_threshold_ms, this branch now treats it as a transient failure unconditionally. That breaks the supported agent.max_turns continuation flow in AgentRunner.maybe_continue_after_turn/5: with agent.max_turns: 1 (or any fast single-turn workflow), every healthy turn that finishes in under a minute gets delayed by min_retry_interval_ms instead of taking the usual continuation retry, effectively throttling issue progress to one turn per minute.

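One possible shape for the exemption the reviewer asks for, as a minimal sketch — `transient_failure?/1`, `continuation_expected?/1`, and the `turns_used`/`max_turns` map fields are hypothetical names, not code from this PR:

```elixir
defmodule RetrySketch do
  @short_run_threshold_ms 60_000

  # Treat a short normal exit as transient only when no continuation is
  # expected; a healthy max_turns-limited run keeps the continuation path.
  def transient_failure?(%{exit_status: :normal, runtime_ms: runtime_ms} = entry) do
    runtime_ms < @short_run_threshold_ms and not continuation_expected?(entry)
  end

  def transient_failure?(_entry), do: false

  defp continuation_expected?(%{turns_used: used, max_turns: max})
       when is_integer(max) and used >= max,
       do: true

  defp continuation_expected?(_entry), do: false
end
```

Under this sketch, a 30s run that exhausted `max_turns: 1` is not classified as a transient failure, while a 30s exit with no turn limit still is.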

Comment on lines +133 to +137
if metadata[:delay_type] == :continuation do
  delay_ms
else
  failure_retry_delay(attempt)
  config_floor = Config.settings!().agent.min_retry_interval_ms
  rate_limit_floor = Map.get(metadata, :retry_after_ms, 0)


P1: Preserve delay metadata when retries are re-queued

This new floor logic only works while `metadata[:delay_type]` and `metadata[:retry_after_ms]` survive in the queued retry entry, but `schedule_issue_retry/4` and `pop_retry_attempt_state/3` still persist only the older identifier/error/path fields. After a continuation retry wakes up and `handle_active_retry/4` re-queues it because slots are full (or a retry poll fails), the next attempt is reclassified as a generic failure and picks up the 60s floor; any rate-limit-derived floor is also dropped on subsequent requeues.


Comment on lines +1685 to +1687
entry_retry_after = Retry.extract_retry_after_ms(Map.get(running_entry, :last_rate_limits))
state_retry_after = Retry.extract_retry_after_ms(state.worker_rate_limits)
max(entry_retry_after, state_retry_after)


P2: Store an observed-at timestamp with rate-limit snapshots

reset_in_seconds is a relative countdown from when the worker update was emitted, but compute_retry_after_ms/2 reuses the raw snapshot with no observation time. If a run receives reset_in_seconds: 60, keeps working for another 45s, and then exits short-lived, we still schedule a fresh 60s delay instead of the ~15s remaining. Because state.worker_rate_limits is also reused globally, that stale floor can leak into later unrelated retries as well.

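The adjustment the reviewer describes could look like this sketch; the `observed_at_ms` field is hypothetical (the PR stores only the raw bucket data, with no observation time):

```elixir
defmodule RateLimitSketch do
  # Remaining wait = reset window minus time elapsed since the snapshot
  # was observed, clamped at zero. `observed_at_ms` is an assumed field
  # captured when the worker update event arrived.
  def remaining_retry_after_ms(%{reset_in_seconds: secs, observed_at_ms: observed_at}, now_ms) do
    max(secs * 1_000 - (now_ms - observed_at), 0)
  end
end
```

With this, a bucket that reported `reset_in_seconds: 60` and was observed 45s ago yields the ~15s actually remaining rather than a fresh 60s delay.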

…(SYM-33)

When the Anthropic API returns a rate limit (429), workers exit after ~30s
with :normal status. The orchestrator treated this as normal completion and
immediately scheduled a continuation retry, creating a tight spin loop
(54 wasted spawns in 10 minutes).

- Add `short_run_threshold_ms` (default 60s) and `min_retry_interval_ms`
  (default 60s) to the agent config schema
- Workers that exit normally but ran below the threshold are classified
  as transient failures and get exponential backoff instead of immediate
  continuation retries
- The `min_retry_interval_ms` floor prevents retrying faster than once
  per minute regardless of backoff calculation

- Store per-running-entry rate limit info from worker update events
  (`last_rate_limits` field on the running entry)
- Extract `retry_after_ms` from rate limit bucket data (`reset_in_seconds`
  on exhausted buckets where `remaining <= 0`)
- Use the maximum of config floor, exponential backoff, and rate-limit
  retry-after as the actual delay — so a 120s API reset window overrides
  the 60s default floor
- Fall back to state-level `worker_rate_limits` when the exiting worker
  has no per-entry rate limit data
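The delay selection described above can be sketched as follows; the module, function names, and backoff curve are illustrative, not the PR's actual code, and the real floor comes from the agent config rather than a module attribute:

```elixir
defmodule DelaySketch do
  @min_retry_interval_ms 60_000

  # Illustrative exponential backoff: 2s, 4s, 8s, ...
  defp failure_retry_delay(attempt), do: trunc(:math.pow(2, attempt)) * 1_000

  # Actual delay = max of the config floor, the backoff value, and any
  # rate-limit-derived retry-after.
  def retry_delay_ms(attempt, retry_after_ms) do
    Enum.max([@min_retry_interval_ms, failure_retry_delay(attempt), retry_after_ms])
  end
end
```

Under these assumptions, `retry_delay_ms(1, 120_000)` returns 120_000 (a 120s API reset window overrides the 60s floor), while `retry_delay_ms(1, 0)` returns 60_000 (the floor wins over a small backoff).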

- 6 new tests covering rate-limit-aware backoff, state-level fallback,
  per-entry rate limit storage, and extract_retry_after_ms edge cases
- Fix pre-existing stall-retry test broken by min_retry_interval_ms floor

Closes SYM-33
@tmustier tmustier force-pushed the fix/sym-33-rate-limit-spin-loop branch from e418cc1 to 0a15e43 Compare March 22, 2026 08:26
@tmustier tmustier merged commit 0d053f3 into main Mar 22, 2026
3 checks passed