Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e418cc14d1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```elixir
if Retry.short_lived_exit?(running_entry) do
  runtime_ms = Retry.running_entry_runtime_ms(running_entry)

  updated_state =
    state
    |> complete_issue(issue_id)
    |> reconcile_completed_issue_lifecycle(issue_id, running_entry)
    |> maybe_schedule_continuation_retry(issue_id, running_entry)

  Logger.warning(
    "Short-lived normal exit for issue_id=#{issue_id} session_id=#{session_id}" <>
```
Keep max_turns-limited runs on the continuation path
When a worker exits normally in under short_run_threshold_ms, this branch now treats it as a transient failure unconditionally. That breaks the supported agent.max_turns continuation flow in AgentRunner.maybe_continue_after_turn/5: with agent.max_turns: 1 (or any fast single-turn workflow), every healthy turn that finishes in under a minute gets delayed by min_retry_interval_ms instead of taking the usual continuation retry, effectively throttling issue progress to one turn per minute.
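One way to keep healthy `max_turns`-limited runs on the continuation path is to gate the transient-failure reclassification on whether the turn actually completed its work. This is only a sketch of that idea: `turn_completed?` and the module name are assumptions, not taken from the diff.

```elixir
defmodule ShortRunSketch do
  @short_run_threshold_ms 60_000

  # Hypothetical classification: a fast :normal exit is only treated as a
  # transient failure when the turn did NOT complete (e.g. a 429-induced
  # early exit). A healthy single-turn run stays on the continuation path.
  def classify_exit(runtime_ms, turn_completed?) do
    cond do
      turn_completed? -> :continuation
      runtime_ms < @short_run_threshold_ms -> :transient_failure
      true -> :continuation
    end
  end
end
```

Under this sketch, an `agent.max_turns: 1` run that finishes in 30s and completed its turn is still classified as `:continuation` rather than being throttled by the `min_retry_interval_ms` floor.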
```elixir
if metadata[:delay_type] == :continuation do
  delay_ms
else
  failure_retry_delay(attempt)
  config_floor = Config.settings!().agent.min_retry_interval_ms
  rate_limit_floor = Map.get(metadata, :retry_after_ms, 0)
```
Preserve delay metadata when retries are re-queued
This new floor logic only works while metadata[:delay_type] and metadata[:retry_after_ms] survive in the queued retry entry, but schedule_issue_retry/4 and pop_retry_attempt_state/3 still persist only the older identifier/error/path fields. After a continuation retry wakes up and handle_active_retry/4 re-queues it because all slots are full (or a retry poll fails), the next attempt is reclassified as a generic failure and picks up the 60s floor; any rate-limit-derived floor is likewise dropped on subsequent requeues.
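A minimal fix would be to persist the delay metadata alongside the older fields whenever a retry entry is re-queued. The sketch below assumes the key names from the review comment; the exact persisted-field list is hypothetical.

```elixir
defmodule RetryRequeueSketch do
  # Fields persisted today (per the review) plus the delay metadata that
  # must survive a requeue so a continuation retry is not reclassified as
  # a generic failure and does not lose its rate-limit floor.
  @persisted [:identifier, :error, :path, :delay_type, :retry_after_ms]

  def requeue_metadata(metadata), do: Map.take(metadata, @persisted)
end
```

With this, a re-queued continuation retry keeps `delay_type: :continuation` and any `retry_after_ms` floor across arbitrarily many requeues.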
```elixir
entry_retry_after = Retry.extract_retry_after_ms(Map.get(running_entry, :last_rate_limits))
state_retry_after = Retry.extract_retry_after_ms(state.worker_rate_limits)
max(entry_retry_after, state_retry_after)
```
Store an observed-at timestamp with rate-limit snapshots
reset_in_seconds is a relative countdown from when the worker update was emitted, but compute_retry_after_ms/2 reuses the raw snapshot with no observation time. If a run receives reset_in_seconds: 60, keeps working for another 45s, and then exits short-lived, we still schedule a fresh 60s delay instead of the ~15s remaining. Because state.worker_rate_limits is also reused globally, that stale floor can leak into later unrelated retries as well.
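The staleness issue could be addressed by stamping each snapshot with an observed-at time and subtracting the elapsed interval before using `reset_in_seconds` as a floor. A sketch, with the snapshot shape and function names assumed:

```elixir
defmodule RateLimitSnapshotSketch do
  # Record when the rate-limit snapshot was observed.
  def observe(reset_in_seconds, observed_at_ms) do
    %{reset_in_seconds: reset_in_seconds, observed_at_ms: observed_at_ms}
  end

  # Remaining wait is the original countdown minus elapsed time, never
  # negative, so a stale snapshot cannot impose a fresh full delay.
  def remaining_retry_after_ms(%{reset_in_seconds: s, observed_at_ms: t0}, now_ms) do
    max(s * 1000 - (now_ms - t0), 0)
  end
end
```

With this shape, a run that saw `reset_in_seconds: 60` and then worked for another 45s would schedule the ~15s remaining rather than a fresh 60s delay, and a fully expired snapshot contributes nothing to later unrelated retries.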
…(SYM-33) When the Anthropic API returns a rate limit (429), workers exit after ~30s with :normal status. The orchestrator treated this as normal completion and immediately scheduled a continuation retry, creating a tight spin loop (54 wasted spawns in 10 minutes).

- Add `short_run_threshold_ms` (default 60s) and `min_retry_interval_ms` (default 60s) to the agent config schema
- Workers that exit normally but ran below the threshold are classified as transient failures and get exponential backoff instead of immediate continuation retries
- The `min_retry_interval_ms` floor prevents retrying faster than once per minute regardless of the backoff calculation
- Store per-running-entry rate limit info from worker update events (`last_rate_limits` field on the running entry)
- Extract `retry_after_ms` from rate limit bucket data (`reset_in_seconds` on exhausted buckets where `remaining <= 0`)
- Use the maximum of the config floor, exponential backoff, and rate-limit retry-after as the actual delay, so a 120s API reset window overrides the 60s default floor
- Fall back to state-level `worker_rate_limits` when the exiting worker has no per-entry rate limit data
- 6 new tests covering rate-limit-aware backoff, state-level fallback, per-entry rate limit storage, and `extract_retry_after_ms` edge cases
- Fix a pre-existing stall-retry test broken by the `min_retry_interval_ms` floor

Closes SYM-33
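The delay computation the commit describes (maximum of the config floor, exponential backoff, and rate-limit retry-after) can be sketched as follows; the constants and helper names here are assumptions for illustration, not the PR's actual implementation.

```elixir
defmodule RetryDelaySketch do
  @min_retry_interval_ms 60_000
  @base_backoff_ms 1_000

  # Exponential backoff: 1s, 2s, 4s, ... per attempt (assumed base/growth).
  def failure_retry_delay(attempt), do: @base_backoff_ms * Integer.pow(2, attempt)

  # Actual delay is the maximum of the config floor, the backoff, and any
  # rate-limit-derived retry-after.
  def retry_delay_ms(attempt, retry_after_ms) do
    Enum.max([@min_retry_interval_ms, failure_retry_delay(attempt), retry_after_ms])
  end
end
```

Under these assumed constants, a 120s API reset window overrides the 60s default floor (`retry_delay_ms(0, 120_000)` yields 120_000), while with no rate-limit signal the floor wins (`retry_delay_ms(0, 0)` yields 60_000).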
e418cc1 to 0a15e43
Automated Symphony PR for SYM-33.
Tracker issue: https://linear.app/tmustier/issue/SYM-33/rate-limit-errors-cause-tight-retry-spin-loop-54-wasted-worker-spawns
Source branch: fix/sym-33-rate-limit-spin-loop
Base branch: main
Rollout mode: mutate