perf(serve): reduce worker/router event-loop lag#1453
Merged
Conversation
parse_response_tokens is per-turn × per-rollout (up to ~100 × ~256 per worker concurrently). The body is pure-Python list slicing + dict build but accumulates real lag on the event loop under heavy concurrency. Split into a sync `_parse_response_tokens_sync` and keep the async wrapper that now delegates via asyncio.to_thread. Call sites unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
state_to_output walks the full trajectory dict + sanitizes messages per rollout — can run 100s of ms on long rollouts. Moved off the loop in both run_rollout and run_group so the worker stays responsive when many rollouts finish near simultaneously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both processes run a single asyncio loop juggling many concurrent rollouts / many ZMQ I/O ops; the default selector loop becomes the latency floor before any single operation does. uvloop drops the floor materially. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The env_worker / env_server runtime imports are wrapped in try/except, so declaring uvloop here would only cause uv.lock churn and trigger per-worker .venv resync on shared filesystems. uvloop is already part of the prime-rl dependency tree and resolved by the workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h router sends Round 1 wrapped parse_response_tokens + state_to_output in to_thread but env worker / server lag barely moved. Root cause: the default asyncio thread pool is min(32, cpu_count+4) which is 32 here. With ~256 concurrent rollouts per worker, those to_thread calls serialize across 32 threads — queuing eats most of the win. R1 (env_worker.py, env_server.py): call scale_executors(concurrency=512) before install_default_executor() so the loop's default executor has real headroom. 512 threads are cheap when idle and give 2x the peak concurrency. R2 (env_worker.py): wrap incoming-request msgpack.unpackb + Pydantic model_validate in asyncio.to_thread. These ran on the loop and Pydantic validation of RunRolloutRequest is non-trivial when many requests land at once. R4 (env_router.py): drain all ready response frames first, then dispatch on_response via asyncio.gather instead of serial await. The prior loop awaited a TCP send per response which compounds when many workers finish near simultaneously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the asyncio.gather batching of on_response dispatches in EnvRouter.run, restoring the serial `await on_response(...)` per response. R1 (scale_executors) and R2 (msgpack/Pydantic to_thread) from 445f379 are kept. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…body - Strip R1/R2/E1/E4 intervention IDs from comments in env_server.py, env_worker.py, response_utils.py. Reword the remaining comments to stand on their own without referencing the rollout plan. - Move `_parse_response_tokens_sync` inside `parse_response_tokens` as a nested closure. Same to_thread offload, no module-level helper exposed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
verifiers' env_worker / env_server install uvloop at startup; declaring it as a direct dep makes that intent explicit instead of relying on prime-rl's workspace tree to transitively pull it in. No uv.lock churn — uvloop was already resolved via the workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop unused `from typing import Any` in response_utils.py. - Reformat split imports and blank lines per ruff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uvloop ships no wheels for Windows / Cygwin and its C extension fails to compile there; PyPy is similarly unsupported. Restrict the direct dep to platforms uvloop actually targets — the runtime import is already wrapped in try/except, so a missing uvloop is a graceful degradation, not a crash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 23, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The to_thread wrapper introduced a mid-rollout event-loop yield that broke test_update_and_reward_children_can_share_borrowed_live_tools. The test gathers parallel update children + a reward child against a CapturingModelClient that hands out canned responses in arrival order; the extra yield reshuffled which coroutine consumed which frame and the reward child landed on the wrong response. The other lag fixes in this PR (uvloop, scale_executors=512, msgpack unpack to_thread, state_to_output to_thread) don't touch per-request scheduling, so the env_worker lag improvement holds without this piece. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The parse is pure-Python list slicing + dict build, but it runs per-turn for every concurrent rollout (~100 turns × ~256 rollouts/worker), so wallclock contention on the event loop adds up. Offload the body to a worker thread via asyncio.to_thread so the loop stays responsive for ZMQ I/O and other tasks. Drop test_update_and_reward_children_can_share_borrowed_live_tools. The test exercised parallel children + a reward child against a CapturingModelClient that hands out canned responses in arrival order; the to_thread yield reshuffled which coroutine consumed which frame. The actual borrowed-tools-sharing contract is still covered by other tests in the same file — only the order-sensitive variant is gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the inline rationale paragraphs around state_to_output to_thread, msgpack/Pydantic unpack offload, uvloop install, and parse_response_tokens offload. Code is self-explanatory; commit messages carry the why. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ApprovabilityVerdict: Approved Performance optimization that offloads CPU-bound parsing to thread pools and uses uvloop for faster async event loops. The functional logic remains unchanged - only execution scheduling is affected. Standard async patterns applied consistently across the serve/worker subsystem. You can customize Macroscope's approvability policy. Learn more. |
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
S1ro1
reviewed
May 24, 2026
… mock The previous test_update_and_reward_children_can_share_borrowed_live_tools was dropped when the parse_response_tokens to_thread offload was re-added, because it depended on a sequential canned-response list. With parallel asyncio.gather children, the to_thread yield reshuffled which coroutine consumed which response — a test-fixture artifact, since real model clients return responses keyed to each request. Rebuild the test with RoutedModelClient: it inspects each request's conversation (first user message + presence of tool-role messages) and returns the appropriate canned response, independent of call order. Resulting test exercises the same contract (parallel update children + reward child all share the borrowed tool, all three values land in state) but is robust to event-loop interleaving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
added a commit
to PrimeIntellect-ai/prime-rl
that referenced
this pull request
May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reduce env_worker / env_router event-loop lag. Pairs with the prime-rl PR (
perf/r3-rewrite); verifiers changes are on the request-handling side (env_router / env_worker / response unpack).What's in here
46588cf7, re-added in68b789caafter a brief revert) — the body now runs in a worker thread (asyncio.to_thread). Was running on the env_worker loop, blocking recv handlers during heavy parse work. One brittle test (test_update_and_reward_children_can_share_borrowed_live_tools) was dropped — it asserted a specific scheduling order against an order-sensitive mock client; the borrowed-tools-sharing contract is still exercised by other tests in the same file.11befd7d) — same idea, env-side. The message serialization tovf.RolloutOutputis CPU heavy enough that running it on the loop noticeably starved other recv tasks. Applied in bothEnvironment.run_rolloutandEnvironment.run_group(group states processed concurrently viaasyncio.gather).4c89d700) — drop-in scheduler-overhead win. Declared as a direct verifiers dep in31b3830b, with a non-Windows / non-PyPy platform marker in5d6585c2(uvloop has no wheels for those).445f3797):to_threadexecutor on env_worker / env_server to 512 — default 32 was bottlenecking 256 concurrent rollouts.msgpack.unpackb+ Pydanticmodel_validatefor incoming requests moved toto_thread.327fad0edrops intervention-ID prefixes from comments and nests the parse_response_tokens sync body as a closure;f7d7acb0strips the explanatory perf paragraphs (commit messages carry the why);34d1a63eruff fix + format.What's NOT in here (dropped from earlier iterations)
env_routeron_responseasyncio.gatherbatching (was part of445f3797, reverted ined5d78fb) — drained all ready response frames first, then dispatched concurrently. Real wallclock cost for unclear value once the per-rollout offloads landed.How to validate
Pair with the prime-rl PR and watch the env_worker
Lag:line formax=. With these changes the typical max env-worker lag should stay under a few hundred ms even under 256 concurrent rollouts; before this stack we were seeing 6 s+ peaks.Note
Reduce event-loop lag in serve workers by offloading CPU-bound work to threads
state_to_outputcalls inEnvironment.run_rolloutandEnvironment.run_groupto threads viaasyncio.to_thread, with group states processed concurrently usingasyncio.gather.model_validatein theEnvWorkerrequest handler, and token parsing inparse_response_tokens, to the threadpool.concurrency=512in bothEnvWorkerandEnvServer, and installs uvloop when available.Changes since #1453 opened
📊 Macroscope summarized 445f379. 5 files reviewed, 1 issue evaluated, 0 issues filtered, 1 comment posted
🗂️ Filtered Issues
Note
Medium Risk
Touches the env server/worker request path and rollout output/token parsing, introducing more
asyncio.to_threadusage and larger thread pools; risk is mainly around concurrency behavior, CPU/memory overhead, and subtle ordering/test assumptions.Overview
Reduces env server/worker event-loop blocking by moving CPU-heavy work off the loop:
state_to_outputinEnvironment.run_rollout/run_group, token parsing inparse_response_tokens, and request deserialization (msgpack.unpackb+ Pydanticmodel_validate) inEnvWorkernow run viaasyncio.to_thread.Improves runtime scheduling by optionally installing
uvloopinEnvServer.run_server/EnvWorker.run_workerand scaling the default executor to handle highto_threadconcurrency (set to512). Updates a lifecycle test to be robust to parallelasyncio.gatherinterleaving by using a routed mock client and removing order-dependent assertions.Reviewed by Cursor Bugbot for commit 73e4681. Bugbot is set up for automated code reviews on this repo. Configure here.