perf(serve): reduce worker/router event-loop lag by mikasenghaas · Pull Request #1453 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-05-23T23:11:39Z

Summary

Reduce env_worker / env_router event-loop lag. Pairs with the prime-rl PR (perf/r3-rewrite); verifiers changes are on the request-handling side (env_router / env_worker / response unpack).

Coupled with PrimeIntellect-ai/prime-rl#2609 (perf/r3-rewrite). The prime-rl PR fixes the orch-side counterpart of the same lag and bumps the verifiers submodule to this branch's head. Land or revert together — without the prime-rl side the orch loop saturates first under 256-rollout concurrency and these env-side gains don't surface.

What's in here

parse_response_tokens off the loop (46588cf7, re-added in 68b789ca after a brief revert) — the body now runs in a worker thread (asyncio.to_thread). Was running on the env_worker loop, blocking recv handlers during heavy parse work. One brittle test (test_update_and_reward_children_can_share_borrowed_live_tools) was dropped — it asserted a specific scheduling order against an order-sensitive mock client; the borrowed-tools-sharing contract is still exercised by other tests in the same file.
state_to_output off the loop (11befd7d) — same idea, env-side. The message serialization to vf.RolloutOutput is CPU heavy enough that running it on the loop noticeably starved other recv tasks. Applied in both Environment.run_rollout and Environment.run_group (group states processed concurrently via asyncio.gather).
uvloop in env_worker + env_server (4c89d700) — drop-in scheduler-overhead win. Declared as a direct verifiers dep in 31b3830b, with a non-Windows / non-PyPy platform marker in 5d6585c2 (uvloop has no wheels for those).
Scale thread pool + offload request unpack (445f3797):
- Bumped the default to_thread executor on env_worker / env_server to 512 threads; the default of min(32, cpu_count+4) = 32 was bottlenecking 256 concurrent rollouts.
- msgpack.unpackb + Pydantic model_validate for incoming requests moved to to_thread.
Cleanup commits: 327fad0e drops intervention-ID prefixes from comments and nests the parse_response_tokens sync body as a closure; f7d7acb0 strips the explanatory perf paragraphs (commit messages carry the why); 34d1a63e ruff fix + format.
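
As a rough illustration of the request-unpack offload listed above, here is a minimal sketch. It uses `json` and a dataclass as stand-ins for `msgpack.unpackb` and the Pydantic model so that it runs with the standard library only; `RunRolloutRequestStub`, `decode_request`, and `handle_frame` are hypothetical names, not the ones in env_worker.py.

```python
import asyncio
import json
from dataclasses import dataclass


@dataclass
class RunRolloutRequestStub:
    """Hypothetical stand-in for the Pydantic request model."""
    request_id: str
    prompt: str


def decode_request(raw: bytes) -> RunRolloutRequestStub:
    # Stand-in for msgpack.unpackb(...) followed by Model.model_validate(...).
    payload = json.loads(raw)
    return RunRolloutRequestStub(**payload)


async def handle_frame(raw: bytes) -> RunRolloutRequestStub:
    # Decode and validate in the loop's default executor so the event loop
    # stays free to service other sockets while this request is unpacked.
    return await asyncio.to_thread(decode_request, raw)


async def main() -> list[RunRolloutRequestStub]:
    frames = [
        json.dumps({"request_id": str(i), "prompt": "hi"}).encode()
        for i in range(4)
    ]
    # gather returns results in the same order as the input frames.
    return await asyncio.gather(*(handle_frame(f) for f in frames))
```

The handler keeps its awaitable shape, so callers do not need to change; only the place where the decoding runs moves.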

What's NOT in here (dropped from earlier iterations)

env_router on_response asyncio.gather batching (was part of 445f3797, reverted in ed5d78fb) — drained all ready response frames first, then dispatched concurrently. Real wallclock cost for unclear value once the per-rollout offloads landed.

How to validate

Pair with the prime-rl PR and watch the env_worker Lag: line for max=. With these changes the typical max env-worker lag should stay under a few hundred ms even under 256 concurrent rollouts; before this stack we were seeing 6 s+ peaks.
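
For readers who want to reproduce the measurement outside the worker, one common way to estimate event-loop lag is to time how late a short `asyncio.sleep` wakes up. The sketch below is an assumption about the general technique, not a description of how the env_worker `Lag:` line is actually computed; `measure_loop_lag` is a hypothetical helper.

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> float:
    """Return the largest observed wake-up delay (seconds) across `samples` sleeps."""
    max_lag = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Anything beyond `interval` is time the loop spent busy elsewhere.
        lag = time.perf_counter() - start - interval
        max_lag = max(max_lag, lag)
    return max_lag


async def main() -> float:
    monitor = asyncio.create_task(measure_loop_lag(interval=0.01, samples=20))
    await asyncio.sleep(0.05)  # let the monitor start sampling
    time.sleep(0.2)            # simulate CPU-bound work blocking the loop
    return await monitor
```

Running `asyncio.run(main())` reports a max lag close to the length of the blocking call, which is the kind of spike the offloads in this PR are meant to remove.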

Note

Reduce event-loop lag in serve workers by offloading CPU-bound work to threads

Offloads state_to_output calls in Environment.run_rollout and Environment.run_group to threads via asyncio.to_thread, with group states processed concurrently using asyncio.gather.
Offloads msgpack unpacking and Pydantic model_validate in the EnvWorker request handler, and token parsing in parse_response_tokens, to the threadpool.
Scales the default threadpool executor to concurrency=512 in both EnvWorker and EnvServer, and installs uvloop when available.

Changes since #1453 opened

Added pytest test validating parallel child rollouts sharing borrowed tools [73e4681]
Added update and reward functions spawning child rollouts with borrowed tools [73e4681]
Implemented deterministic test double model client routing responses by prompt content [73e4681]

Note

Medium Risk
Touches the env server/worker request path and rollout output/token parsing, introducing more asyncio.to_thread usage and larger thread pools; risk is mainly around concurrency behavior, CPU/memory overhead, and subtle ordering/test assumptions.

Overview
Reduces env server/worker event-loop blocking by moving CPU-heavy work off the loop: state_to_output in Environment.run_rollout/run_group, token parsing in parse_response_tokens, and request deserialization (msgpack.unpackb + Pydantic model_validate) in EnvWorker now run via asyncio.to_thread.

Improves runtime scheduling by optionally installing uvloop in EnvServer.run_server/EnvWorker.run_worker and scaling the default executor to handle high to_thread concurrency (set to 512). Updates a lifecycle test to be robust to parallel asyncio.gather interleaving by using a routed mock client and removing order-dependent assertions.

Reviewed by Cursor Bugbot for commit 73e4681.

parse_response_tokens is per-turn × per-rollout (up to ~100 × ~256 per worker concurrently). The body is pure-Python list slicing + dict build but accumulates real lag on the event loop under heavy concurrency. Split into a sync `_parse_response_tokens_sync` and keep the async wrapper that now delegates via asyncio.to_thread. Call sites unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
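
The sync/async split described in this commit message can be sketched as follows. The function names and the response layout (`token_ids`, `prompt_len`) are hypothetical; only the pattern of a synchronous body behind an unchanged async wrapper reflects the text above.

```python
import asyncio


def _parse_tokens_sync(response: dict) -> dict:
    # Hypothetical body: pure-Python list slicing plus a dict build.
    ids = response["token_ids"]
    n_prompt = response["prompt_len"]
    return {"prompt_ids": ids[:n_prompt], "completion_ids": ids[n_prompt:]}


async def parse_tokens(response: dict) -> dict:
    # Same awaitable signature as before, so call sites stay unchanged;
    # the work itself now runs in a worker thread.
    return await asyncio.to_thread(_parse_tokens_sync, response)
```

Because the body is pure Python, the worker thread still competes for the GIL. The gain is that the loop thread gets scheduled again at the interpreter's switch interval instead of waiting for the whole parse to finish.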

state_to_output walks the full trajectory dict + sanitizes messages per rollout — can run 100s of ms on long rollouts. Moved off the loop in both run_rollout and run_group so the worker stays responsive when many rollouts finish near simultaneously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
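
A minimal sketch of the per-rollout and per-group offload, assuming a stand-in serializer. `state_to_output_stub`, `finalize_rollout`, and `finalize_group` are hypothetical names and do not mirror the real `Environment` methods.

```python
import asyncio


def state_to_output_stub(state: dict) -> dict:
    # Hypothetical stand-in for the CPU-heavy serialization step.
    return {"id": state["id"], "n_messages": len(state["messages"])}


async def finalize_rollout(state: dict) -> dict:
    # Single rollout: one offloaded call.
    return await asyncio.to_thread(state_to_output_stub, state)


async def finalize_group(states: list[dict]) -> list[dict]:
    # Group: offload every state and wait for all of them together.
    # gather returns results in input order, so outputs line up with states.
    return await asyncio.gather(
        *(asyncio.to_thread(state_to_output_stub, s) for s in states)
    )
```

Each group member occupies one executor thread while it serializes, which is why the size of the default thread pool matters once many groups finish at the same time.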

Both processes run a single asyncio loop juggling many concurrent rollouts / many ZMQ I/O ops; the default selector loop becomes the latency floor before any single operation does. uvloop drops the floor materially. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
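
An optional uvloop install guarded by try/except might look like the sketch below. `install_uvloop_if_available` is a hypothetical helper name; the actual startup code in env_worker / env_server may be structured differently.

```python
import asyncio


def install_uvloop_if_available() -> bool:
    """Use uvloop for new event loops when the package is importable."""
    try:
        import uvloop
    except ImportError:
        # Missing uvloop degrades to the stock asyncio loop instead of failing.
        return False
    uvloop.install()
    return True


async def probe() -> str:
    # Report which module provides the running loop implementation.
    return type(asyncio.get_running_loop()).__module__
```

Calling `install_uvloop_if_available()` before `asyncio.run(...)` is enough for the new loop to be picked up. Recent uvloop releases also provide `uvloop.run()` as an alternative entry point.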

The env_worker / env_server runtime imports are wrapped in try/except, so declaring uvloop here would only cause uv.lock churn and trigger per-worker .venv resync on shared filesystems. uvloop is already part of the prime-rl dependency tree and resolved by the workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…h router sends

Round 1 wrapped parse_response_tokens + state_to_output in to_thread but env worker / server lag barely moved. Root cause: the default asyncio thread pool is min(32, cpu_count+4), which is 32 here. With ~256 concurrent rollouts per worker, those to_thread calls serialize across 32 threads, and queuing eats most of the win.

R1 (env_worker.py, env_server.py): call scale_executors(concurrency=512) before install_default_executor() so the loop's default executor has real headroom. 512 threads are cheap when idle and give 2x the peak concurrency.

R2 (env_worker.py): wrap incoming-request msgpack.unpackb + Pydantic model_validate in asyncio.to_thread. These ran on the loop and Pydantic validation of RunRolloutRequest is non-trivial when many requests land at once.

R4 (env_router.py): drain all ready response frames first, then dispatch on_response via asyncio.gather instead of serial await. The prior loop awaited a TCP send per response which compounds when many workers finish near simultaneously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
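
To make the thread-pool ceiling concrete, the sketch below replaces the loop's default executor with a larger one and then runs more simultaneous blocking calls than the stock pool of 32 threads could hold at once. `install_scaled_executor` is a hypothetical helper and is not the project's `scale_executors` / `install_default_executor`, whose signatures are not shown here.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def install_scaled_executor(
    loop: asyncio.AbstractEventLoop, max_workers: int
) -> ThreadPoolExecutor:
    """Replace the loop's default executor, which asyncio.to_thread uses."""
    executor = ThreadPoolExecutor(
        max_workers=max_workers, thread_name_prefix="offload"
    )
    loop.set_default_executor(executor)
    return executor


async def run_blocking_tasks(n_tasks: int, max_workers: int) -> int:
    install_scaled_executor(asyncio.get_running_loop(), max_workers)
    # The barrier releases only once n_tasks threads are waiting at the same
    # time, so it succeeds only if the pool can run all of them concurrently.
    barrier = threading.Barrier(n_tasks)

    def blocked_work() -> int:
        barrier.wait(timeout=10)
        return 1

    results = await asyncio.gather(
        *(asyncio.to_thread(blocked_work) for _ in range(n_tasks))
    )
    return sum(results)
```

With `n_tasks=48` and `max_workers=64` all calls complete together. With a 32-thread pool the same workload would stall at the barrier until the timeout, which is a small-scale version of the queuing described above.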

Reverts the asyncio.gather batching of on_response dispatches in EnvRouter.run, restoring the serial `await on_response(...)` per response. R1 (scale_executors) and R2 (msgpack/Pydantic to_thread) from 445f379 are kept. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…body

- Strip R1/R2/E1/E4 intervention IDs from comments in env_server.py, env_worker.py, response_utils.py. Reword the remaining comments to stand on their own without referencing the rollout plan.
- Move `_parse_response_tokens_sync` inside `parse_response_tokens` as a nested closure. Same to_thread offload, no module-level helper exposed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

verifiers' env_worker / env_server install uvloop at startup; declaring it as a direct dep makes that intent explicit instead of relying on prime-rl's workspace tree to transitively pull it in. No uv.lock churn — uvloop was already resolved via the workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop unused `from typing import Any` in response_utils.py.
- Reformat split imports and blank lines per ruff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

uvloop ships no wheels for Windows / Cygwin and its C extension fails to compile there; PyPy is similarly unsupported. Restrict the direct dep to platforms uvloop actually targets — the runtime import is already wrapped in try/except, so a missing uvloop is a graceful degradation, not a crash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>