Skip to content

perf(serve): reduce worker/router event-loop lag#1453

Merged
mikasenghaas merged 14 commits into
mainfrom
perf/r3-serve
May 24, 2026
Merged

perf(serve): reduce worker/router event-loop lag#1453
mikasenghaas merged 14 commits into
mainfrom
perf/r3-serve

Conversation

@mikasenghaas
Copy link
Copy Markdown
Member

@mikasenghaas mikasenghaas commented May 23, 2026

Summary

Reduce env_worker / env_router event-loop lag. Pairs with the prime-rl PR (perf/r3-rewrite); verifiers changes are on the request-handling side (env_router / env_worker / response unpack).

Coupled with PrimeIntellect-ai/prime-rl#2609 (perf/r3-rewrite). The prime-rl PR fixes the orch-side counterpart of the same lag and bumps the verifiers submodule to this branch's head. Land or revert together — without the prime-rl side the orch loop saturates first under 256-rollout concurrency and these env-side gains don't surface.

What's in here

  • parse_response_tokens off the loop (46588cf7, re-added in 68b789ca after a brief revert) — the body now runs in a worker thread (asyncio.to_thread). Was running on the env_worker loop, blocking recv handlers during heavy parse work. One brittle test (test_update_and_reward_children_can_share_borrowed_live_tools) was dropped — it asserted a specific scheduling order against an order-sensitive mock client; the borrowed-tools-sharing contract is still exercised by other tests in the same file.
  • state_to_output off the loop (11befd7d) — same idea, env-side. The message serialization to vf.RolloutOutput is CPU heavy enough that running it on the loop noticeably starved other recv tasks. Applied in both Environment.run_rollout and Environment.run_group (group states processed concurrently via asyncio.gather).
  • uvloop in env_worker + env_server (4c89d700) — drop-in scheduler-overhead win. Declared as a direct verifiers dep in 31b3830b, with a non-Windows / non-PyPy platform marker in 5d6585c2 (uvloop has no wheels for those).
  • Scale thread pool + offload request unpack (445f3797):
    • Bumped the default to_thread executor on env_worker / env_server to 512 — default 32 was bottlenecking 256 concurrent rollouts.
    • msgpack.unpackb + Pydantic model_validate for incoming requests moved to to_thread.
  • Cleanup commits: 327fad0e drops intervention-ID prefixes from comments and nests the parse_response_tokens sync body as a closure; f7d7acb0 strips the explanatory perf paragraphs (commit messages carry the why); 34d1a63e ruff fix + format.

What's NOT in here (dropped from earlier iterations)

  • env_router on_response asyncio.gather batching (was part of 445f3797, reverted in ed5d78fb) — drained all ready response frames first, then dispatched concurrently. Real wallclock cost for unclear value once the per-rollout offloads landed.

How to validate

Pair with the prime-rl PR and watch the env_worker Lag: line for max=. With these changes the typical max env-worker lag should stay under a few hundred ms even under 256 concurrent rollouts; before this stack we were seeing 6 s+ peaks.

Note

Reduce event-loop lag in serve workers by offloading CPU-bound work to threads

  • Offloads state_to_output calls in Environment.run_rollout and Environment.run_group to threads via asyncio.to_thread, with group states processed concurrently using asyncio.gather.
  • Offloads msgpack unpacking and Pydantic model_validate in the EnvWorker request handler, and token parsing in parse_response_tokens, to the threadpool.
  • Scales the default threadpool executor to concurrency=512 in both EnvWorker and EnvServer, and installs uvloop when available.

Changes since #1453 opened

  • Added pytest test validating parallel child rollouts sharing borrowed tools [73e4681]
  • Added update and reward functions spawning child rollouts with borrowed tools [73e4681]
  • Implemented deterministic test double model client routing responses by prompt content [73e4681]
📊 Macroscope summarized 445f379. 5 files reviewed, 1 issue evaluated, 0 issues filtered, 1 comment posted

🗂️ Filtered Issues


Note

Medium Risk
Touches the env server/worker request path and rollout output/token parsing, introducing more asyncio.to_thread usage and larger thread pools; risk is mainly around concurrency behavior, CPU/memory overhead, and subtle ordering/test assumptions.

Overview
Reduces env server/worker event-loop blocking by moving CPU-heavy work off the loop: state_to_output in Environment.run_rollout/run_group, token parsing in parse_response_tokens, and request deserialization (msgpack.unpackb + Pydantic model_validate) in EnvWorker now run via asyncio.to_thread.

Improves runtime scheduling by optionally installing uvloop in EnvServer.run_server/EnvWorker.run_worker and scaling the default executor to handle high to_thread concurrency (set to 512). Updates a lifecycle test to be robust to parallel asyncio.gather interleaving by using a routed mock client and removing order-dependent assertions.

Reviewed by Cursor Bugbot for commit 73e4681. Bugbot is set up for automated code reviews on this repo. Configure here.

codex and others added 5 commits May 24, 2026 04:40
parse_response_tokens is per-turn × per-rollout (up to ~100 × ~256 per
worker concurrently). The body is pure-Python list slicing + dict build
but accumulates real lag on the event loop under heavy concurrency.

Split into a sync `_parse_response_tokens_sync` and keep the async
wrapper that now delegates via asyncio.to_thread. Call sites unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
state_to_output walks the full trajectory dict + sanitizes messages per
rollout — can run 100s of ms on long rollouts. Moved off the loop in
both run_rollout and run_group so the worker stays responsive when many
rollouts finish near simultaneously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both processes run a single asyncio loop juggling many concurrent
rollouts / many ZMQ I/O ops; the default selector loop becomes the
latency floor before any single operation does. uvloop drops the floor
materially.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The env_worker / env_server runtime imports are wrapped in try/except,
so declaring uvloop here would only cause uv.lock churn and trigger
per-worker .venv resync on shared filesystems. uvloop is already part
of the prime-rl dependency tree and resolved by the workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h router sends

Round 1 wrapped parse_response_tokens + state_to_output in to_thread but env
worker / server lag barely moved. Root cause: the default asyncio thread pool
is min(32, cpu_count+4) which is 32 here. With ~256 concurrent rollouts per
worker, those to_thread calls serialize across 32 threads — queuing eats most
of the win.

R1 (env_worker.py, env_server.py): call scale_executors(concurrency=512) before
install_default_executor() so the loop's default executor has real headroom.
512 threads are cheap when idle and give 2x the peak concurrency.

R2 (env_worker.py): wrap incoming-request msgpack.unpackb + Pydantic
model_validate in asyncio.to_thread. These ran on the loop and Pydantic
validation of RunRolloutRequest is non-trivial when many requests land at
once.

R4 (env_router.py): drain all ready response frames first, then dispatch
on_response via asyncio.gather instead of serial await. The prior loop awaited
a TCP send per response which compounds when many workers finish near
simultaneously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread verifiers/serve/server/env_router.py Outdated
mikasenghaas and others added 2 commits May 23, 2026 23:35
Reverts the asyncio.gather batching of on_response dispatches in
EnvRouter.run, restoring the serial `await on_response(...)` per
response. R1 (scale_executors) and R2 (msgpack/Pydantic to_thread)
from 445f379 are kept.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…body

- Strip R1/R2/E1/E4 intervention IDs from comments in env_server.py,
  env_worker.py, response_utils.py. Reword the remaining comments
  to stand on their own without referencing the rollout plan.
- Move `_parse_response_tokens_sync` inside `parse_response_tokens`
  as a nested closure. Same to_thread offload, no module-level helper
  exposed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title perf(serve): reduce env_worker/env_router event-loop lag perf(serve): reduce worker/router event-loop lag May 23, 2026
verifiers' env_worker / env_server install uvloop at startup; declaring
it as a direct dep makes that intent explicit instead of relying on
prime-rl's workspace tree to transitively pull it in. No uv.lock churn —
uvloop was already resolved via the workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread pyproject.toml Outdated
mikasenghaas and others added 2 commits May 23, 2026 23:40
- Drop unused `from typing import Any` in response_utils.py.
- Reformat split imports and blank lines per ruff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uvloop ships no wheels for Windows / Cygwin and its C extension
fails to compile there; PyPy is similarly unsupported. Restrict the
direct dep to platforms uvloop actually targets — the runtime
import is already wrapped in try/except, so a missing uvloop is
a graceful degradation, not a crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 23, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The to_thread wrapper introduced a mid-rollout event-loop yield that
broke test_update_and_reward_children_can_share_borrowed_live_tools.
The test gathers parallel update children + a reward child against a
CapturingModelClient that hands out canned responses in arrival order;
the extra yield reshuffled which coroutine consumed which frame and
the reward child landed on the wrong response.

The other lag fixes in this PR (uvloop, scale_executors=512, msgpack
unpack to_thread, state_to_output to_thread) don't touch per-request
scheduling, so the env_worker lag improvement holds without this piece.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The parse is pure-Python list slicing + dict build, but it runs per-turn
for every concurrent rollout (~100 turns × ~256 rollouts/worker), so
wallclock contention on the event loop adds up. Offload the body to a
worker thread via asyncio.to_thread so the loop stays responsive for
ZMQ I/O and other tasks.

Drop test_update_and_reward_children_can_share_borrowed_live_tools.
The test exercised parallel children + a reward child against a
CapturingModelClient that hands out canned responses in arrival order;
the to_thread yield reshuffled which coroutine consumed which frame.
The actual borrowed-tools-sharing contract is still covered by other
tests in the same file — only the order-sensitive variant is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the inline rationale paragraphs around state_to_output to_thread,
msgpack/Pydantic unpack offload, uvloop install, and parse_response_tokens
offload. Code is self-explanatory; commit messages carry the why.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas requested a review from samsja May 24, 2026 00:41
@mikasenghaas mikasenghaas marked this pull request as ready for review May 24, 2026 00:41
@macroscopeapp
Copy link
Copy Markdown

macroscopeapp Bot commented May 24, 2026

Approvability

Verdict: Approved

Performance optimization that offloads CPU-bound parsing to thread pools and uses uvloop for faster async event loops. The functional logic remains unchanged - only execution scheduling is affected. Standard async patterns applied consistently across the serve/worker subsystem.

You can customize Macroscope's approvability policy. Learn more.

mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tests/test_v1_runtime_lifecycle.py
S1ro1
S1ro1 previously approved these changes May 24, 2026
Copy link
Copy Markdown
Contributor

@S1ro1 S1ro1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

… mock

The previous test_update_and_reward_children_can_share_borrowed_live_tools
was dropped when the parse_response_tokens to_thread offload was re-added,
because it depended on a sequential canned-response list. With parallel
asyncio.gather children, the to_thread yield reshuffled which coroutine
consumed which response — a test-fixture artifact, since real model clients
return responses keyed to each request.

Rebuild the test with RoutedModelClient: it inspects each request's
conversation (first user message + presence of tool-role messages) and
returns the appropriate canned response, independent of call order.
Resulting test exercises the same contract (parallel update children +
reward child all share the borrowed tool, all three values land in state)
but is robust to event-loop interleaving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit to PrimeIntellect-ai/prime-rl that referenced this pull request May 24, 2026
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas merged commit a930f2e into main May 24, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants