perf(orchestrator): reduce event-loop lag by mikasenghaas · Pull Request #2609 · PrimeIntellect-ai/prime-rl

mikasenghaas · 2026-05-23T23:41:43Z

Summary

Reduce orchestrator step-boundary event-loop lag, motivated by router-replay runs (Qwen3-30B-A3B + GLM-5.1) where the per-step routed_experts payload (~25 GB across 256 rollouts) caused multi-second loop-lag spikes at step boundaries.

Changes are all in the orchestrator and its post-rollout pipeline. The trainer/inference paths are untouched.

Coupled with PrimeIntellect-ai/verifiers#1453 (perf/r3-serve). The verifiers PR fixes the env-side counterpart of the same lag (env_worker / env_router), which this PR's verifiers submodule bump pulls in. Land or revert together — without the verifiers side the env worker loop saturates first under 256-rollout concurrency and the orch-side gains here don't surface.

Follow-up #2612 (perf/r3-sidecar) layers a routed_experts sidecar-split sender on top of this PR. Use it if the to_thread-only send here isn't enough for the largest router-replay payloads.

What's in here

7 commits, ~200 LOC. Each commit is a single intervention:

#	Commit	What
1	`perf(transport): offload TrainingBatch send to a worker thread`	Make `TrainingBatchSender.send` async; encode + disk write run via `asyncio.to_thread` so the orch event loop stays responsive during step transitions.
2	`perf(orchestrator): await async sender; to_thread advantages+filters+uvloop`	Await the now-async sender. Wrap `compute_advantages` and `apply_filters` in `asyncio.to_thread` — pure-Python work that was running on the loop. Install uvloop in `main()`.
3	`perf(orchestrator): log p99 event-loop lag in metrics + busy-loop warning`	Add `event_loop_lag/p99` to the wandb dict and the busy-loop warning line. Raise warn thresholds to p90>1s / p99>5s / max>30s — p90 alone undersold the long tail we actually care about.
4	`perf(orchestrator): defer routed_experts concat in interleave_rollout`	Per-extension `unpack → numpy.concatenate → repack` was O(N²) byte copies (~2.3 GB memcpy per long rollout). New path keeps a per-sample chunk list and concatenates once at finalize. O(N) byte work, ~100× less per rollout.
5	`perf(orchestrator): orjson save_rollouts + dedup bool conversion`	`json.dump` → `orjson.dumps`. The pure-Python serializer held the GIL for the whole save (multi-second spikes on big steps); orjson serializes in compiled Rust code, so the GIL is held for a much shorter window. Also dropped the redundant `[bool(i) for i in mask]` re-conversion in `make_sample`/`extend_sample` (use `list(...)` instead) and switched `prepare_step_tokens` to `list(map(bool, ...))`.
6	`perf(orchestrator): skip pretokenize fanout when nothing needs work; 10Hz lag monitor`	Pretokenize is a no-op for router-replay (every step already has `tokens` populated), but the 256-way `to_thread` fanout still fired just to do nothing, and the resulting GIL stampede blocked the loop. Skip the fanout when no rollout needs work. Also raise the `EventLoopLagMonitor` sample rate from 1 Hz to 10 Hz to catch sub-second spikes.
7	`chore(deps): bump verifiers submodule to PR 1453 head`	Pulls in the env_worker / env_router lag fixes from verifiers#1453.

What's NOT in here (deferred / dropped)

Streaming + routed_experts sidecar in the sender: moved to follow-up #2612 (perf(transport): stream-sidecar TrainingBatch sender (routed_experts split)). Real win if the simpler to_thread send here is insufficient under heavy router replay, but adds an on-disk format change and per-sample loop yields.
Shared ProcessPoolExecutor — tried for router-replay's GIL residual, didn't help (per-call pickle/IPC cost dominated the small per-rollout payload). Dropped.
Per-phase timing instrumentation + dump_raw_rollouts — useful while investigating but not worth keeping in prod.
Chunked gather (gather_chunk_size): sliced interleave_rollout waves with await sleep(0) between. Real wallclock cost for unclear value once the algorithmic fix (commit #4 above) landed. Reverted.
Killing base64 on the routed_experts wire — profile shows pybase64.b64decode_as_bytearray is ~73% of single-thread CPU in interleave_rollout. Releases the GIL so it parallelises and doesn't block the loop directly, but it's the dominant wallclock floor. Moving inference→orch transport to msgpack bin would eliminate the encode+decode. Cross-repo, deferred.
TrainingSample.{prompt,completion}_mask: list[bool] → list[int] or bytes — would kill the last Python loop in prepare_step_tokens. Interface change, deferred until empirical evidence it matters.
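
The base64 item above comes down to simple size and CPU arithmetic: base64 maps every 3 raw bytes to 4 ASCII characters, so the wire payload grows by about a third and each side pays an extra encode or decode pass. A minimal sketch, using the standard-library `base64` module as a stand-in for `pybase64` (the helper name `wire_sizes` is hypothetical and not part of prime-rl):

```python
import base64


def wire_sizes(raw: bytes) -> tuple[int, int]:
    """Return (raw size, base64-encoded size) in bytes for one payload."""
    encoded = base64.b64encode(raw)
    # Decoding must reproduce the original bytes exactly.
    assert base64.b64decode(encoded) == raw
    return len(raw), len(encoded)


raw_size, b64_size = wire_sizes(bytes(3_000))
print(raw_size, b64_size)  # 3000 4000
```

A binary-native field (for example a msgpack `bin` value, as the item suggests) would carry the 3000 bytes as-is and skip both passes.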

Note

Medium Risk
Touches the orchestrator post-rollout pipeline and transport TrainingBatchSender API, so regressions could impact training throughput or batch delivery ordering, though changes are largely performance-oriented and isolated from model/inference logic.

Overview
Reduces orchestrator event-loop stalls during step transitions by pushing CPU/GIL-heavy work off the loop and cutting per-rollout copying overhead.

compute_advantages, rollout filtering, and filesystem batch encoding/writes now run via asyncio.to_thread, and TrainingBatchSender.send becomes async (with the orchestrator awaiting it). Pretokenization fanout is skipped when all steps already have tokens, and uvloop is installed in main().

interleave_rollout now defers routed_experts concatenation/packing until finalization (chunking per step and concatenating once), and rollout JSONL saving switches to orjson. Event-loop lag monitoring is sampled at 10Hz and logs/exports p99 lag with updated warning thresholds.

Reviewed by Cursor Bugbot for commit b92f7d3. Bugbot is set up for automated code reviews on this repo.

S1ro1

love

Make FileSystemTrainingBatchSender.send async and run the msgspec encode + disk write in asyncio.to_thread. Keeps the orch event loop responsive during step transitions when the batch payload is large. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
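
A minimal sketch of the pattern this commit describes, under stated assumptions: the class name `FileSystemSenderSketch`, the file naming, and the use of `json` in place of msgspec are all hypothetical stand-ins, not the prime-rl implementation. The blocking encode and disk write live in a plain method, and the async `send` only hands that method to a worker thread.

```python
import asyncio
import json
import tempfile
from pathlib import Path


class FileSystemSenderSketch:
    """Hypothetical stand-in for a filesystem-backed batch sender."""

    def __init__(self, out_dir: Path) -> None:
        self.out_dir = out_dir

    def _encode_and_write(self, step: int, batch: dict) -> Path:
        # Blocking part: serialize the batch, then write it to disk.
        # (json is an assumption here; the commit mentions msgspec.)
        payload = json.dumps(batch).encode()
        path = self.out_dir / f"step_{step}.json"
        path.write_bytes(payload)
        return path

    async def send(self, step: int, batch: dict) -> Path:
        # The event loop keeps scheduling other tasks while the worker
        # thread runs the blocking encode + write.
        return await asyncio.to_thread(self._encode_and_write, step, batch)
```

Callers then need `await sender.send(step, batch)`, which is why the next commit touches the orchestrator call site.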

…uvloop

- await training_batch_sender.send(training_batch) since sender is async (O1+O2)
- compute_advantages and apply_filters wrapped in asyncio.to_thread (O3+O4): both iterate ~2k rollouts in pure Python, where the interpreter periodically hands off the GIL between bytecodes, so threading actually helps unlike the C-extension encode
- main() installs uvloop (O7): lower scheduler overhead matters when the orchestrator is juggling many concurrent rollouts + the HTTP client to inference; measurably reduces tail latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
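
A rough sketch of the two ideas in this commit, with assumptions called out: `compute_advantages_sketch` is a hypothetical stand-in (reward minus group mean), not the real `compute_advantages`, and the uvloop helper falls back silently when uvloop is not installed. Pure-Python work in a worker thread does not run in parallel with the loop, but CPython asks the running thread to give up the GIL after a switch interval (5 ms by default, see `sys.getswitchinterval()`), so the loop thread keeps getting turns.

```python
import asyncio


def compute_advantages_sketch(rewards: list[float], group_size: int) -> list[float]:
    """Hypothetical stand-in: each reward minus the mean reward of its group."""
    out: list[float] = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / len(group)
        out.extend(r - mean for r in group)
    return out


def install_uvloop_if_available() -> bool:
    """Switch asyncio to uvloop when the package is present (assumed optional)."""
    try:
        import uvloop
    except ImportError:
        return False
    uvloop.install()
    return True


async def advantages_off_loop(rewards: list[float], group_size: int) -> list[float]:
    # The loop stays free to service other coroutines while the
    # pure-Python loop runs on a worker thread.
    return await asyncio.to_thread(compute_advantages_sketch, rewards, group_size)
```

In a `main()` one would call `install_uvloop_if_available()` once, before `asyncio.run(...)`.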

…ning p90 alone undersells the long tail we see at step boundaries. p99 is the metric we actually compare across runs, so surface it in the wandb dict (`event_loop_lag/p99`) and in the busy-loop warning line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
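
To illustrate why p99 tells a different story from p90, here is a small sketch of a sleep-overshoot lag sampler with nearest-rank percentiles. The function names are hypothetical, the thresholds are the ones quoted in the PR description, and combining them with `or` is an assumption about how the warning fires.

```python
import asyncio
import math
import time


def percentile(values: list[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def should_warn(lags: list[float]) -> bool:
    # Thresholds from the PR text; the "or" combination is an assumption.
    return (
        percentile(lags, 90) > 1.0
        or percentile(lags, 99) > 5.0
        or max(lags) > 30.0
    )


async def sample_lag(n_samples: int, interval: float = 0.1) -> list[float]:
    """Sleep `interval` seconds repeatedly; any overshoot is loop lag."""
    lags: list[float] = []
    for _ in range(n_samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags
```

With 100 samples of which two are 6 s stalls and the rest 10 ms, p90 is still 10 ms while p99 is 6 s, so only the p99 threshold trips.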

extend_sample's previous implementation unpacked the accumulated routed_experts bytes, mutated one boundary entry, np.concatenated the new step's routings, then re-packed to bytes, on every extension. For an N-turn rollout reaching seq_len 60k, the accumulated buffer reached ~23 MB and was fully read+written N times, giving O(N²) byte work (~2.3 GB of memcpy per rollout). At 256 rollouts/step this dominated the orch's step-boundary event-loop stalls more than the encode path we'd already fixed.

Defer the concat: track per-sample `chunks: list[np.ndarray]` during the extension loop and finalize once at the end. The "boundary token" entry that vLLM omits for each request is appended as its own one-entry chunk between consecutive steps' contributions, so the final concat is a straight join, with no destructive writes to prior chunks. Verified byte-exact against the old implementation on a 50-step extension chain.

Per-rollout work drops from O(N²) → O(N): ~100× less memcpy in the realistic regime. This makes use_process_pool unnecessary for the residual orch lag: each rollout's process_rollout cost drops below the IPC pickle cost, so the threaded path is strictly better.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
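
The difference between rebuilding on every extension and joining once can be sketched with plain `bytes`. This is an illustration only: the commit works on `np.ndarray` chunks, and the function names, step sizes, and single-byte boundary entry below are assumptions. Both functions return the final buffer and the number of bytes written while building it.

```python
def extend_eager(steps: list[bytes], boundary: bytes) -> tuple[bytes, int]:
    """Rebuild the whole buffer on every extension (quadratic copying)."""
    buf = b""
    copied = 0
    for i, step in enumerate(steps):
        if i > 0:
            buf = buf + boundary + step  # copies all prior bytes again
        else:
            buf = step
        copied += len(buf)
    return buf, copied


def extend_deferred(steps: list[bytes], boundary: bytes) -> tuple[bytes, int]:
    """Collect chunks, then join once at finalize (linear copying)."""
    chunks: list[bytes] = []
    for i, step in enumerate(steps):
        if i > 0:
            chunks.append(boundary)  # boundary entry as its own chunk
        chunks.append(step)
    buf = b"".join(chunks)
    return buf, len(buf)


steps = [bytes([i]) * 1000 for i in range(50)]
eager, eager_copied = extend_eager(steps, b"\xff")
deferred, deferred_copied = extend_deferred(steps, b"\xff")
print(eager == deferred, eager_copied, deferred_copied)  # True 1276225 50049
```

For this 50-step chain the eager path writes roughly 25 times more bytes; the ratio grows linearly with the number of extensions, which is consistent with the order of magnitude the commit reports for long rollouts.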

- save_rollouts: json.dump -> orjson.dumps + bytes write. The pure-Python serializer held the GIL for the whole save phase (multi-second spikes on big steps); orjson serializes in compiled Rust code, so the GIL is held for a much shorter window.
- prepare/make/extend mask handling: [bool(i) for i in mask] was the remaining per-turn GIL-held Python loop. make_sample/extend_sample swapped to list(...) / direct extend; prepare_step_tokens uses list(map(bool, ...)).

Together these drop gather wallclock 5-10% on long-tail rollouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
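
A sketch of what the JSONL save and the mask conversion could look like. `save_rollouts_sketch`, `dumps_line`, and `to_bool_mask` are hypothetical names, and the `json` fallback exists only so the example runs where orjson is not installed; the real `save_rollouts` signature is not shown in this PR text.

```python
import json
import tempfile
from pathlib import Path

try:
    import orjson

    def dumps_line(obj: dict) -> bytes:
        # orjson.dumps returns bytes; OPT_APPEND_NEWLINE adds the trailing "\n".
        return orjson.dumps(obj, option=orjson.OPT_APPEND_NEWLINE)
except ImportError:
    def dumps_line(obj: dict) -> bytes:
        # Fallback so the sketch still runs without orjson.
        return json.dumps(obj).encode() + b"\n"


def save_rollouts_sketch(rollouts: list[dict], path: Path) -> None:
    """Write one JSON object per line, as bytes."""
    with path.open("wb") as f:
        for rollout in rollouts:
            f.write(dumps_line(rollout))


def to_bool_mask(mask: list[int]) -> list[bool]:
    # Same result as [bool(i) for i in mask], with the iteration done by map.
    return list(map(bool, mask))
```

When a mask already holds `bool` values, `list(mask)` is enough, which matches the `list(...)` swap the commit mentions for `make_sample`/`extend_sample`.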

…10Hz lag monitor

- Pretokenize was a no-op for router-replay (every step already has tokens populated by the renderer client), but the 256-way to_thread fanout starved the loop for ~2 s per step. Short-circuit when no rollout's trajectory has unpopulated tokens. Offline attribution: pretokenize lag drops 2050 ms → 11 ms.
- EventLoopLagMonitor sampling 1 Hz → 10 Hz (mirrors verifiers' monitor). 1 Hz was missing sub-second spikes that show up clearly at 10 Hz.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
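
The short-circuit can be sketched as follows. The rollout layout (a `trajectory` list of steps with `text` and `tokens` keys) and all function names are assumptions made for illustration; only the idea of checking first and skipping the thread fanout comes from the commit. This sketch additionally fans out only to the rollouts that need work, which is a choice of the example rather than something the commit states.

```python
import asyncio
from typing import Callable

Tokenizer = Callable[[str], list[int]]


def needs_pretokenize(rollout: dict) -> bool:
    """True if any trajectory step is still missing its tokens."""
    return any(step.get("tokens") is None for step in rollout["trajectory"])


def pretokenize_rollout(rollout: dict, tokenize: Tokenizer) -> None:
    for step in rollout["trajectory"]:
        if step.get("tokens") is None:
            step["tokens"] = tokenize(step["text"])


async def pretokenize_all(rollouts: list[dict], tokenize: Tokenizer) -> int:
    """Return how many rollouts were actually sent to worker threads."""
    pending = [r for r in rollouts if needs_pretokenize(r)]
    if not pending:
        # Nothing to do: no threads are spawned, so no GIL contention.
        return 0
    await asyncio.gather(
        *(asyncio.to_thread(pretokenize_rollout, r, tokenize) for r in pending)
    )
    return len(pending)
```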

Pulls in the env_worker / env_router event-loop lag fixes that just landed on verifiers main via PR 1453 — required for the orch-side gains in this PR to surface under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This was referenced May 23, 2026

perf(orchestrator): reduce event-loop lag #2608

Closed

perf(serve): reduce worker/router event-loop lag PrimeIntellect-ai/verifiers#1453

Merged

mikasenghaas force-pushed the perf/r3-rewrite branch 6 times, most recently from f502ac4 to fffbd8f Compare May 24, 2026 00:32

mikasenghaas changed the title ~~perf(orchestrator): reduce step-boundary event-loop lag~~ perf(orchestrator): reduce event-loop lag May 24, 2026

mikasenghaas force-pushed the perf/r3-rewrite branch 3 times, most recently from 334af2b to 3aeb519 Compare May 24, 2026 00:51

mikasenghaas mentioned this pull request May 24, 2026

perf(transport): stream-sidecar TrainingBatch sender (routed_experts split) #2612

Draft

samsja reviewed May 24, 2026

Comment thread src/prime_rl/orchestrator/orchestrator.py Outdated

mikasenghaas force-pushed the perf/r3-rewrite branch from 3aeb519 to 65fe273 Compare May 24, 2026 00:56

S1ro1 previously approved these changes May 24, 2026

Comment thread src/prime_rl/transport/filesystem.py

Comment thread src/prime_rl/orchestrator/vf_utils.py

mikasenghaas dismissed S1ro1’s stale review via dbb5a6c May 24, 2026 01:06

mikasenghaas force-pushed the perf/r3-rewrite branch 2 times, most recently from dbb5a6c to 871a69c Compare May 24, 2026 01:12

samsja and others added 7 commits May 24, 2026 01:14

mikasenghaas force-pushed the perf/r3-rewrite branch from 871a69c to b92f7d3 Compare May 24, 2026 01:14

mikasenghaas marked this pull request as ready for review May 24, 2026 01:15

mikasenghaas requested a review from samsja May 24, 2026 01:15

S1ro1 approved these changes May 24, 2026

samsja approved these changes May 24, 2026

mikasenghaas merged commit 8e97806 into main May 24, 2026
22 checks passed

mikasenghaas mentioned this pull request May 24, 2026

chore(configs): router replay perf repro #2614

Draft

mikasenghaas merged 7 commits into main from perf/r3-rewrite
