perf(orchestrator): reduce event-loop lag #2609

Merged
mikasenghaas merged 7 commits into main from perf/r3-rewrite
May 24, 2026

@mikasenghaas mikasenghaas commented May 23, 2026

Summary

Reduce orchestrator step-boundary event-loop lag, motivated by router-replay runs (Qwen3-30B-A3B + GLM-5.1) where the per-step routed_experts payload (~25 GB across 256 rollouts) caused multi-second loop-lag spikes at step boundaries.

Changes are all in the orchestrator and its post-rollout pipeline. The trainer/inference paths are untouched.

Coupled with PrimeIntellect-ai/verifiers#1453 (perf/r3-serve). The verifiers PR fixes the env-side counterpart of the same lag (env_worker / env_router), which this PR's verifiers submodule bump pulls in. Land or revert together — without the verifiers side the env worker loop saturates first under 256-rollout concurrency and the orch-side gains here don't surface.

Follow-up #2612 (perf/r3-sidecar) layers a routed_experts sidecar-split sender on top of this PR. Use it if the to_thread-only send here isn't enough for the largest router-replay payloads.

What's in here

7 commits, ~200 LOC. Each commit is a single intervention:

| # | Commit | What |
|---|--------|------|
| 1 | perf(transport): offload TrainingBatch send to a worker thread | Make TrainingBatchSender.send async; encode + disk write run via asyncio.to_thread so the orch event loop stays responsive during step transitions. |
| 2 | perf(orchestrator): await async sender; to_thread advantages+filters+uvloop | Await the now-async sender. Wrap compute_advantages and apply_filters in asyncio.to_thread (pure-Python work that was running on the loop). Install uvloop in main(). |
| 3 | perf(orchestrator): log p99 event-loop lag in metrics + busy-loop warning | Add event_loop_lag/p99 to the wandb dict and the busy-loop warning line. Raise warn thresholds to p90>1s / p99>5s / max>30s, since p90 alone undersold the long tail we actually care about. |
| 4 | perf(orchestrator): defer routed_experts concat in interleave_rollout | Per-extension unpack → numpy.concatenate → repack was O(N²) byte copies (~2.3 GB memcpy per long rollout). New path keeps a per-sample chunk list and concatenates once at finalize. O(N) byte work, ~100× less per rollout. |
| 5 | perf(orchestrator): orjson save_rollouts + dedup bool conversion | json.dump → orjson.dumps. The pure-Python serializer held the GIL for the whole save (multi-second spikes on big steps); orjson serializes in native (Rust) code, which shortens the time the GIL is held. Also dropped the redundant [bool(i) for i in mask] re-conversion in make_sample/extend_sample (use list(...) instead) and switched prepare_step_tokens to list(map(bool, ...)). |
| 6 | perf(orchestrator): skip pretokenize fanout when nothing needs work; 10Hz lag monitor | Pretokenize is a no-op for router-replay (every step has tokens populated), but the 256-way to_thread fanout still fired for no-op calls, and the resulting GIL stampede blocked the loop. Skip when no rollout needs work. Also raise the EventLoopLagMonitor sample rate from 1 Hz to 10 Hz to catch sub-second spikes. |
| 7 | chore(deps): bump verifiers submodule to PR 1453 head | Pulls in the env_worker / env_router lag fixes from verifiers#1453. |

What's NOT in here (deferred / dropped)

  • Streaming + routed_experts sidecar in the sender — moved to follow-up perf(transport): stream-sidecar TrainingBatch sender (routed_experts split) #2612. Real win if the simpler to_thread send here is insufficient under heavy router replay, but adds an on-disk format change and per-sample loop yields.
  • Shared ProcessPoolExecutor — tried for router-replay's GIL residual, didn't help (per-call pickle/IPC cost dominated the small per-rollout payload). Dropped.
  • Per-phase timing instrumentation + dump_raw_rollouts — useful while investigating but not worth keeping in prod.
  • Chunked gather (gather_chunk_size): sliced interleave_rollout waves with await sleep(0) between. Real wallclock cost for unclear value once the algorithmic fixes (commit #4) landed. Reverted.
  • Killing base64 on the routed_experts wire — profile shows pybase64.b64decode_as_bytearray is ~73% of single-thread CPU in interleave_rollout. Releases the GIL so it parallelises and doesn't block the loop directly, but it's the dominant wallclock floor. Moving inference→orch transport to msgpack bin would eliminate the encode+decode. Cross-repo, deferred.
  • TrainingSample.{prompt,completion}_mask: list[bool] → list[int] or bytes. Would kill the last Python loop in prepare_step_tokens. Interface change, deferred until empirical evidence it matters.

Note

Medium Risk
Touches the orchestrator post-rollout pipeline and transport TrainingBatchSender API, so regressions could impact training throughput or batch delivery ordering, though changes are largely performance-oriented and isolated from model/inference logic.

Overview
Reduces orchestrator event-loop stalls during step transitions by pushing CPU/GIL-heavy work off the loop and cutting per-rollout copying overhead.

compute_advantages, rollout filtering, and filesystem batch encoding/writes now run via asyncio.to_thread, and TrainingBatchSender.send becomes async (with the orchestrator awaiting it). Pretokenization fanout is skipped when all steps already have tokens, and uvloop is installed in main().

interleave_rollout now defers routed_experts concatenation/packing until finalization (chunking per step and concatenating once), and rollout JSONL saving switches to orjson. Event-loop lag monitoring is sampled at 10Hz and logs/exports p99 lag with updated warning thresholds.

Reviewed by Cursor Bugbot for commit b92f7d3.

@mikasenghaas mikasenghaas force-pushed the perf/r3-rewrite branch 6 times, most recently from f502ac4 to fffbd8f Compare May 24, 2026 00:32
@mikasenghaas mikasenghaas changed the title from "perf(orchestrator): reduce step-boundary event-loop lag" to "perf(orchestrator): reduce event-loop lag" May 24, 2026
@mikasenghaas mikasenghaas force-pushed the perf/r3-rewrite branch 3 times, most recently from 334af2b to 3aeb519 Compare May 24, 2026 00:51
Comment thread src/prime_rl/orchestrator/orchestrator.py Outdated
S1ro1 previously approved these changes May 24, 2026

@S1ro1 S1ro1 left a comment

love

Comment thread src/prime_rl/transport/filesystem.py
Comment thread src/prime_rl/orchestrator/vf_utils.py
@mikasenghaas mikasenghaas force-pushed the perf/r3-rewrite branch 2 times, most recently from dbb5a6c to 871a69c Compare May 24, 2026 01:12
samsja and others added 7 commits May 24, 2026 01:14
Make FileSystemTrainingBatchSender.send async and run the msgspec encode
+ disk write in asyncio.to_thread. Keeps the orch event loop responsive
during step transitions when the batch payload is large.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
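
To illustrate the pattern this commit describes, here is a minimal sketch of an async send that pushes the blocking encode and disk write into `asyncio.to_thread`. It is not the code from this PR: the class name `FileSystemBatchSender`, the file naming, and the temp-file-then-rename step are assumptions, and `json` stands in for the msgspec encoder so the example needs only the standard library.

```python
import asyncio
import json
import tempfile
from pathlib import Path


class FileSystemBatchSender:
    """Hypothetical stand-in for a filesystem-backed training-batch sender."""

    def __init__(self, output_dir: Path) -> None:
        self.output_dir = output_dir

    def _encode_and_write(self, step: int, batch: dict) -> Path:
        # Blocking part: serialize, write to a temp file, then rename into place.
        payload = json.dumps(batch).encode("utf-8")  # json is a stand-in encoder
        final_path = self.output_dir / f"step_{step}.json"
        tmp_path = final_path.with_suffix(".tmp")
        tmp_path.write_bytes(payload)
        tmp_path.replace(final_path)
        return final_path

    async def send(self, step: int, batch: dict) -> Path:
        # The worker thread does the blocking work; the event loop stays free
        # to schedule other coroutines while this call is awaited.
        return await asyncio.to_thread(self._encode_and_write, step, batch)


async def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        sender = FileSystemBatchSender(Path(tmp))
        path = await sender.send(step=0, batch={"tokens": [1, 2, 3]})
        return json.loads(path.read_text())
```

Callers have to `await sender.send(...)` once the method is async, which is why the following commit updates the orchestrator call site.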
…uvloop

- await training_batch_sender.send(training_batch) since sender is async (O1+O2)
- compute_advantages and apply_filters wrapped in asyncio.to_thread (O3+O4):
  both iterate ~2k rollouts in pure Python, where the interpreter can hand
  the GIL to the loop thread between bytecodes, so threading actually helps,
  unlike a C-extension encode that holds the GIL for the whole call
- main() installs uvloop (O7) — lower scheduler overhead matters when the
  orchestrator is juggling many concurrent rollouts + the HTTP client to
  inference; measurably reduces tail latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning

p90 alone undersells the long tail we see at step boundaries. p99 is the
metric we actually compare across runs, so surface it in the wandb dict
(`event_loop_lag/p99`) and in the busy-loop warning line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
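
A minimal sketch of how such a lag monitor could work, assuming lag is measured as how late `asyncio.sleep` wakes up. The names `LagMonitor`, `percentile`, and `is_busy` are hypothetical and not taken from the repository; the thresholds are the ones quoted in the PR summary, and combining them with `or` is an assumption.

```python
import asyncio
import math
import time


class LagMonitor:
    """Hypothetical monitor: records how late each asyncio.sleep wakes up."""

    def __init__(self, interval: float = 0.1) -> None:  # 0.1 s is 10 Hz
        self.interval = interval
        self.lags: list[float] = []

    async def run(self) -> None:
        while True:
            start = time.perf_counter()
            await asyncio.sleep(self.interval)
            # Time slept beyond the requested interval is event-loop lag.
            elapsed = time.perf_counter() - start
            self.lags.append(max(0.0, elapsed - self.interval))


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]; 0.0 for no samples."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def is_busy(lags: list[float]) -> bool:
    # Thresholds from the PR summary: p90 > 1 s, p99 > 5 s, max > 30 s.
    return (
        percentile(lags, 90) > 1.0
        or percentile(lags, 99) > 5.0
        or max(lags, default=0.0) > 30.0
    )


async def demo() -> list[float]:
    monitor = LagMonitor(interval=0.01)
    task = asyncio.create_task(monitor.run())
    await asyncio.sleep(0.1)
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return monitor.lags
```

With 100 samples, p90 ignores the ten worst readings while p99 ignores only the worst one, which is why p99 tracks step-boundary spikes more closely.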
extend_sample's previous implementation unpacked the accumulated
routed_experts bytes, mutated one boundary entry, np.concatenated the new
step's routings, then re-packed to bytes — every extension. For an N-turn
rollout reaching seq_len 60k, the accumulated buffer reached ~23 MB and was
fully read+written N times, giving O(N²) byte work (~2.3 GB of memcpy per
rollout). At 256 rollouts/step this dominated the orch's step-boundary
event-loop stalls more than the encode path we'd already fixed.

Defer the concat: track per-sample `chunks: list[np.ndarray]` during the
extension loop and finalize once at the end. The "boundary token" entry
that vLLM omits for each request is appended as its own one-entry chunk
between consecutive steps' contributions, so the final concat is a
straight join — no destructive writes to prior chunks.

Verified byte-exact against the old implementation on a 50-step extension
chain. Per-rollout work drops from O(N²) → O(N): ~100× less memcpy in the
realistic regime.

This makes use_process_pool unnecessary for the residual orch lag: each
rollout's process_rollout cost drops below the IPC pickle cost, so the
threaded path is strictly better.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
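
The difference between the two strategies can be sketched as follows. This is an illustration only: it uses plain `bytes` instead of NumPy arrays to stay dependency-free, and the function names and the single `boundary` value are hypothetical simplifications of the real per-request boundary entry.

```python
def extend_eager(steps: list[bytes], boundary: bytes) -> bytes:
    """Old pattern: rebuild the accumulated buffer on every extension."""
    buf = b""
    for i, step in enumerate(steps):
        if i > 0:
            buf = buf + boundary  # copies everything accumulated so far
        buf = buf + step          # and copies it again
    return buf


def extend_deferred(steps: list[bytes], boundary: bytes) -> bytes:
    """New pattern: collect chunks, then join exactly once at finalize."""
    chunks: list[bytes] = []
    for i, step in enumerate(steps):
        if i > 0:
            # The boundary entry is its own chunk between consecutive steps,
            # so earlier chunks are never modified.
            chunks.append(boundary)
        chunks.append(step)
    return b"".join(chunks)  # each byte is copied once
```

Both functions return identical bytes; only the amount of copying differs. With N extensions of similar size, the eager version copies on the order of N² bytes in total, while the deferred version copies on the order of N.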
- save_rollouts: json.dump -> orjson.dumps + bytes write. The pure-Python
  serializer held the GIL for the whole save phase (multi-second spikes
  on big steps); orjson serializes in native (Rust) code, so the GIL is
  held for far less time.
- prepare/make/extend mask handling: [bool(i) for i in mask] was the
  remaining per-turn GIL-held Python loop. make_sample/extend_sample
  swapped to list(...) / direct extend; prepare_step_tokens uses
  list(map(bool, ...)). Together drops gather wallclock 5-10% on
  long-tail rollouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
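
For reference, the three mask-conversion forms mentioned above produce the same list. A small sketch with hypothetical function names; the relative speed is not measured here.

```python
def to_bools_comprehension(mask: list[int]) -> list[bool]:
    # Original form: one Python-level bool() call per element.
    return [bool(i) for i in mask]


def to_bools_map(mask: list[int]) -> list[bool]:
    # Same result; the iteration runs inside map/list rather than in
    # comprehension bytecode.
    return list(map(bool, mask))


def copy_bools(mask: list[bool]) -> list[bool]:
    # When the mask already holds bools, a plain copy is enough.
    return list(mask)
```

The `list(...)` copy is only equivalent when the incoming mask already contains bools, which is the assumption behind the make_sample/extend_sample change.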
…10Hz lag monitor

- Pretokenize was a no-op for router-replay (every step already has
  tokens populated by the renderer client), but the 256-way to_thread
  fanout starved the loop for ~2 s per step. Short-circuit when no
  rollout's trajectory has unpopulated tokens. Offline attribution:
  pretokenize lag drops 2050 ms → 11 ms.
- EventLoopLagMonitor sampling 1 Hz → 10 Hz (mirrors verifiers'
  monitor). 1 Hz was missing sub-second spikes that show up clearly
  at 10 Hz.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
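
The short-circuit can be sketched like this, assuming rollouts are dicts with a `trajectory` list whose steps carry an optional `tokens` field. The data layout, the function names, and the whitespace "tokenizer" are all hypothetical placeholders.

```python
import asyncio


def needs_pretokenize(rollout: dict) -> bool:
    # A rollout needs work if any step is missing its tokens.
    return any(step.get("tokens") is None for step in rollout["trajectory"])


def pretokenize(rollout: dict) -> dict:
    for step in rollout["trajectory"]:
        if step.get("tokens") is None:
            step["tokens"] = step["text"].split()  # placeholder tokenizer
    return rollout


async def pretokenize_all(rollouts: list[dict]) -> int:
    """Return how many to_thread calls were dispatched."""
    if not any(needs_pretokenize(r) for r in rollouts):
        return 0  # fast path: no worker threads are started at all
    await asyncio.gather(*(asyncio.to_thread(pretokenize, r) for r in rollouts))
    return len(rollouts)
```

The check itself is a cheap scan on the loop thread, so when every step is already tokenized the whole fanout is avoided.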
Pulls in the env_worker / env_router event-loop lag fixes that just
landed on verifiers main via PR 1453 — required for the orch-side
gains in this PR to surface under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas marked this pull request as ready for review May 24, 2026 01:15
@mikasenghaas mikasenghaas requested a review from samsja May 24, 2026 01:15
@mikasenghaas mikasenghaas merged commit 8e97806 into main May 24, 2026
22 checks passed
3 participants