perf(orchestrator): reduce event-loop lag#2609
Merged
Merged
Conversation
This was referenced May 23, 2026
f502ac4 to
fffbd8f
Compare
334af2b to
3aeb519
Compare
samsja
reviewed
May 24, 2026
3aeb519 to
65fe273
Compare
S1ro1
previously approved these changes
May 24, 2026
dbb5a6c to
871a69c
Compare
Make FileSystemTrainingBatchSender.send async and run the msgspec encode + disk write in asyncio.to_thread. Keeps the orch event loop responsive during step transitions when the batch payload is large. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uvloop - await training_batch_sender.send(training_batch) since sender is async (O1+O2) - compute_advantages and apply_filters wrapped in asyncio.to_thread (O3+O4): both iterate ~2k rollouts in pure Python and release the GIL on every bytecode tick, so threading actually helps unlike the C-extension encode - main() installs uvloop (O7) — lower scheduler overhead matters when the orchestrator is juggling many concurrent rollouts + the HTTP client to inference; measurably reduces tail latency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning p90 alone undersells the long tail we see at step boundaries. p99 is the metric we actually compare across runs, so surface it in the wandb dict (`event_loop_lag/p99`) and in the busy-loop warning line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
extend_sample's previous implementation unpacked the accumulated routed_experts bytes, mutated one boundary entry, np.concatenated the new step's routings, then re-packed to bytes — every extension. For an N-turn rollout reaching seq_len 60k, the accumulated buffer reached ~23 MB and was fully read+written N times, giving O(N²) byte work (~2.3 GB of memcpy per rollout). At 256 rollouts/step this dominated the orch's step-boundary event-loop stalls more than the encode path we'd already fixed. Defer the concat: track per-sample `chunks: list[np.ndarray]` during the extension loop and finalize once at the end. The "boundary token" entry that vLLM omits for each request is appended as its own one-entry chunk between consecutive steps' contributions, so the final concat is a straight join — no destructive writes to prior chunks. Verified byte-exact against the old implementation on a 50-step extension chain. Per-rollout work drops from O(N²) → O(N): ~100× less memcpy in the realistic regime. This makes use_process_pool unnecessary for the residual orch lag: each rollout's process_rollout cost drops below the IPC pickle cost, so the threaded path is strictly better. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- save_rollouts: json.dump -> orjson.dumps + bytes write. The pure-Python serializer held the GIL for the whole save phase (multi-second spikes on big steps); orjson serializes in C and releases the GIL. - prepare/make/extend mask handling: [bool(i) for i in mask] was the remaining per-turn GIL-held Python loop. make_sample/extend_sample swapped to list(...) / direct extend; prepare_step_tokens uses list(map(bool, ...)). Together drops gather wallclock 5-10% on long-tail rollouts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…10Hz lag monitor - Pretokenize was a no-op for router-replay (every step already has tokens populated by the renderer client), but the 256-way to_thread fanout starved the loop for ~2 s per step. Short-circuit when no rollout's trajectory has unpopulated tokens. Offline attribution: pretokenize lag drops 2050 ms → 11 ms. - EventLoopLagMonitor sampling 1 Hz → 10 Hz (mirrors verifiers' monitor). 1 Hz was missing sub-second spikes that show up clearly at 10 Hz. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in the env_worker / env_router event-loop lag fixes that just landed on verifiers main via PR 1453 — required for the orch-side gains in this PR to surface under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
871a69c to
b92f7d3
Compare
S1ro1
approved these changes
May 24, 2026
samsja
approved these changes
May 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reduce orchestrator step-boundary event-loop lag, motivated by router-replay runs (Qwen3-30B-A3B + GLM-5.1) where the per-step routed_experts payload (~25 GB across 256 rollouts) caused multi-second loop-lag spikes at step boundaries.
Changes are all in the orchestrator and its post-rollout pipeline. The trainer/inference paths are untouched.
What's in here
7 commits, ~200 LOC. Each commit is a single intervention:
perf(transport): offload TrainingBatch send to a worker threadTrainingBatchSender.sendasync; encode + disk write run viaasyncio.to_threadso the orch event loop stays responsive during step transitions.perf(orchestrator): await async sender; to_thread advantages+filters+uvloopcompute_advantagesandapply_filtersinasyncio.to_thread— pure-Python work that was running on the loop. Install uvloop inmain().perf(orchestrator): log p99 event-loop lag in metrics + busy-loop warningevent_loop_lag/p99to the wandb dict and the busy-loop warning line. Raise warn thresholds to p90>1s / p99>5s / max>30s — p90 alone undersold the long tail we actually care about.perf(orchestrator): defer routed_experts concat in interleave_rolloutunpack → numpy.concatenate → repackwas O(N²) byte copies (~2.3 GB memcpy per long rollout). New path keeps a per-sample chunk list and concatenates once at finalize. O(N) byte work, ~100× less per rollout.perf(orchestrator): orjson save_rollouts + dedup bool conversionjson.dump→orjson.dumps. Pure-Python serializer held the GIL for the whole save (multi-second spikes on big steps); orjson serializes in C and releases the GIL. Also dropped the redundant[bool(i) for i in mask]re-conversion inmake_sample/extend_sample(uselist(...)instead) and switchedprepare_step_tokenstolist(map(bool, ...)).perf(orchestrator): skip pretokenize fanout when nothing needs work; 10Hz lag monitortokenspopulated), but the 256-wayto_threadfanout was firing only to no-op and the GIL stampede blocked the loop. Skip when no rollout needs work. AlsoEventLoopLagMonitorsample rate 1 Hz → 10 Hz to catch sub-second spikes.chore(deps): bump verifiers submodule to PR 1453 headWhat's NOT in here (deferred / dropped)
ProcessPoolExecutor— tried for router-replay's GIL residual, didn't help (per-call pickle/IPC cost dominated the small per-rollout payload). Dropped.dump_raw_rollouts— useful while investigating but not worth keeping in prod.gather_chunk_size) — slicedinterleave_rolloutwaves withawait sleep(0)between. Real wallclock cost for unclear value once the algorithmic fixes (Data streaming #4) landed. Reverted.pybase64.b64decode_as_bytearrayis ~73% of single-thread CPU ininterleave_rollout. Releases the GIL so it parallelises and doesn't block the loop directly, but it's the dominant wallclock floor. Moving inference→orch transport to msgpackbinwould eliminate the encode+decode. Cross-repo, deferred.TrainingSample.{prompt,completion}_mask:list[bool]→list[int]orbytes— would kill the last Python loop inprepare_step_tokens. Interface change, deferred until empirical evidence it matters.Note
Medium Risk
Touches the orchestrator post-rollout pipeline and transport
TrainingBatchSenderAPI, so regressions could impact training throughput or batch delivery ordering, though changes are largely performance-oriented and isolated from model/inference logic.Overview
Reduces orchestrator event-loop stalls during step transitions by pushing CPU/GIL-heavy work off the loop and cutting per-rollout copying overhead.
compute_advantages, rollout filtering, and filesystem batch encoding/writes now run viaasyncio.to_thread, andTrainingBatchSender.sendbecomesasync(with the orchestrator awaiting it). Pretokenization fanout is skipped when all steps already have tokens, anduvloopis installed inmain().interleave_rolloutnow defersrouted_expertsconcatenation/packing until finalization (chunking per step and concatenating once), and rollout JSONL saving switches toorjson. Event-loop lag monitoring is sampled at 10Hz and logs/exportsp99lag with updated warning thresholds.Reviewed by Cursor Bugbot for commit b92f7d3. Bugbot is set up for automated code reviews on this repo. Configure here.