perf(transport): stream-sidecar TrainingBatch sender (routed_experts split) by mikasenghaas · Pull Request #2612 · PrimeIntellect-ai/prime-rl

mikasenghaas · 2026-05-24T00:52:03Z

Summary

Draft follow-up to #2609, layered on top of perf/r3-rewrite; only the transport-side change is new. Merge or rebase this after #2609 lands, or pick it up directly if the to_thread-only send in #2609 isn't sufficient for our largest router-replay payloads.

What's in here (one commit)

perf(transport): stream-sidecar TrainingBatch sender (routed_experts split)

Peels routed_experts.data out of the msgpack encode and writes it as a single raw-bytes sidecar file (train_rollouts.routed_experts.bin). The hot bytes (~85% of payload) skip the msgpack bin framing and the copy it implies on encode, and the corresponding allocation on decode.
Streams TrainingSample frames one at a time with await asyncio.sleep(0) between, so even the encode phase yields to the event loop instead of holding it through a single big encoder.encode(batch) call.
Sidecar must be visible before the main file — receiver watches the main file as the "batch is ready" signal.
v2 file format: 4-byte manifest_len + manifest + (4-byte frame_len + sample-frame) * n. Receiver reconstructs samples from the manifest's offsets/shapes/dtypes arrays.
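The layout and write ordering described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the little-endian length prefix, the JSON manifest (the PR uses msgpack), the cumulative `offsets` list, and the names `write_batch`, `read_batch`, and `roundtrip_demo` are all assumptions made for the example. Only the two file names and the "length prefix + manifest + length-prefixed frames" shape come from the description.

```python
import json
import os
import struct
import tempfile
from pathlib import Path

LEN = struct.Struct("<I")  # assumption: 4-byte little-endian unsigned length prefix


def write_batch(step_dir: Path, manifest: dict, frames: list, sidecar: bytes) -> None:
    """Write the sidecar first, then the main file, each via an atomic rename."""
    side_path = step_dir / "train_rollouts.routed_experts.bin"
    main_path = step_dir / "train_rollouts.bin"

    # 1) Sidecar: raw bytes with no framing. It must be in place before the main file.
    tmp = side_path.with_suffix(".tmp")
    tmp.write_bytes(sidecar)
    os.replace(tmp, side_path)

    # 2) Main file: manifest_len + manifest + (frame_len + frame) * n.
    manifest_bytes = json.dumps(manifest).encode()  # stand-in for a msgpack manifest
    tmp = main_path.with_suffix(".tmp")
    with open(tmp, "wb") as f:
        f.write(LEN.pack(len(manifest_bytes)))
        f.write(manifest_bytes)
        for frame in frames:
            f.write(LEN.pack(len(frame)))
            f.write(frame)
    os.replace(tmp, main_path)  # the receiver treats this file as the "ready" signal


def read_batch(step_dir: Path):
    """Parse the main file, then slice the sidecar using the manifest offsets."""
    data = (step_dir / "train_rollouts.bin").read_bytes()
    (manifest_len,) = LEN.unpack_from(data, 0)
    pos = LEN.size
    manifest = json.loads(data[pos:pos + manifest_len])
    pos += manifest_len

    frames = []
    while pos < len(data):
        (frame_len,) = LEN.unpack_from(data, pos)
        pos += LEN.size
        frames.append(data[pos:pos + frame_len])
        pos += frame_len

    sidecar = (step_dir / "train_rollouts.routed_experts.bin").read_bytes()
    offsets = manifest["offsets"]  # hypothetical: n + 1 cumulative byte offsets
    blobs = [sidecar[a:b] for a, b in zip(offsets, offsets[1:])]
    return manifest, frames, blobs


def roundtrip_demo():
    """Write a tiny two-sample batch to a temp directory and read it back."""
    with tempfile.TemporaryDirectory() as d:
        step_dir = Path(d)
        write_batch(step_dir, {"offsets": [0, 3, 5]}, [b"a", b"bcd"], b"xyzuv")
        return read_batch(step_dir)
```

Writing each file to a temporary name and renaming it means a receiver polling for train_rollouts.bin never observes a partially written file, and by the time it appears the sidecar is already complete.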

When to land this

Use as a fallback if #2609 alone doesn't bring step-boundary lag low enough on router-replay runs (Qwen3-30B-A3B + GLM-5.1 territory, ~25 GB routed_experts payload). For non-router-replay workloads the simpler to_thread-only send in #2609 should be sufficient.

Risk

New on-disk format (v2). Backward-compat would require keeping both readers around; this PR deletes the v1 reader, so old run artifacts won't be readable by this branch. Acceptable for an experimental knob; not for general release until nothing depends on v1 artifacts anymore.
Per-sample await sleep(0) is harmless but adds N loop ticks per batch (small CPU overhead, large lag reduction).
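The per-sample yield can be illustrated with a small sketch. The names `encode_streaming`, `encode_one`, and `demo` are hypothetical and the encoder is a trivial stand-in; the point is only that `await asyncio.sleep(0)` between frames gives other tasks a turn on the loop.

```python
import asyncio


async def encode_streaming(samples, encode_one):
    """Encode one sample per loop iteration, yielding to the event loop in between."""
    frames = []
    for sample in samples:
        frames.append(encode_one(sample))  # blocking, but short per sample
        await asyncio.sleep(0)  # one loop tick: lets other ready tasks run
    return frames


async def demo():
    """Run a competing task to show that it makes progress during the encode."""
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0)

    hb = asyncio.create_task(heartbeat())
    frames = await encode_streaming(range(5), lambda s: str(s).encode())
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return frames, ticks
```

With a single `encoder.encode(batch)` call the heartbeat task would not run at all until the whole batch was encoded; with per-sample yields it gets scheduled between frames, at the cost of one extra loop tick per sample.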

Make FileSystemTrainingBatchSender.send async and run the msgspec encode + disk write in asyncio.to_thread. Keeps the orch event loop responsive during step transitions when the batch payload is large.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
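The change in the commit above might look roughly like this. `FileSenderSketch` and `encode_and_write` are hypothetical stand-ins (the real class is FileSystemTrainingBatchSender and the real encoder is msgspec; `json` is used here only to keep the sketch dependency-free).

```python
import asyncio
import json
import tempfile
from pathlib import Path


def encode_and_write(path: Path, batch: dict) -> int:
    """Blocking work: serialize the batch and write it to disk atomically."""
    payload = json.dumps(batch).encode()  # stand-in for the msgspec encode
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(payload)
    tmp.replace(path)  # rename so a reader never sees a partial file
    return len(payload)


class FileSenderSketch:
    """Hypothetical stand-in for FileSystemTrainingBatchSender."""

    def __init__(self, out_dir: Path) -> None:
        self.out_dir = out_dir

    async def send(self, batch: dict) -> int:
        path = self.out_dir / "train_rollouts.bin"
        # The blocking encode + write runs in a worker thread; the coroutine
        # suspends here, so the loop can schedule other tasks in the meantime.
        return await asyncio.to_thread(encode_and_write, path, batch)


def demo():
    """Send one small batch into a temp directory and read it back."""
    with tempfile.TemporaryDirectory() as d:
        sender = FileSenderSketch(Path(d))
        n = asyncio.run(sender.send({"step": 1, "tokens": [1, 2, 3]}))
        return n, json.loads((Path(d) / "train_rollouts.bin").read_bytes())
```

How much this helps depends on whether the work in the thread releases the GIL; disk writes do, while a C-extension encoder may hold it for the whole call, which is the case the sidecar split is meant to address.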

…uvloop

- await training_batch_sender.send(training_batch) since sender is async (O1+O2)
- compute_advantages and apply_filters wrapped in asyncio.to_thread (O3+O4): both iterate ~2k rollouts in pure Python, where the interpreter periodically hands the GIL to other threads, so threading actually helps, unlike the C-extension encode
- main() installs uvloop (O7): lower scheduler overhead matters when the orchestrator is juggling many concurrent rollouts + the HTTP client to inference; measurably reduces tail latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ning

p90 alone undersells the long tail we see at step boundaries. p99 is the metric we actually compare across runs, so surface it in the wandb dict (`event_loop_lag/p99`) and in the busy-loop warning line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

extend_sample's previous implementation unpacked the accumulated routed_experts bytes, mutated one boundary entry, np.concatenated the new step's routings, then re-packed to bytes on every extension. For an N-turn rollout reaching seq_len 60k, the accumulated buffer reached ~23 MB and was fully read+written N times, giving O(N²) byte work (~2.3 GB of memcpy per rollout). At 256 rollouts/step this dominated the orch's step-boundary event-loop stalls more than the encode path we'd already fixed.

Defer the concat: track per-sample `chunks: list[np.ndarray]` during the extension loop and finalize once at the end. The "boundary token" entry that vLLM omits for each request is appended as its own one-entry chunk between consecutive steps' contributions, so the final concat is a straight join with no destructive writes to prior chunks. Verified byte-exact against the old implementation on a 50-step extension chain.

Per-rollout work drops from O(N²) to O(N): ~100× less memcpy in the realistic regime. This makes use_process_pool unnecessary for the residual orch lag: each rollout's process_rollout cost drops below the IPC pickle cost, so the threaded path is strictly better.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
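The deferred-concat idea can be sketched as follows. To stay dependency-free this example uses `bytes` chunks and `b"".join` where the commit uses `np.ndarray` chunks and a single concatenate; `DeferredRoutings` and `eager_chain` are hypothetical names, and the boundary entry is passed in explicitly rather than derived.

```python
class DeferredRoutings:
    """Collect per-step chunks and join them once at the end (O(N) bytes copied)."""

    def __init__(self) -> None:
        self.chunks = []

    def extend(self, step: bytes, boundary: bytes) -> None:
        # The boundary entry goes in as its own chunk between consecutive steps,
        # so earlier chunks are never rewritten.
        if self.chunks:
            self.chunks.append(boundary)
        self.chunks.append(step)

    def finalize(self) -> bytes:
        return b"".join(self.chunks)  # single pass over all chunks


def eager_chain(steps, boundary: bytes) -> bytes:
    """Old pattern for comparison: rebuild the whole buffer on every extension."""
    acc = b""
    for i, step in enumerate(steps):
        acc = step if i == 0 else acc + boundary + step  # copies all of acc each time
    return acc
```

Both paths produce the same bytes; the difference is that the eager version copies the accumulated buffer on every step, while the deferred version copies each chunk exactly once in `finalize`.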

- save_rollouts: json.dump -> orjson.dumps + bytes write. The pure-Python serializer held the GIL for the whole save phase (multi-second spikes on big steps); orjson serializes in native code, which cuts the time spent in that phase.
- prepare/make/extend mask handling: [bool(i) for i in mask] was the remaining per-turn GIL-held Python loop. make_sample/extend_sample swapped to list(...) / direct extend; prepare_step_tokens uses list(map(bool, ...)). Together these drop gather wallclock 5-10% on long-tail rollouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…10Hz lag monitor

- Pretokenize was a no-op for router-replay (every step already has tokens populated by the renderer client), but the 256-way to_thread fanout starved the loop for ~2 s per step. Short-circuit when no rollout's trajectory has unpopulated tokens. Offline attribution: pretokenize lag drops 2050 ms → 11 ms.
- EventLoopLagMonitor sampling 1 Hz → 10 Hz (mirrors verifiers' monitor). 1 Hz was missing sub-second spikes that show up clearly at 10 Hz.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
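For readers unfamiliar with the technique, an event-loop lag monitor of this kind can be sketched as below. This is not the EventLoopLagMonitor from the repository: `sample_lag` and `percentile` are hypothetical names, and the nearest-rank percentile is one of several common definitions.

```python
import asyncio
import math
import time


async def sample_lag(interval: float, n_samples: int):
    """Measure how late the loop wakes us after each sleep of `interval` seconds."""
    lags = []
    for _ in range(n_samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Anything beyond the requested interval is time the loop spent busy elsewhere.
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


def percentile(values, q: int):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

Sampling with `interval=0.1` corresponds to the 10 Hz rate mentioned above: a stall shorter than the sampling interval can fall entirely between two samples at 1 Hz, while at 10 Hz it is far more likely to delay at least one wake-up. Reporting `percentile(lags, 99)` alongside p90 exposes the step-boundary tail.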

Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…split)

Layered follow-up on perf/r3-rewrite. Keep this for when the to_thread-only send isn't enough: splits routed_experts.data (~85% of payload) into a raw-bytes sidecar so the main msgpack encode no longer pays the framing/copy cost on the hot bytes.

- Streams TrainingSample frames per-sample with `await sleep(0)` between, keeping the loop responsive during large-batch encode.
- Writes train_rollouts.bin + train_rollouts.routed_experts.bin, with the sidecar landing before the main file (the receiver watches the main file as the "ready" signal).
- New v2 format: 4-byte manifest_len + manifest + per-sample (4-byte frame_len + frame). Receiver reconstructs samples from the manifest's offsets/shapes/dtypes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mikasenghaas mentioned this pull request May 24, 2026

perf(orchestrator): reduce event-loop lag #2609

Merged

samsja and others added 5 commits May 24, 2026 00:55

mikasenghaas force-pushed the perf/r3-rewrite branch from 3aeb519 to 65fe273 Compare May 24, 2026 00:56

mikasenghaas force-pushed the perf/r3-sidecar branch from eebf4ec to c76f1aa Compare May 24, 2026 00:57

mikasenghaas force-pushed the perf/r3-rewrite branch from 65fe273 to dbb5a6c Compare May 24, 2026 01:06

mikasenghaas force-pushed the perf/r3-sidecar branch from c76f1aa to c1aecce Compare May 24, 2026 01:06

mikasenghaas force-pushed the perf/r3-rewrite branch 2 times, most recently from 871a69c to b92f7d3 Compare May 24, 2026 01:14

mikasenghaas wants to merge 8 commits into perf/r3-rewrite from perf/r3-sidecar
