Skip to content

perf(transport): stream-sidecar TrainingBatch sender (routed_experts split)#2612

Draft
mikasenghaas wants to merge 8 commits into
perf/r3-rewritefrom
perf/r3-sidecar
Draft

perf(transport): stream-sidecar TrainingBatch sender (routed_experts split)#2612
mikasenghaas wants to merge 8 commits into
perf/r3-rewritefrom
perf/r3-sidecar

Conversation

@mikasenghaas
Copy link
Copy Markdown
Member

Summary

Draft follow-up to #2609. Layered on top of perf/r3-rewrite — only the transport-side change is new. Merge or rebase this after #2609 lands; or just pick it up directly if the to_thread-only send in #2609 isn't sufficient for our largest router-replay payloads.

What's in here (one commit)

perf(transport): stream-sidecar TrainingBatch sender (routed_experts split)

  • Peels routed_experts.data out of the msgpack encode and writes it as a single raw-bytes sidecar file (train_rollouts.routed_experts.bin). The hot bytes (~85% of payload) skip the base64-style msgpack bin framing on encode and the corresponding allocation on decode.
  • Streams TrainingSample frames one at a time with await asyncio.sleep(0) between, so even the encode phase yields to the event loop instead of holding it through a single big encoder.encode(batch) call.
  • Sidecar must be visible before the main file — receiver watches the main file as the "batch is ready" signal.
  • v2 file format: 4-byte manifest_len + manifest + (4-byte frame_len + sample-frame) * n. Receiver reconstructs samples from the manifest's offsets/shapes/dtypes arrays.

When to land this

Use as a fallback if #2609 alone doesn't bring step-boundary lag low enough on router-replay runs (Qwen3-30B-A3B + GLM-5.1 territory, ~25 GB routed_experts payload). For non-router-replay workloads the simpler to_thread-only send in #2609 should be sufficient.

Risk

  • New on-disk format (v2). Backward-compat would require keeping both readers around; this PR deletes the v1 reader. Old run artifacts won't be readable by this branch. Acceptable for an experimental knob; not for general release until v1 is migrated away from.
  • Per-sample await sleep(0) is harmless but adds N loop ticks per batch (small CPU overhead, large lag reduction).

Make FileSystemTrainingBatchSender.send async and run the msgspec encode
+ disk write in asyncio.to_thread. Keeps the orch event loop responsive
during step transitions when the batch payload is large.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
samsja and others added 5 commits May 24, 2026 00:55
…uvloop

- await training_batch_sender.send(training_batch) since sender is async (O1+O2)
- compute_advantages and apply_filters wrapped in asyncio.to_thread (O3+O4):
  both iterate ~2k rollouts in pure Python and release the GIL on every
  bytecode tick, so threading actually helps unlike the C-extension encode
- main() installs uvloop (O7) — lower scheduler overhead matters when the
  orchestrator is juggling many concurrent rollouts + the HTTP client to
  inference; measurably reduces tail latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning

p90 alone undersells the long tail we see at step boundaries. p99 is the
metric we actually compare across runs, so surface it in the wandb dict
(`event_loop_lag/p99`) and in the busy-loop warning line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
extend_sample's previous implementation unpacked the accumulated
routed_experts bytes, mutated one boundary entry, np.concatenated the new
step's routings, then re-packed to bytes — every extension. For an N-turn
rollout reaching seq_len 60k, the accumulated buffer reached ~23 MB and was
fully read+written N times, giving O(N²) byte work (~2.3 GB of memcpy per
rollout). At 256 rollouts/step this dominated the orch's step-boundary
event-loop stalls more than the encode path we'd already fixed.

Defer the concat: track per-sample `chunks: list[np.ndarray]` during the
extension loop and finalize once at the end. The "boundary token" entry
that vLLM omits for each request is appended as its own one-entry chunk
between consecutive steps' contributions, so the final concat is a
straight join — no destructive writes to prior chunks.

Verified byte-exact against the old implementation on a 50-step extension
chain. Per-rollout work drops from O(N²) → O(N): ~100× less memcpy in the
realistic regime.

This makes use_process_pool unnecessary for the residual orch lag: each
rollout's process_rollout cost drops below the IPC pickle cost, so the
threaded path is strictly better.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- save_rollouts: json.dump -> orjson.dumps + bytes write. The pure-Python
  serializer held the GIL for the whole save phase (multi-second spikes
  on big steps); orjson serializes in C and releases the GIL.
- prepare/make/extend mask handling: [bool(i) for i in mask] was the
  remaining per-turn GIL-held Python loop. make_sample/extend_sample
  swapped to list(...) / direct extend; prepare_step_tokens uses
  list(map(bool, ...)). Together drops gather wallclock 5-10% on
  long-tail rollouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…10Hz lag monitor

- Pretokenize was a no-op for router-replay (every step already has
  tokens populated by the renderer client), but the 256-way to_thread
  fanout starved the loop for ~2 s per step. Short-circuit when no
  rollout's trajectory has unpopulated tokens. Offline attribution:
  pretokenize lag drops 2050 ms → 11 ms.
- EventLoopLagMonitor sampling 1 Hz → 10 Hz (mirrors verifiers'
  monitor). 1 Hz was missing sub-second spikes that show up clearly
  at 10 Hz.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in the env_worker / env_router event-loop lag fixes from
PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in
this PR to be visible under realistic 256-rollout concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…split)

Layered follow-up on perf/r3-rewrite. Keep this for when the
to_thread-only send isn't enough — splits routed_experts.data
(~85% of payload) into a raw-bytes sidecar so the main msgpack
encode no longer pays the base64/copy cost on the hot bytes.

- Streams TrainingSample frames per-sample with `await sleep(0)`
  between, keeping the loop responsive during large-batch encode.
- Writes train_rollouts.bin + train_rollouts.routed_experts.bin,
  with the sidecar landing before the main file (the receiver
  watches the main file as the "ready" signal).
- New v2 format: 4-byte manifest_len + manifest + per-sample
  (4-byte frame_len + frame). Receiver reconstructs samples from
  the manifest's offsets/shapes/dtypes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas force-pushed the perf/r3-rewrite branch 2 times, most recently from 871a69c to b92f7d3 Compare May 24, 2026 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants