perf(transport): stream-sidecar TrainingBatch sender (routed_experts split)#2612
Draft
mikasenghaas wants to merge 8 commits into
Draft
perf(transport): stream-sidecar TrainingBatch sender (routed_experts split)#2612mikasenghaas wants to merge 8 commits into
mikasenghaas wants to merge 8 commits into
Conversation
Make FileSystemTrainingBatchSender.send async and run the msgspec encode + disk write in asyncio.to_thread. Keeps the orch event loop responsive during step transitions when the batch payload is large. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uvloop - await training_batch_sender.send(training_batch) since sender is async (O1+O2) - compute_advantages and apply_filters wrapped in asyncio.to_thread (O3+O4): both iterate ~2k rollouts in pure Python and release the GIL on every bytecode tick, so threading actually helps unlike the C-extension encode - main() installs uvloop (O7) — lower scheduler overhead matters when the orchestrator is juggling many concurrent rollouts + the HTTP client to inference; measurably reduces tail latency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning p90 alone undersells the long tail we see at step boundaries. p99 is the metric we actually compare across runs, so surface it in the wandb dict (`event_loop_lag/p99`) and in the busy-loop warning line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
extend_sample's previous implementation unpacked the accumulated routed_experts bytes, mutated one boundary entry, np.concatenated the new step's routings, then re-packed to bytes — every extension. For an N-turn rollout reaching seq_len 60k, the accumulated buffer reached ~23 MB and was fully read+written N times, giving O(N²) byte work (~2.3 GB of memcpy per rollout). At 256 rollouts/step this dominated the orch's step-boundary event-loop stalls more than the encode path we'd already fixed. Defer the concat: track per-sample `chunks: list[np.ndarray]` during the extension loop and finalize once at the end. The "boundary token" entry that vLLM omits for each request is appended as its own one-entry chunk between consecutive steps' contributions, so the final concat is a straight join — no destructive writes to prior chunks. Verified byte-exact against the old implementation on a 50-step extension chain. Per-rollout work drops from O(N²) → O(N): ~100× less memcpy in the realistic regime. This makes use_process_pool unnecessary for the residual orch lag: each rollout's process_rollout cost drops below the IPC pickle cost, so the threaded path is strictly better. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- save_rollouts: json.dump -> orjson.dumps + bytes write. The pure-Python serializer held the GIL for the whole save phase (multi-second spikes on big steps); orjson serializes in C and releases the GIL. - prepare/make/extend mask handling: [bool(i) for i in mask] was the remaining per-turn GIL-held Python loop. make_sample/extend_sample swapped to list(...) / direct extend; prepare_step_tokens uses list(map(bool, ...)). Together drops gather wallclock 5-10% on long-tail rollouts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…10Hz lag monitor - Pretokenize was a no-op for router-replay (every step already has tokens populated by the renderer client), but the 256-way to_thread fanout starved the loop for ~2 s per step. Short-circuit when no rollout's trajectory has unpopulated tokens. Offline attribution: pretokenize lag drops 2050 ms → 11 ms. - EventLoopLagMonitor sampling 1 Hz → 10 Hz (mirrors verifiers' monitor). 1 Hz was missing sub-second spikes that show up clearly at 10 Hz. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3aeb519 to
65fe273
Compare
eebf4ec to
c76f1aa
Compare
Pulls in the env_worker / env_router event-loop lag fixes from PrimeIntellect-ai/verifiers#1453. Required for the orch-side gains in this PR to be visible under realistic 256-rollout concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
65fe273 to
dbb5a6c
Compare
…split) Layered follow-up on perf/r3-rewrite. Keep this for when the to_thread-only send isn't enough — splits routed_experts.data (~85% of payload) into a raw-bytes sidecar so the main msgpack encode no longer pays the base64/copy cost on the hot bytes. - Streams TrainingSample frames per-sample with `await sleep(0)` between, keeping the loop responsive during large-batch encode. - Writes train_rollouts.bin + train_rollouts.routed_experts.bin, with the sidecar landing before the main file (the receiver watches the main file as the "ready" signal). - New v2 format: 4-byte manifest_len + manifest + per-sample (4-byte frame_len + frame). Receiver reconstructs samples from the manifest's offsets/shapes/dtypes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c76f1aa to
c1aecce
Compare
871a69c to
b92f7d3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Draft follow-up to #2609. Layered on top of
perf/r3-rewrite— only the transport-side change is new. Merge or rebase this after #2609 lands; or just pick it up directly if the to_thread-only send in #2609 isn't sufficient for our largest router-replay payloads.What's in here (one commit)
perf(transport): stream-sidecar TrainingBatch sender (routed_experts split)routed_experts.dataout of the msgpack encode and writes it as a single raw-bytes sidecar file (train_rollouts.routed_experts.bin). The hot bytes (~85% of payload) skip the base64-style msgpackbinframing on encode and the corresponding allocation on decode.TrainingSampleframes one at a time withawait asyncio.sleep(0)between, so even the encode phase yields to the event loop instead of holding it through a single bigencoder.encode(batch)call.4-byte manifest_len + manifest + (4-byte frame_len + sample-frame) * n. Receiver reconstructs samples from the manifest'soffsets/shapes/dtypesarrays.When to land this
Use as a fallback if #2609 alone doesn't bring step-boundary lag low enough on router-replay runs (Qwen3-30B-A3B + GLM-5.1 territory, ~25 GB routed_experts payload). For non-router-replay workloads the simpler to_thread-only send in #2609 should be sufficient.
Risk
await sleep(0)is harmless but adds N loop ticks per batch (small CPU overhead, large lag reduction).