chore(configs): router replay perf repro #2614

Draft
mikasenghaas wants to merge 14 commits into main from perf/r3-lag-repro
@mikasenghaas
Member

mikasenghaas commented May 24, 2026

Summary

  • Adds configs/r3_perf/debug.toml, the repro config used to investigate the R3 orchestrator event-loop lag on Qwen3-30B-A3B with router replay (2 train / 1 infer / 4 replicas).
  • Pairs with #2609 (perf(orchestrator): reduce event-loop lag) to reproduce the residual step-boundary stall.
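
For readers unfamiliar with the term, event-loop lag can be approximated by checking how late a scheduled `asyncio.sleep` wakes up. The snippet below is a minimal sketch of that idea only; it is not the instrumentation used in this PR or in #2609, and `measure_loop_lag`, `blocking_work`, and `demo` are hypothetical names introduced for illustration.

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> list[float]:
    """Return how late each `asyncio.sleep(interval)` wake-up was, in seconds."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Any overshoot beyond the requested interval approximates loop lag.
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


async def blocking_work(duration: float) -> None:
    """Hypothetical stand-in for synchronous work that stalls the loop."""
    await asyncio.sleep(0.005)  # yield once so the probe can start sleeping
    time.sleep(duration)  # deliberately blocks the event loop


async def demo() -> list[float]:
    probe = asyncio.create_task(measure_loop_lag(interval=0.01, samples=5))
    await blocking_work(0.2)
    return await probe


if __name__ == "__main__":
    lags = asyncio.run(demo())
    print(f"max lag: {max(lags) * 1000:.1f} ms")
```

A stall at a step boundary would show up in such a probe as one large sample surrounded by near-zero ones.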

🤖 Generated with Claude Code

mikasenghaas and others added 2 commits May 24, 2026 07:32
Configs used to reproduce and iterate on the R3 orchestrator event-loop
lag investigation (router replay, Qwen3-30B-A3B, 2 train / 1 infer / 4
replicas). Inline comments capture the per-round findings (v5 process
pool, v6 gather_chunk_size sweep, v7 wandb samples disable, v9 raw
rollout dump disable).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes use_token_client, use_process_pool, dump_raw_rollouts, and
gather_chunk_size — these were experimental knobs used during the R3
lag investigation that never landed in OrchestratorConfig on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas changed the title from "chore(configs): add r3_perf debug repro" to "chore(configs): router replay perf repro" on May 24, 2026
mikasenghaas and others added 12 commits May 24, 2026 07:42
…er_node

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uffer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…epro

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shared [model] section already propagates name to inference; declaring
inference.model.name as well now fails the RLConfig shared/sub-config
collision validator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
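
The collision described in the commit above can be illustrated with a small check over plain dictionaries. This is a sketch under assumptions: the real RLConfig validator is not shown in this PR, `find_collisions` is a hypothetical helper, and the `trainer` sub-config name is made up for the example.

```python
def find_collisions(shared: dict, sub_configs: dict[str, dict]) -> list[str]:
    """Return dotted paths that are set both in a shared section and in a sub-config."""
    collisions = []
    for section, shared_values in shared.items():
        for sub_name, sub_config in sub_configs.items():
            overridden = sub_config.get(section, {})
            for key in shared_values:
                if key in overridden:
                    collisions.append(f"{sub_name}.{section}.{key}")
    return collisions


# Shared [model] section sets the name once for all sub-configs.
shared = {"model": {"name": "Qwen3-30B-A3B"}}
sub_configs = {
    "inference": {"model": {"name": "Qwen3-30B-A3B"}},  # redundant declaration
    "trainer": {"model": {}},  # hypothetical sub-config with no override
}
collisions = find_collisions(shared, sub_configs)
print(collisions)  # ['inference.model.name']
```

Under this reading, dropping `inference.model.name` and relying on the shared section leaves the list empty.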
…is the default

#2615 made X-Session-ID = trajectory_id the default sticky-routing
header, so the explicit override here is redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
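
As a rough illustration of why a per-trajectory `X-Session-ID` header gives sticky routing, the sketch below maps the header value to a replica with a stable hash. The hash-modulo policy and the names `build_headers` and `pick_replica` are assumptions for illustration; the actual router behaviour is defined by #2615 and is not shown here.

```python
import hashlib


def build_headers(trajectory_id: str) -> dict[str, str]:
    """Attach the trajectory ID as the session header (default per the commit above)."""
    return {"X-Session-ID": trajectory_id}


def pick_replica(session_id: str, num_replicas: int) -> int:
    """Map a session ID to a replica index using a hash that is stable across processes."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas


headers = build_headers("trajectory-0001")  # hypothetical trajectory ID
replica = pick_replica(headers["X-Session-ID"], num_replicas=4)
# Every request carrying the same header value lands on the same replica.
assert replica == pick_replica("trajectory-0001", num_replicas=4)
```

`hashlib` is used instead of the built-in `hash()` because the latter is salted per process for strings, which would break stickiness across restarts.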
Required by configs/r3_perf/debug.toml — the orchestrator crashes at
load_environment without it being installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>