R3 delta replay picks (no configs) #2648

Open
samsja wants to merge 3 commits into main from r3-delta-replay-picks-clean

samsja (Member) commented May 27, 2026

Rebased version of #2647 on top of main, with unrelated changes removed.

Commits

  • Implement routed experts delta replay (with branched deltas): changes routed_experts.py, serving_tokens.py, and trajectories.py, plus a verifiers bump and a router pin
  • fix(orchestrator): match longest active prefix in interleave_rollout
  • warn on ambiguous prefix match in interleave_rollout

Removed from the original PR

  • The 4 configs/debug/qwen3_30b_a3b_pd_*_router_replay*.toml debug configs.
  • The "Handle token export mkdir races" commit: an unrelated concern that belongs in its own PR.
  • skills/configs/SKILL.md RLM-SWE note — unrelated.
  • rlm-swe / wordle workspace entries in pyproject.toml — unrelated.
  • uv.lock churn from the above (reverted to main).

Dependencies


Note

High Risk
Changes how training samples merge tokens, masks, and MoE routing tensors; wrong prefix or delta stitching would silently corrupt RL training data, though behavior is heavily covered by new unit tests.

Overview
Adds routed-experts delta replay end-to-end: compact payloads now carry a start index (from routed_experts_prompt_start at inference capture through orchestrator merge), so multi-step and branched rollouts can stitch partial vLLM expert tensors instead of mis-aligning rows.

interleave_rollout is updated to extend the longest matching active token prefix (fixes compaction/rollback where a shorter prefix wrongly absorbed a longer sample’s completions) and to warn when multiple prefixes match. New-branch steps can replay routed-expert chunks from a matching prior prefix state when start > 0.
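The difference between first-match and longest-match selection can be sketched as follows. This is a minimal illustration, not the code in trajectories.py: the function names `first_prefix_match` and `longest_prefix_match` and the plain-list representation of token prefixes are assumptions made for the example.

```python
from typing import Optional, Sequence


def first_prefix_match(prompt_ids: Sequence[int],
                       active_prefixes: Sequence[Sequence[int]]) -> Optional[int]:
    """Old behavior (sketch): return the first active prefix the prompt starts with."""
    for idx, prefix in enumerate(active_prefixes):
        if list(prompt_ids[:len(prefix)]) == list(prefix):
            return idx
    return None


def longest_prefix_match(prompt_ids: Sequence[int],
                         active_prefixes: Sequence[Sequence[int]]) -> Optional[int]:
    """New behavior (sketch): return the longest active prefix the prompt starts with."""
    best_idx, best_len = None, -1
    for idx, prefix in enumerate(active_prefixes):
        n = len(prefix)
        # A prompt shorter than the prefix yields a shorter slice, so it never matches.
        if n > best_len and list(prompt_ids[:n]) == list(prefix):
            best_idx, best_len = idx, n
    return best_idx


# Hypothetical token ids: sample 0's prefix is a strict prefix of sample 1's.
active = [[1, 2, 3], [1, 2, 3, 4, 5]]
prompt = [1, 2, 3, 4, 5, 6, 7]

print(first_prefix_match(prompt, active))    # 0: the shorter, wrong sample
print(longest_prefix_match(prompt, active))  # 1: the more specific sample
```

When only one prefix matches, both functions return the same index, so the change only affects the overlapping-prefix case.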

Dependency: vllm-router is wired from a local wheel under third_party/router/dist/ instead of a release URL.

Unit tests cover longest-prefix matching, multi-step/branch expert alignment, and the start field on serialized experts.

Reviewed by Cursor Bugbot for commit fc9fbaf.

cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 9a99ff5.

Comment thread src/prime_rl/orchestrator/trajectories.py
Comment thread src/prime_rl/orchestrator/trajectories.py
Comment thread src/prime_rl/orchestrator/trajectories.py
samsja and others added 3 commits May 26, 2026 19:03

Implement routed experts delta replay (with branched deltas)

Squashed from origin/r3-delta (tip 5c94833, which extends the earlier
3799bda with 'Support branched routed expert deltas' for cases where
the routed-experts payload diverges across siblings in a group).

Adapts delta replay to main's deferred routed-experts chunk concat:
first step starts at 0; extended steps use prefix_len - 1; row 0 fills
the boundary, remaining rows append as the new suffix. Bumps router
wheel pin to local-path. Bumps deps/verifiers gitlink to d39cc5876.

Co-Authored-By: S1ro1 <matej.sirovatka@gmail.com>
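
The stitching rule in this commit message (first step starts at 0, extended steps start at prefix_len - 1, row 0 fills the boundary, the rest append) can be sketched roughly as below. This is an illustration under assumptions, not the code in routed_experts.py: the helper name `stitch_delta`, the nested-list row layout, and the gap check are all hypothetical. The sketch works whether the boundary row is already present (it is overwritten) or still missing (it is appended).

```python
from typing import List, Optional

# Assumed layout: one row per token position, holding top-k expert ids per MoE layer.
Row = List[List[int]]


def stitch_delta(accumulated: Optional[List[Row]],
                 delta: List[Row],
                 start: int) -> List[Row]:
    """Merge a delta chunk whose row 0 sits at absolute token position `start`."""
    if accumulated is None:
        # First step: nothing to stitch onto, so the chunk must begin at 0.
        if start != 0:
            raise ValueError(f"first chunk must start at 0, got {start}")
        return list(delta)
    if not 0 <= start <= len(accumulated):
        # A start past the end would leave unfilled rows, i.e. misaligned experts.
        raise ValueError(f"start={start} leaves a gap after {len(accumulated)} rows")
    # Rows before `start` are kept; the delta supplies position `start` onward.
    return accumulated[:start] + list(delta)


# Step 1: a 4-token sequence, chunk starts at 0.
step1 = [[[0, 1]], [[2, 3]], [[4, 5]], [[6, 7]]]
rows = stitch_delta(None, step1, start=0)

# Step 2: prefix_len = 4, so the delta starts at prefix_len - 1 = 3.
# Row 0 fills the boundary position; the other two rows are the new suffix.
prefix_len = len(rows)
step2 = [[[6, 7]], [[1, 2]], [[3, 4]]]
rows = stitch_delta(rows, step2, start=prefix_len - 1)

print(len(rows))  # 6
```

Rejecting a `start` beyond the accumulated length is a design choice of this sketch: it turns a would-be silent misalignment into an immediate error.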

fix(orchestrator): match longest active prefix in interleave_rollout

The first-match-wins loop over active_samples picks the wrong sample when
one active prefix is a strict prefix of another. This can happen after a
compaction/rollback step whose prompt is shorter than an existing
sample's prefix and whose completion re-generates the same tokens and
extends past them: the new sample's prefix then starts with the older
sample's prefix, and any later step that extends the new sample also
satisfies the slice check against the older one.

When that happens, extend_sample folds the newer sample's generated
tokens into the older sample as user-input tokens (mask=False,
logprob=0) and leaves the newer sample stale -- a silent Exact-Prefix
invariant violation.

Switch to longest-match: it always picks the most specific prefix and is
identical to first-match when only one prefix matches.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 0e239d1)

warn on ambiguous prefix match in interleave_rollout

When more than one active prefix matches a step's prompt, log a warning
with the example id, step index, set of matching prefix lengths, total
active prefixes, and the prompt length. Longest-match still picks the
correct extension; the warning just surfaces the rare ambiguous case so
it's debuggable if it starts showing up in real rollouts (e.g. from
compaction/rollback turns).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit ca38614)
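
A rough sketch of the warning described in this commit message, using the standard `logging` module. The function name `pick_prefix`, the logger name, and the message format are assumptions for illustration; only the logged fields (example id, step index, matching prefix lengths, number of active prefixes, prompt length) come from the message above.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger("interleave_sketch")  # hypothetical logger name


def pick_prefix(prompt_ids: Sequence[int],
                active_prefixes: Sequence[Sequence[int]],
                example_id: str,
                step_idx: int) -> Optional[int]:
    """Pick the longest matching prefix and warn when more than one matches."""
    matches = [
        i for i, prefix in enumerate(active_prefixes)
        if list(prompt_ids[:len(prefix)]) == list(prefix)
    ]
    if not matches:
        return None
    if len(matches) > 1:
        # Selection is unchanged; the warning only makes the rare case visible.
        logger.warning(
            "ambiguous prefix match: example_id=%s step=%d "
            "matching_prefix_lens=%s active_prefixes=%d prompt_len=%d",
            example_id,
            step_idx,
            sorted(len(active_prefixes[i]) for i in matches),
            len(active_prefixes),
            len(prompt_ids),
        )
    return max(matches, key=lambda i: len(active_prefixes[i]))
```

Calling `pick_prefix([1, 2, 3, 4], [[1, 2], [1, 2, 3]], "ex-0", 2)` returns index 1 and emits one warning, while a prompt that matches a single prefix returns it silently.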
samsja force-pushed the r3-delta-replay-picks-clean branch from 9a99ff5 to fc9fbaf on May 27, 2026 02:03