R3 delta replay picks (no configs) by samsja · Pull Request #2648 · PrimeIntellect-ai/prime-rl

samsja · 2026-05-27T01:53:43Z

Rebased version of #2647 on top of main, with unrelated changes removed.

Commits

Implement routed experts delta replay (with branched deltas) — routed_experts.py / serving_tokens.py / trajectories.py + verifiers bump + router pin
fix(orchestrator): match longest active prefix in interleave_rollout
warn on ambiguous prefix match in interleave_rollout

Removed from the original PR

The 4 configs/debug/qwen3_30b_a3b_pd_*_router_replay*.toml debug configs.
The "Handle token export mkdir races" commit: an unrelated concern that belongs in its own PR.
skills/configs/SKILL.md RLM-SWE note — unrelated.
rlm-swe / wordle workspace entries in pyproject.toml — unrelated.
uv.lock churn from the above (reverted to main).

Dependencies

deps/verifiers bumped to d39cc5876 (as in #2647, "R3 delta replay picks").
vllm-router pin switched to a local-path wheel that contains the router-side start field plumbing (the counterpart to router#37, "Merge routed experts deltas with start offsets"). This needs to flip back to a release URL once router#37 is merged and a wheel is published.

Note

High Risk
Changes how training samples merge tokens, masks, and MoE routing tensors; wrong prefix or delta stitching would silently corrupt RL training data, though behavior is heavily covered by new unit tests.

Overview
Adds routed-experts delta replay end-to-end: compact payloads now carry a start index (from routed_experts_prompt_start at inference capture through orchestrator merge), so multi-step and branched rollouts can stitch partial vLLM expert tensors instead of mis-aligning rows.

interleave_rollout is updated to extend the longest matching active token prefix (fixes compaction/rollback where a shorter prefix wrongly absorbed a longer sample’s completions) and to warn when multiple prefixes match. New-branch steps can replay routed-expert chunks from a matching prior prefix state when start > 0.

Dependency: vllm-router is wired from a local wheel under third_party/router/dist/ instead of a release URL.

Unit tests cover longest-prefix matching, multi-step/branch expert alignment, and the start field on serialized experts.

Reviewed by Cursor Bugbot for commit fc9fbaf.

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 9a99ff5.

Squashed from origin/r3-delta (tip 5c94833, which extends the earlier 3799bda with 'Support branched routed expert deltas' for cases where the routed-experts payload diverges across siblings in a group).

Adapts delta replay to main's deferred routed-experts chunk concat: first step starts at 0; extended steps use prefix_len - 1; row 0 fills the boundary, remaining rows append as the new suffix.

Bumps router wheel pin to local-path. Bumps deps/verifiers gitlink to d39cc5876.

Co-Authored-By: S1ro1 <matej.sirovatka@gmail.com>
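The stitching rule in the commit message above (first step at 0, extended steps at prefix_len - 1, row 0 on the boundary, the rest appended) can be illustrated with a minimal sketch. This is not the code in routed_experts.py or trajectories.py: `stitch_delta`, the `Row` alias, and the list-of-lists layout are hypothetical stand-ins for the real tensors, and the sketch assumes the accumulated rows always reach at least up to `start`.

```python
from typing import List

Row = List[int]  # expert ids routed for one token position (assumed layout)


def stitch_delta(existing: List[Row], delta: List[Row], start: int) -> List[Row]:
    """Place `delta` rows at positions start, start + 1, ... of the merged rows."""
    if start < 0:
        raise ValueError(f"start must be non-negative, got {start}")
    if start > len(existing):
        # A gap would leave some token positions without routing rows.
        raise ValueError(f"delta starts at {start} but only {len(existing)} rows exist")
    # Rows before `start` are kept; delta row 0 fills index `start` (the
    # boundary) and the remaining delta rows become the new suffix.
    return existing[:start] + delta


# First step: nothing accumulated yet, so the payload starts at 0.
step1 = stitch_delta([], [[0, 1], [2, 3], [4, 5]], start=0)

# Extended step: with an assumed prefix_len of 4, the delta starts at
# prefix_len - 1 = 3, directly after the three rows already held.
prefix_len = 4
step2 = stitch_delta(step1, [[6, 7], [0, 2]], start=prefix_len - 1)
print(len(step2))  # 5
```

Rejecting a delta whose `start` lies beyond the accumulated rows is a design choice of this sketch: it turns a misaligned payload into a loud error instead of silently shifting rows.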

The first-match-wins loop over active_samples picks the wrong sample when one active prefix is a strict prefix of another. This can happen after a compaction/rollback step whose prompt is shorter than an existing sample's prefix and whose completion re-generates the same tokens and extends past them: the new sample's prefix then starts with the older sample's prefix, and any later step that extends the new sample also satisfies the slice check against the older one.

When that happens, extend_sample folds the newer sample's generated tokens into the older sample as user-input tokens (mask=False, logprob=0) and leaves the newer sample stale -- a silent Exact-Prefix invariant violation.

Switch to longest-match: strictly more specific, never worse than first-match when only one prefix matches.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 0e239d1)
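The failure mode described above is easy to reproduce in isolation. The sketch below is a simplified model, not the actual interleave_rollout code: `first_match` and `longest_match` are hypothetical helpers, and active prefixes are modeled as plain lists of token ids.

```python
from typing import List, Optional, Sequence


def first_match(active: Sequence[List[int]], prompt: List[int]) -> Optional[int]:
    """Old behavior: return the first active prefix the prompt starts with."""
    for idx, prefix in enumerate(active):
        if prompt[: len(prefix)] == prefix:
            return idx
    return None


def longest_match(active: Sequence[List[int]], prompt: List[int]) -> Optional[int]:
    """New behavior: among all matching prefixes, return the longest one."""
    best: Optional[int] = None
    for idx, prefix in enumerate(active):
        if prompt[: len(prefix)] == prefix:
            if best is None or len(prefix) > len(active[best]):
                best = idx
    return best


# Sample 0 is the older sample; sample 1 re-generated the same tokens and
# extended past them, so sample 0's prefix is a strict prefix of sample 1's.
active = [[1, 2, 3], [1, 2, 3, 4, 5]]
prompt = [1, 2, 3, 4, 5, 6]  # a later step that extends sample 1

print(first_match(active, prompt))    # 0: the older, wrong sample
print(longest_match(active, prompt))  # 1: the sample actually being extended
```

When only one prefix matches, both functions return the same index, which is why the switch cannot make the single-match case worse.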

When more than one active prefix matches a step's prompt, log a warning with the example id, step index, set of matching prefix lengths, total active prefixes, and the prompt length. Longest-match still picks the correct extension; the warning just surfaces the rare ambiguous case so it's debuggable if it starts showing up in real rollouts (e.g. from compaction/rollback turns).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit ca38614)
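A warning of this kind might look roughly like the following sketch. The function name `pick_prefix`, the logger name, and the message format are assumptions made for illustration; only the logged fields (example id, step index, matching prefix lengths, number of active prefixes, prompt length) come from the commit message.

```python
import logging
from typing import List, Optional, Sequence

logger = logging.getLogger("interleave_sketch")  # hypothetical logger name


def pick_prefix(
    example_id: str,
    step_idx: int,
    active: Sequence[List[int]],
    prompt: List[int],
) -> Optional[int]:
    """Return the index of the longest matching prefix, warning on ambiguity."""
    matches = [i for i, prefix in enumerate(active) if prompt[: len(prefix)] == prefix]
    if not matches:
        return None
    if len(matches) > 1:
        # More than one active prefix matches: still resolvable, but worth surfacing.
        logger.warning(
            "ambiguous prefix match: example_id=%s step=%d "
            "matching_prefix_lens=%s num_active_prefixes=%d prompt_len=%d",
            example_id,
            step_idx,
            sorted({len(active[i]) for i in matches}),
            len(active),
            len(prompt),
        )
    return max(matches, key=lambda i: len(active[i]))
```

Calling `pick_prefix("ex-0", 2, [[1, 2], [1, 2, 3]], [1, 2, 3, 4])` returns index 1 and emits one warning listing prefix lengths 2 and 3; a prompt matching a single prefix returns silently.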

cursor Bot reviewed May 27, 2026

View reviewed changes

Comment thread src/prime_rl/orchestrator/trajectories.py

Comment thread src/prime_rl/orchestrator/trajectories.py

Comment thread src/prime_rl/orchestrator/trajectories.py

samsja and others added 3 commits May 26, 2026 19:03

samsja force-pushed the r3-delta-replay-picks-clean branch from 9a99ff5 to fc9fbaf Compare May 27, 2026 02:03

samsja wants to merge 3 commits into main from r3-delta-replay-picks-clean
