Feat/sft on tool outputs by snimu · Pull Request #2625 · PrimeIntellect-ai/prime-rl

snimu · 2026-05-25T15:27:23Z

No description provided.

Adds the per-env ``SFTConfig`` (on_tool_outputs / alpha / tool_names) under ``TrainEnvConfig.sft`` and a global ``disable_echo`` knob on ``DefaultLossConfig``. Purely additive — both default to off, so this commit has no runtime effect on existing configs. The follow-up commits wire the actual SFT-on-tool-body objective: per-token mask construction in the orchestrator, advantage overlay in the trainer, and IS-ratio / DPPO / KL gating in the loss.

Two additive fields on TrainingSample (sft_mask: list[bool] | None, sft_alpha: float | None) and one on MicroBatch (sft_mask). Distinct from the existing ``sft_loss: bool`` flag, which switches the whole sample to the standalone ``sft_loss_fn``. ``sft_mask`` co-exists with the RL loss: the trainer overlays alpha/n on sft_mask positions of the advantages tensor while leaving assistant tokens under their normal RL advantage. Both fields default to None — zero serialization cost for samples whose env doesn't use SFT. Follow-up commits wire mask construction (orchestrator side) and advantage overlay + loss gating (trainer side).

Adds a pure helper ``_step_sft_mask`` that, given the renderer's ``prompt_attribution`` + verifiers' ``prompt_message_tool_names`` for one trajectory step and the env's ``SFTConfig``, returns a per-token bool mask: True iff the token is the body (``is_content=True``) of a tool-role message whose function name is in the env's allowlist (``tool_names``, or any tool when None). Completion-side entries uniformly False; all-False fallback when SFT is off / attribution is missing. ``interleave_rollout`` gains an ``sft_config`` parameter. Each step's mask is computed once in ``prepare_step_tokens`` and carried into ``TrainingSample.sft_mask`` via ``make_sample`` / ``extend_sample`` (the merge path extends the mask in lockstep with completion_mask: the new prompt tail can carry tool-body tokens, the new completion is uniformly False since the model never samples tool tokens). The orchestrator looks up the per-env SFTConfig via ``train_envs.get(env_name).config.sft`` at the call site, passes it into ``interleave_rollout``, and attaches the per-env ``alpha`` onto ``sample.sft_alpha`` at sample finalization (right next to where ``advantage`` and ``env_name`` are set). Trainer-side overlay + loss gating come in follow-up commits.

Threads ``sft_mask`` + ``sft_alpha`` through the per-sample prepare / packing / padding path:
- ``prepare_sample`` rewrites the per-token advantage to ``alpha / n_sft_tokens`` (or ``alpha`` if ``disable_echo``) on mask positions and flips them into ``loss_mask=True`` so they contribute to the loss. The constant per-rollout normalization keeps long rollouts from dominating short ones on the SFT term.
- Truncation / packing / padding extend ``sft_mask`` in lockstep with ``loss_mask``. A bin that previously had no SFT signal materializes the all-False prefix lazily when the first SFT sample joins.
- ``_make_dummy_batch`` zeros ``sft_mask`` alongside ``advantages`` and ``loss_mask`` so distribution padding never accidentally pulls SFT gradient.
- ``prepare_batch`` gains a ``disable_echo`` kwarg; train.py wires it from ``trainer.loss.disable_echo`` in the next commit. Length-equality assertion extended to cover ``sft_mask``.
Loss-side gating (force IS ratio = 1, skip DPPO trust-region clipping, zero KL on SFT positions) lands in the next commit.

Three surgical edits inside ``default_loss_fn``, gated on the new ``LossInputs.sft_mask`` (None when the rollout's env has SFT off):
- Force ``log_importance_ratio = 0`` on SFT positions → importance_ratio=1 and mismatch_kl=0. The downstream ``kl_loss = loss_mask * log_importance_ratio**2`` term is then zero on these tokens by construction; no separate KL exclusion needed.
- Force ``dppo_invalid_mask`` to False on SFT positions. The trust-region check compares trainer vs inference logprobs, but the inference logprob for prompt tokens is the placeholder 0.0 → ``probs_diff ≈ exp(trainer_logprob) - 1``, which would silently mask the SFT gradient out every step. Excluding SFT positions from the check is what makes the overlay actually train.
- New metrics ``sft_nll_mean`` / ``sft_nll_max`` / ``sft_token_count`` computed over ``loss_mask & sft_mask``. These serve as the world-model loss curve and the per-batch spike detector.
No changes for rollouts whose env has SFT off: sft_mask=None makes the new branches no-ops and leaves the metrics absent from the output dict.
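The placeholder problem in the second bullet can be checked with plain arithmetic. The sketch below is illustrative only: the ``THRESHOLD`` value and the ``abs(diff) > THRESHOLD`` form of the trust-region check are assumptions, not the actual ``dppo_invalid_mask`` logic.

```python
import math


def probs_diff(trainer_logprob: float, inference_logprob: float) -> float:
    """Difference between trainer and inference token probabilities."""
    return math.exp(trainer_logprob) - math.exp(inference_logprob)


# Hypothetical trust-region threshold, chosen only for illustration.
THRESHOLD = 0.2

# Sampled completion token: both logprobs are real and close together.
rl_diff = probs_diff(-1.20, -1.25)

# Prompt-side tool token: the inference logprob is the placeholder 0.0,
# so exp(0.0) == 1 and the difference collapses to exp(trainer_logprob) - 1.
sft_diff = probs_diff(-1.20, 0.0)

rl_masked = abs(rl_diff) > THRESHOLD    # stays inside the trust region
sft_masked = abs(sft_diff) > THRESHOLD  # would be masked out on every step
```

Under these assumptions the tool-body token fails the check for any trainer probability below ``1 - THRESHOLD``, which is why the commit excludes SFT positions from it.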

End-to-end plumbing so the SFT-on-tool-body overlay actually reaches the GPU loss:
- ``compute_loss`` accepts a parallel ``sft_mask`` list and threads each rollout's mask into the corresponding ``LossInputs``. Default ``None`` matches the old call shape, so non-SFT runs are byte-identical.
- ``BasePacker`` (and ``SinglePacker`` / ``MultiPacker`` / ``setup_packer``) carry ``disable_echo`` so ``prepare_batch`` knows whether to apply the ECHO length normalization. ``DataLoader.__init__`` plumbs it through to the packer factory; ``train.py`` reads it off ``config.loss`` (defensive ``getattr`` since SFTLossConfig / CustomLossConfig don't carry the flag).
- ``TensorMicroBatch`` gains an ``sft_mask`` field; the ``MicroBatch → TensorMicroBatch`` conversion materializes it onto a bool tensor (with the standard ``unsqueeze(0)`` batch dim) when present.
- ``train.py:compute_loss`` call site reads the per-batch mask, moves it to CUDA, ``split``s it on response_lengths, and passes through.
Conservative test additions (per AGENTS.md: only pure data transformations):
- ``test_step_sft_mask_*`` (8 cases) exercising the orchestrator-side mask construction: SFTConfig None / disabled / missing attribution / missing names → all-False; tools=None → all body of all tools; tools=[...] → filtered; non-content tokens stay False; completion uniformly False.
- ``test_prepare_sample_overlays_sft_advantage_*`` (4 cases) on the advantage overlay: length-normalized weight, disable_echo constant weight, no-op when alpha is missing, truncation slices the mask in lockstep.

…anches Pins ``deps/renderers`` at the tip of ``sebastian/content-mask-2026-05-19`` (adds ``is_content`` to RenderedTokens + surfaces ``prompt_attribution`` on generate()). Pins ``deps/verifiers`` at the tip of ``sebastian/prompt-tool-names-2026-05-19`` (stacked on ``sebastian/renderers-pass-through-info-2026-05-19``): carries ``prompt_attribution`` through ResponseTokens → TrajectoryStepTokens and adds per-message ``prompt_message_tool_names`` lookup. The orchestrator + trainer changes on this branch consume both APIs. NOT FOR MERGE — these branches aren't pushed to origin yet, so the submodule SHAs are only reachable via the local working trees here. For training runs that clone fresh, push the upstream branches first.

Records ``branch = sebastian/<feature>`` on both submodules so a fresh clone followed by ``git submodule update --init --remote`` tracks the SFT-on-tool-body upstream work end-to-end. The branches are now on origin (no longer local-only): - renderers: sebastian/content-mask-2026-05-19 - verifiers: sebastian/prompt-tool-names-2026-05-19 (stacked on sebastian/renderers-pass-through-info-2026-05-19) NOT FOR MERGE — when the upstream PRs land on main, drop these ``branch`` directives and re-pin both submodules to main.

The SFT-on-tool-body overlay computed weight = alpha / n_sft_tokens. That's not the ECHO objective — ECHO normalizes by the total rollout length, so the per-rollout SFT loss contribution scales as alpha × (n_sft_tokens / total_rollout_length), proportional to how much of the rollout was tool body. The old formulation gave every rollout a constant alpha total SFT contribution regardless of how much tool material it actually had, biasing updates toward sparse-SFT rollouts (each token getting a larger share). In practice the two are off by a factor of (n_sft / total_length). For a rollout that's 30% tool body, the old impl applied ~3.3x the intended SFT weight; for 5% tool body, ~20x. Fix: change ``weight = sft_alpha / n_sft`` → ``weight = sft_alpha / len(input_ids)``. Same shape on the loss_mask flip, IS=1 forcing, KL exclusion, etc — only the per-token magnitude changes. ``disable_echo=True`` is unaffected (still constant alpha). Updates to keep contracts consistent: - src/prime_rl/trainer/batch.py: the actual logic + docstring. - src/prime_rl/transport/types.py: comment on TrainingSample.sft_alpha. - src/prime_rl/trainer/rl/packer.py: comment on disable_echo plumbing. - packages/.../orchestrator.py: SFTConfig docstring. - packages/.../trainer.py: DefaultLossConfig.disable_echo description. - tests/unit/orchestrator/test_batch.py: test fixture expectations (n_sft=2, alpha=0.5 → was 0.25, now 0.125 since total_length=4). Implication for in-flight Forth runs: they were started under the old (too-hot) normalization with alpha=0.5. They keep their original behavior — those runs are now "high-alpha runs" by accident, and the correctly-normalized variant becomes a separate experimental cell once we re-tune alpha against the fixed formulation. Smoke-verified locally against three fixtures (4-token + 10-token inputs, ECHO vs disable_echo) — total SFT contribution scales as alpha × (n_sft / total_length) as expected.
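The numbers quoted in this commit can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical and do not correspond to code in ``batch.py``.

```python
def old_weight(alpha: float, n_sft: int) -> float:
    """Per-token weight before the fix: alpha / n_sft_tokens."""
    return alpha / n_sft


def echo_weight(alpha: float, total_length: int) -> float:
    """Per-token weight after the fix: alpha / len(input_ids)."""
    return alpha / total_length


def overweight_factor(n_sft: int, total_length: int) -> float:
    """How many times hotter the old weight was than the fixed one."""
    return total_length / n_sft


# Test fixture from the commit: n_sft=2, alpha=0.5, total_length=4.
fixture_old = old_weight(0.5, 2)   # 0.25
fixture_new = echo_weight(0.5, 4)  # 0.125

# Rollouts that are 30% and 5% tool body, respectively.
factor_30 = overweight_factor(30, 100)  # about 3.3
factor_5 = overweight_factor(5, 100)    # 20.0
```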

Replace the trainer-level boolean ``disable_echo`` flag with a per-env ``SFTConfig.normalization`` Literal so each env can pick how the SFT-on-tool-body advantage weight scales with rollout shape.
Four modes (notation: α = SFTConfig.alpha, T = total rollout length, S = n_sft_tokens, R = n_rl_tokens = sum(completion_mask)):
    all_tokens (default, ECHO): weight = α / T      → total = α × (S/T)
    sft_tokens:                 weight = α / S      → total = α (constant)
    ratio:                      weight = α × R/S    → total = α × R
    none:                       weight = α          → total = α × S
Motivation: experimental evidence shows the right α depends on the ratio of environment-tokens to model-tokens. ``ratio`` calibrates the SFT contribution against the RL signal magnitude: when SFT is rare (sparse tool outputs), per-SFT-token weight goes up; when SFT is dense (most of the rollout is tool body), it goes down. All counts are taken pre-truncation from the rollout's TrainingSample so the weight reflects the actual rollout shape, not the artifact of seq_len truncation.
Touch points:
- SFTConfig: add normalization field with full docstring spelling out the four formulas.
- DefaultLossConfig: drop disable_echo entirely (was a global flag; per-env normalization replaces it).
- TrainingSample: add sft_normalization transport field; the orchestrator populates it from per-env SFTConfig alongside sft_alpha.
- prepare_sample: dispatch on training_example.sft_normalization; defaults to "all_tokens" when None (belt-and-braces against older rollouts). Unknown modes raise ValueError loudly.
- packer.py / data.py / train.py: remove all disable_echo plumbing (no longer global).
- Tests: replace the two existing SFT-overlay tests with a new set: one per mode (all_tokens, sft_tokens, ratio, none), one for the None→all_tokens fallback, and one pinning the unknown-mode ValueError path.
Migration footgun (for the live Forth runs): they're operating on the pre-fix code (effectively "sft_tokens" normalization) and their configs don't set the new field. Restarting from the new branch will default to "all_tokens": different normalization, different effective α. Either update the configs explicitly to "sft_tokens" before restart, or treat any restart as a new experimental cell. The in-progress runs themselves are unaffected (they're on the old branch).
Smoke-verified locally against all four modes plus the unknown-mode error path and the ratio R=0 degenerate case.
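The four formulas above can be written as one small dispatch function. This is a sketch under stated assumptions: ``sft_token_weight`` is a hypothetical name, and the real ``prepare_sample`` may structure the dispatch differently and may guard S = 0 in a way not shown here.

```python
from typing import Optional


def sft_token_weight(mode: Optional[str], alpha: float, T: int, S: int, R: int) -> float:
    """Per-SFT-token advantage weight for one rollout.

    T = total rollout length, S = number of SFT tokens (assumed > 0),
    R = number of RL tokens. ``None`` falls back to "all_tokens".
    """
    if mode is None:
        mode = "all_tokens"
    if mode == "all_tokens":
        return alpha / T
    if mode == "sft_tokens":
        return alpha / S
    if mode == "ratio":
        return alpha * R / S
    if mode == "none":
        return alpha
    raise ValueError(f"unknown SFT normalization mode: {mode!r}")


# Example rollout: 10 tokens total, 4 tool-body tokens, 5 completion tokens.
alpha, T, S, R = 0.5, 10, 4, 5
totals = {m: S * sft_token_weight(m, alpha, T, S, R)
          for m in ("all_tokens", "sft_tokens", "ratio", "none")}
```

For this rollout the summed SFT contribution is 0.2 for ``all_tokens`` (α × S/T), 0.5 for ``sft_tokens`` (α), 2.5 for ``ratio`` (α × R) and 2.0 for ``none`` (α × S), matching the "total" column.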

) PR #53 (per-token is_content mask for body/scaffold attribution) was merged to renderers/main at 1691f87. Update the submodule pointer from the now-unneeded sebastian/content-mask-2026-05-19 feature branch to main, and drop the `branch = sebastian/...` directive from .gitmodules so a fresh `git submodule update --init --remote` walks to main's tip by default. deps/renderers: 281d89b (sebastian/content-mask-2026-05-19) → 1691f87 (origin/main, includes the merged PR plus #54 routed-experts, #55 kimi tool schema, #56 idna bump, #57 fastokens bump) Verified is_content infrastructure is present in base.py, qwen3.py, qwen3_vl.py, qwen35.py, kimi_k25.py — matches what was on the feature branch. No code change needed elsewhere; the renderers API surface is the same. Note: deps/verifiers is still pinned to sebastian/prompt-tool-names-2026-05-19 — the verifiers PRs haven't landed yet. Same treatment whenever they do.

* fix(wandb): handle shared mode breakage from wandb v0.26.1 wandb v0.26.1 (PR #11759) changed `inform_init` to propagate duplicate stream ID errors as `ServerResponseError` instead of silently swallowing them. This breaks shared mode where multiple processes intentionally call `wandb.init(id=same_run_id)`. - Add `resume="allow"` when a run ID is provided so wandb attaches to the existing run instead of rejecting it - Catch `ServerResponseError` (alongside `CommError`) for non-primary shared-mode processes to handle transient init races Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(wandb): teardown wandb-core between init retries A failed wandb.init leaves the run_id registered in the local wandb-core StreamMux. The next retry hits the same wandb-core, finds the stream still there, and raises ServerResponseError ("run ID ... is in use") regardless of resume="allow" — the check fires before reaching the server. Tearing down the service before sleeping clears the StreamMux so the retry starts fresh. Also catch ServerResponseError on the primary side, since the race is symmetric: either side can win the upsertBucket and the other can hit this path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…2589) * feat(orchestrator): per-env state_columns for extra rollout fields Adds `state_columns: list[str] = []` to `EnvConfig` so each env can persist additional `State` fields into the saved JSONL rollouts on top of the always-saved `trajectory` and `sampling_args`. Merged at the call site (required first, deduped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: drop seen set from state_columns dedup Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
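One way to get "required first, deduped" without a separate ``seen`` set is ``dict.fromkeys``, which preserves first-seen order. This is a sketch, not the code from the commit; ``REQUIRED_STATE_COLUMNS`` and ``merge_state_columns`` are hypothetical names.

```python
# Columns the commit describes as always saved.
REQUIRED_STATE_COLUMNS = ["trajectory", "sampling_args"]


def merge_state_columns(extra: list[str]) -> list[str]:
    """Required columns first, then per-env extras, duplicates dropped."""
    # dict.fromkeys keeps insertion order and ignores repeated keys.
    return list(dict.fromkeys(REQUIRED_STATE_COLUMNS + extra))


merged = merge_state_columns(["reward_breakdown", "trajectory", "reward_breakdown"])
```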

…ion) Picks up the forth-lang configs/private branch (feat/sft-on-tool-outputs-forth-lang) with the disable_echo → SFTConfig.normalization migration. Pairs the qwen + glm45 cmb-*-ct.toml cells with `normalization = "sft_tokens"` and the qwen *-no-norm.toml ablations with `normalization = "none"` — byte-identical mapping to the disable_echo era. Required for the next batch of forth-lang experiments (the `all_tokens` and `ratio` cells will be added in follow-up commits on the same configs/private branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

default_loss_fn previously routed SFT positions through importance_ratio, which was forced to 1 via:
    log_importance_ratio = torch.where(sft_mask, torch.zeros_like(...), ...)
    importance_ratio = torch.exp(log_importance_ratio)
    pg_loss = keep_mask * advantages * importance_ratio
torch.where(cond, zeros_like(x), x) blocks the gradient through ``x`` on the True branch, so on SFT positions importance_ratio became a constant 1 with no gradient w.r.t. trainer_logprobs, and pg_loss became a constant ``advantages``. The SFT gradient was silently zero. Combined runs we believed were SFT-on-tool-body were effectively pure RL.
Fix routes SFT positions through ``advantages * trainer_logprobs`` directly so the gradient w.r.t. trainer_logprobs equals the per-token SFT weight (overlaid in prepare_sample by the chosen normalization mode). After the negation in ``loss = -pg_loss.sum()``, this gives parameter updates in the +∇log p direction on SFT positions, i.e. pure SFT.
RL positions are unchanged: still use ``keep_mask * advantages * importance_ratio`` with the trust-region keep_mask. The log_importance_ratio zeroing on SFT positions stays in place for the KL term (kl_loss = 0 on SFT positions is still correct).
Regression test in tests/unit/train/rl/test_loss.py: ``test_default_loss_fn_sft_mask_gradient_flows`` asserts the gradient on SFT positions is non-zero and equals -advantage[s].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
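The zero-gradient effect can be seen without an autograd framework by differentiating a single-token loss numerically. The sketch below is a scalar stand-in written for illustration (hypothetical function names, no ``keep_mask``), not the actual ``default_loss_fn``.

```python
import math


def old_sft_loss(trainer_logprob: float, inference_logprob: float,
                 advantage: float, is_sft: bool) -> float:
    """Old path: the log ratio is replaced by a constant 0 on SFT positions."""
    log_ratio = 0.0 if is_sft else trainer_logprob - inference_logprob
    return -(advantage * math.exp(log_ratio))


def new_sft_loss(trainer_logprob: float, advantage: float) -> float:
    """Fixed path: SFT positions use advantage * trainer_logprob directly."""
    return -(advantage * trainer_logprob)


def numeric_grad(f, x: float, h: float = 1e-6) -> float:
    """Central finite difference of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


adv, lp = 0.5, -1.2
old_grad = numeric_grad(lambda x: old_sft_loss(x, 0.0, adv, True), lp)
new_grad = numeric_grad(lambda x: new_sft_loss(x, adv), lp)
```

Here ``old_grad`` is exactly zero because the old loss does not depend on the trainer logprob at SFT positions, while ``new_grad`` is approximately ``-adv``, the value the regression test pins.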

Picks up configs/private 03fd21f — drops the old forth-lang TOMLs (which were running against the silent SFT-gradient bug fixed in the previous commit), and adds: - forth-lang: qwen-rl + 4 cmb-code cells (one per normalizer) + sweep/ - deepdive: qwen-rl + 4 cmb cells (no tool_names filter) + sweep/ - alpha_sweep_launcher.sh: 7 αs × 4 norms × 2 envs = 56 runs, paired with this branch's loss fix so the runs actually exercise SFT. Topology: 2 nodes / bs=256 throughout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t test ``.cuda()`` after ``requires_grad=True`` makes the device tensor a non-leaf, so ``.grad`` never populates after backward(). Switch to ``device="cuda"`` in the constructor so it's a leaf on the GPU. Caught when running the test on the cluster against the gradient fix.

prompt_attribution arrives at the orchestrator as a plain dict through the verifiers env-server JSON boundary, even though the renderer emits a RenderedTokens object. The previous code assumed attribute access (``prompt_attribution.message_indices``), which crashed on the cluster: AttributeError: 'dict' object has no attribute 'message_indices' Handle both forms — getattr fallback for the object case, dict.get for the wire-serialized case. Per the function's existing precondition (no runtime type check on ``prompt_attribution: Any``), this is the right contract: the SFT mask should be computable from whatever shape the caller passed in as long as the data is present. Caught when the cluster smoke test crashed at step 0 rollout generation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

prompt_attribution always arrives as a plain dict through the verifiers env-server JSON boundary: the renderer's RenderedTokens object loses its identity in serialization. Drop the prior defensive isinstance/getattr branching and just use dict access. Updates the type annotation on the parameter to ``dict | None`` to match.
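The loss of object identity is easy to reproduce with a dataclass and a JSON round trip. ``RenderedTokensStub`` below is a hypothetical stand-in with only two fields; it is not the renderer's real ``RenderedTokens``.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RenderedTokensStub:
    """Hypothetical stand-in for the renderer's RenderedTokens."""
    message_indices: list[int] = field(default_factory=list)
    is_content: list[bool] = field(default_factory=list)


sent = RenderedTokensStub(message_indices=[0, 0, 1], is_content=[False, True, True])

# Crossing a JSON boundary: dataclass -> dict -> JSON text -> dict.
received = json.loads(json.dumps(asdict(sent)))

# Attribute access no longer works on the receiving side; key access does.
has_attr = hasattr(received, "message_indices")
indices = received["message_indices"]
```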

The previous design flipped SFT positions into ``loss_mask`` and routed everything through ``default_loss_fn``. That had two problems:
1. Double normalization on SFT. SFT positions were scaled by the per-rollout mode weight (α/X, set in ``prepare_sample``) AND divided by the batch-level ``loss_scale`` (= total loss_mask tokens, now including SFT). The per-SFT-token gradient ended up at α/(X × loss_scale), an ~S-times smaller magnitude than the legacy ``sft_loss_weight / N_rl`` regime the production α=0.5 was tuned against.
2. Gradient flow on SFT positions went through ``torch.where(sft_mask, zeros, log_importance_ratio)``, which blocks the gradient through ``trainer_logprobs`` on the True branch. The intent (force the IS ratio to 1 to skip the off-policy correction) silently disabled the SFT signal entirely.
Restructure to match the shape ECHO's algorithm actually has:
- RL term: batch-level normalization by N_rl_batch (the existing ``loss_scale``, now RL-only again).
- SFT term: per-rollout normalization by the chosen mode (α, α/S, α/T, α×R/S, set in ``prepare_sample`` on the advantage tensor). No batch normalization on top.
Implementation:
- ``prepare_sample``: stop flipping SFT into ``loss_mask``. Keep setting ``advantage[sft_pos] = mode_weight``.
- ``LossInputs.sft_mask`` removed: no longer needed since SFT and RL flow through separate ``compute_loss`` calls.
- ``default_loss_fn``: RL-only. All ``sft_mask``-conditional branches dropped.
- New ``sft_pg_loss_fn``: ``loss = -(advantages * trainer_logprobs * loss_mask).sum()``. Caller passes the SFT mask as ``loss_mask`` and ``loss_scale=1`` so no batch division.
- ``train.py``: second ``compute_loss`` call after the RL one, sharing the same ``out["logprobs"]`` tensor, so a single forward + single backward covers both terms.
- Regression test: replaced the broken ``test_default_loss_fn_sft_mask_gradient_flows`` with ``test_sft_pg_loss_fn_gradient_magnitude`` that exercises the new function directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
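The split into two terms can be sketched on plain Python lists. This is a simplified illustration: ``rl_term`` omits the trust-region ``keep_mask`` and the KL term, and both function names are hypothetical rather than the signatures used in the trainer.

```python
def rl_term(advantages, importance_ratios, loss_mask):
    """RL loss: summed over loss_mask positions, divided by the RL token count."""
    n_rl = max(sum(loss_mask), 1)  # plays the role of the batch-level loss_scale
    total = sum(a * r for a, r, m in zip(advantages, importance_ratios, loss_mask) if m)
    return -total / n_rl


def sft_term(advantages, trainer_logprobs, sft_mask):
    """SFT loss: -(advantages * trainer_logprobs * mask).sum(), no batch division."""
    return -sum(a * lp for a, lp, m in zip(advantages, trainer_logprobs, sft_mask) if m)


# 2 prompt (tool-body) tokens followed by 2 completion tokens.
advantages = [0.5, 0.5, 1.0, 1.0]   # SFT weight on prompt, RL advantage on completion
logprobs = [-1.0, -2.0, -0.5, -0.5]
ratios = [1.0, 1.0, 1.0, 1.0]
loss_mask = [False, False, True, True]
sft_mask = [True, True, False, False]

rl_loss = rl_term(advantages, ratios, loss_mask)
sft_loss = sft_term(advantages, logprobs, sft_mask)
total_loss = rl_loss + sft_loss
```

Because the two masks are disjoint, each token contributes to exactly one term, and only the RL term is divided by a token count.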

Three follow-ups from review of the SFT/RL split:
1. prepare_sample now clears sft_mask to None when sft_alpha is unset. Previously sft_mask propagated downstream regardless, and train.py would run sft_pg_loss_fn over the rollout's raw RL advantage on tool tokens, which is reward-shaped, not SFT-direction. With the mask cleared, the trainer skips the SFT compute_loss call as intended.
2. _step_sft_mask now bails to all-False when prompt_attribution lacks "message_indices" or "is_content" keys, matching the docstring's "missing/partial attribution returns all-False" contract. The DefaultRenderer leaves these unpopulated; callers shouldn't have to special-case the no-attribution path.
3. test_batch.py fixtures updated for the new design: SFT prompt tokens stay out of loss_mask (= [False, False, True, True] for the 2-prompt + 2-completion fixture). Added an assertion that prepare_sample clears sft_mask to None when sft_alpha is missing.
Plus a docstring update in batch.py clarifying that "all_tokens" mode (α/T) is NOT exactly ECHO: the paper uses Z=|𝒪|, which matches "sft_tokens" when no tool_names filter is set. "all_tokens" with T (prompt + completion) is a stricter normalization, since T ≥ S makes the per-token weight no larger than α/S.

…dling" This reverts commit 6d831cd.

…lization" This reverts commit bc01f94.

Per team decision: SFT-on-tool-body is RL credit assignment with a flat positive advantage on tool body positions. Drops the four-mode normalization machinery entirely: - SFTConfig: removed ``normalization`` field. Keep on_tool_outputs, alpha (default 1.0), tool_names. - TrainingSample: removed sft_normalization field. - prepare_sample: collapsed the 4-mode dispatch to just ``advantages[k] = sft_alpha`` on SFT-mask positions. - orchestrator.py: drop sft_normalization assignment. - train.py: drop the stale doc comment about per-rollout mode dispatch. - tests: replaced 4-mode parametric tests with one ``advantage = alpha`` assertion. default_loss_fn keeps the torch.where SFT branch (advantage * trainer_logp on SFT positions, log_IR forced to 0 for KL = 0 on SFT positions) — the mechanism is correct under the RL-credit-assignment framing, just no longer dispatched over four modes. Submodule bump picks up the matching configs/private cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
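After this simplification the overlay reduces to a one-line substitution. The helper below is a hypothetical list-based sketch of that behavior, not the tensor code in ``prepare_sample``.

```python
def overlay_sft_advantage(advantages, sft_mask, sft_alpha):
    """Set the advantage to sft_alpha on SFT-mask positions; no-op without a mask or alpha."""
    if sft_mask is None or sft_alpha is None:
        return list(advantages)
    return [sft_alpha if is_sft else adv for adv, is_sft in zip(advantages, sft_mask)]


rollout_adv = [0.3, 0.3, 0.3, 0.3]
mask = [True, False, True, False]

with_sft = overlay_sft_advantage(rollout_adv, mask, 1.0)
without_alpha = overlay_sft_advantage(rollout_adv, mask, None)
```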

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Submodule conflicts in configs/private and deps/verifiers resolved by keeping the feature branch's pointers (ours). The submodule contents themselves will be updated to incorporate main's submodule-side changes in a follow-up commit.

Submodule now points at 43016ea6 (verifiers main tip — "Set routed experts replay start for bridged prompts (#1466)" plus the rest of the post-merge work that landed since the SFT-on-tool-body feature branch was forked). The merged-into-main version of the renderer-attribution PR (#1414, the squash of the earlier feature commit ``5171ae9b``) provides the existing ``TrajectoryStepTokens.prompt_attribution`` sidecar. The new ``RenderedTokens.message_tool_names`` field added in the renderers PR will surface through that sidecar automatically once the renderers pin is bumped; no further verifiers-side changes are needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pins deps/renderers at d7c160b on `sebastian/tool-names-2026-05-28`, which adds ``RenderedTokens.message_tool_names`` — a per-message sidecar parallel to ``message_roles`` carrying the tool function name for each tool-role message in the rendered slice. Filled from ``msg["name"]`` when set, otherwise via the ``tool_call_id → assistant.tool_calls[i].function.name`` join. This is the metadata source the trainer-side SFT-on-tool-body loss reads (via ``prompt_attribution["message_tool_names"]`` on each ``TrajectoryStepTokens``). Repoint to renderers main once the upstream PR lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…h-lang Switches the submodule to the feature-branch tip (25fc6a61) that carries the forth-lang + deepdive sweep work for this branch: clean-slate alpha-sweep matrix, deepdive prod cells, qwen ac_offloading fixes, launcher polling slurm + start.sh wrapping, num_workers=32 on forth-lang-test eval blocks, and the migration to SFT four-mode → single-mode (RL credit assignment) normalization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…05-11 Switches deps/research-environments from sebastian/forth-env (the prior synth-heavy general-agent branch, 995e1a6f) to the forth-lang v1-migrated branch (6fe7c665), which includes: - forth_lang migrated to verifiers v1 Taskset/Harness (#406) - forth_lang bumped to v0.2.0 — T0-T5 tier scheme, 419 rows - Sandbox image v3 (root-owned /opt/forth-lang, docs prebuilt) - run_filter wired to the real word_to_call task filter - The just-completed merge of main into the branch — preserving the v1-aware test_envs.py probe path (3-way exit code: 0 = SingleTurnEnv, 2 = v1 vf.Env, 1 = v0 multi-turn) on top of main's ``_load_environment_python_code`` helper + ``shlex.quote`` refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

b6961cf adds two prod cells under ``sft-on-tool-outputs-forth-lang/``: - ``glm51_rl``: pure RL baseline on GLM-5.1 (8 train + 4 infer nodes, sign_sgd, bs=128). - ``glm51_cmb_code_a0p05``: SFT-on-tool-body with α=0.05 on the code-execution tools (``run_code``, ``submit_code``); excludes ``lookup_docs``. Env args use the v1-nested form (``args.config.taskset`` / ``args.config.harness``), required after the forth-lang v1 migration that landed in research-environments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The metadata source for the SFT-on-tool-body mask moved out of a sibling ``TrajectoryStepTokens.prompt_message_tool_names`` field (that was going to land via the dropped verifiers ``cbc191d2``) and onto the existing ``prompt_attribution`` sidecar as ``message_tool_names`` — auto-populated by every renderer's ``RenderedTokens`` and serialised through the verifiers env-server via the existing ``dataclasses.asdict`` pass. ``_step_sft_mask`` now reads ``prompt_attribution["message_tool_names"]`` directly, dropping the sibling-field parameter. Behaviour is unchanged when the field is populated; falls through to all-False when the renderer pin predates the new field (older trajectories, replay). Also repairs the ``test_step_sft_mask_*`` unit tests, which were silently broken since the production helper switched to dict access on ``prompt_attribution`` — the ``_StubAttribution`` class they used only supported attribute access. Replaced with a plain dict-shaped ``_attribution`` factory that matches the serialised production payload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
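A minimal sketch of the mask construction described here, assuming that ``message_indices`` maps each prompt token to its message, that ``is_content`` is a per-token list, and that ``message_tool_names`` holds ``None`` for non-tool messages. Those shapes, the function name, and the explicit ``n_prompt`` / ``n_completion`` arguments are assumptions made for illustration; the real ``_step_sft_mask`` signature is not shown in this PR text.

```python
from typing import Optional


def step_sft_mask(prompt_attribution: Optional[dict], n_prompt: int,
                  n_completion: int, tool_names: Optional[list]) -> list:
    """True only for body tokens of tool messages whose name passes the allowlist."""
    all_false = [False] * (n_prompt + n_completion)
    if not prompt_attribution:
        return all_false
    keys = ("message_indices", "is_content", "message_tool_names")
    if any(prompt_attribution.get(k) is None for k in keys):
        return all_false  # missing or partial attribution

    names = prompt_attribution["message_tool_names"]
    mask = []
    for msg_i, is_body in zip(prompt_attribution["message_indices"],
                              prompt_attribution["is_content"]):
        name = names[msg_i]
        allowed = name is not None and (tool_names is None or name in tool_names)
        mask.append(bool(is_body and allowed))
    # Completion tokens are never SFT targets: the model does not sample tool tokens.
    return mask + [False] * n_completion


attribution = {
    "message_indices": [0, 0, 1, 1, 1, 2, 2],
    "is_content": [False, True, False, True, True, False, True],
    "message_tool_names": [None, "run_code", "lookup_docs"],
}
only_run_code = step_sft_mask(attribution, 7, 2, ["run_code"])
any_tool = step_sft_mask(attribution, 7, 2, None)
```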

8c67f5f4b brings in: - forth_lang: hash-based holdout filter — three new fields (``holdout_fraction``, ``holdout_seed``, ``holdout_side``) on ``ForthLangTasksetConfig`` for deterministic, dataset-stable train/test splits keyed by ``word_to_call``. - forth_lang: ``dataset_repo`` default read at config-construction time (via ``Field(default_factory=...)``) so ``FORTH_LANG_TASKS_REPO`` works for TOML/CLI flows. - forth_lang: ``scripts/run_filter.py`` nests env-args under ``config.taskset`` / ``config.harness`` for v1 load_environment. - ruff formatter pass. Additive change — existing configs that still use ``word_to_call`` / ``exclude_word_to_call`` are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
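A hash-based holdout of this kind can be sketched as follows. The hashing scheme (SHA-256 over ``seed:word``), the ``"train"`` / ``"test"`` values for ``holdout_side``, and both function names are assumptions for illustration; the actual forth_lang filter may differ.

```python
import hashlib


def in_holdout(word_to_call: str, holdout_fraction: float, holdout_seed: int) -> bool:
    """Deterministically assign a word to the holdout set based on its hash."""
    digest = hashlib.sha256(f"{holdout_seed}:{word_to_call}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")          # uniform in [0, 2**64)
    return bucket < int(holdout_fraction * 2**64)       # integer compare avoids float rounding


def keep_row(word_to_call: str, holdout_fraction: float,
             holdout_seed: int, holdout_side: str) -> bool:
    """Keep held-out words on the "test" side and everything else on the "train" side."""
    held_out = in_holdout(word_to_call, holdout_fraction, holdout_seed)
    return held_out if holdout_side == "test" else not held_out


words = ["DUP", "SWAP", "OVER", "ROT"]
train_rows = [w for w in words if keep_row(w, 0.25, 7, "train")]
test_rows = [w for w in words if keep_row(w, 0.25, 7, "test")]
```

Because membership depends only on the word and the seed, the split stays the same when rows are added to or removed from the dataset.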

f12f6f1 brings in: - forth-lang: swap the 85-word hardcoded train/test holdout for the new hash-based ``holdout_*`` filter. 17 configs migrated; -542 net lines. - forth-lang: nest env-args under ``config.taskset`` / ``config.harness`` for v1 ``load_environment`` (qwen cells — GLM-5.1 cells already shipped nested). - Restores the dropped ``tiers = [N]`` filter on the per-tier eval entries of the ``_nfpt`` variants. Requires deps/research-environments at 8c67f5f4b or newer (the commit that added ``holdout_fraction`` / ``holdout_seed`` / ``holdout_side`` to ``ForthLangTasksetConfig``) — already pinned in this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- deps/research-environments → 58e98d58e: adds ``sandbox_labels: list[str]`` to ``ForthLangTasksetConfig``, merged into the default sandbox's ``labels`` at config load time. Lets callers tag sandboxes without overriding the full ``sandbox`` block. - configs/private → 4b9d48648: sets ``sandbox_labels = ["forth-lang", "<cell-name>"]`` on every forth-lang env (52 sections across 17 configs); adds ``start.sh`` to the two new GLM-5.1 cells. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d cleanup) ae62ccc66 brings in: - forth-lang+deepdive: ``uv pip install`` line in every launch script now includes ``-e "${DEEPDIVE_ENV_DIR}"`` (was defined but skipped on the launcher node, so any rollout that imports the ``deepdive`` env package crashed before the slurm pre_run_command ran). 17 scripts fixed. - forth-lang: drop the out-of-tree ``~/research-prod`` checkout from every start.sh + TOML ``pre_run_command``. The ``forth_lang`` env now installs from ``${PRIME_RL_ROOT}/deps/research-environments/`` like the other three envs (38 files). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

927668b drops ``max_async_level = 1`` from every config — the field was removed from ``RLConfig`` upstream (prime-rl #2631 "hardcode async-barrier semantics") and now raises an "extra inputs not permitted" validation error. Pure no-op (every config already set the value to 1, which is the new hardcoded default). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Repoints deps/renderers from the feature-branch tip ``d7c160b`` to the squash-merged commit on main ``a11f8d8`` ("feat(base): add ``message_tool_names`` field for per-message tool attribution (#74)"). Same change, now upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eepdive) b3e7d9e brings in: - ``fl_sanity`` cell: 10-step Qwen3-4B combined-code sanity config for validating the post-merge pipeline (holdout filter, sandbox_labels, SFT-on-tool-body wiring, deepdive install, ckpt + eval lifecycle) before the GLM-5.1 prod runs. Fresh ``output_dir``, tiny evals (5 examples each). - ``deepdive`` added to ``pre_run_command`` in 18 configs that were missing it — same root cause as the recent start.sh fix but on the slurm side. Without this, every rollout hitting the deepdive eval env would crash on import. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2823cda adds three production cells under ``sft-on-tool-outputs-forth-lang/``: - ``glm45air_rl``: pure RL baseline. - ``glm45air_cmb_code_a0p05``: SFT-on-tool-body α=0.05 restricted to ``[run_code, submit_code]``. - ``glm45air_cmb_all_a0p05``: SFT-on-tool-body α=0.05 across every tool (lookup_docs included via the env's default). Based on mika's tau2-synth-glm45-air template with seq_len bumped to 32k (fits with cp=2) and 960 GB CPU KV cache offload per inference node. 2 train + 2 infer nodes, tp=8, muon optimizer. Same forth-lang env shape as the existing GLM-5.1 / qwen cells. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… cells) 01880a6 sets ``[orchestrator.renderer] name = "glm-4.5"`` on the three GLM-4.5-Air cells. ``PrimeIntellect/GLM-4.5-Air`` isn't in ``MODEL_RENDERER_MAP``, so ``"auto"`` would silently fall back to ``DefaultRenderer`` and fail preflight. Mirror the GLM-5.1 pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``deps/research-environments/environments/forth_lang/pyproject.toml`` pins ``verifiers>=0.1.15.dev9`` (published 2026-05-26). The rolling ``[tool.uv] exclude-newer = "7 days"`` window resolves to ~5 days before that, so ``uv pip install -e ${FORTH_ENV_DIR}`` (in each forth-lang cell's ``start.sh``) fails to resolve verifiers. The workspace ``[tool.uv.sources] verifiers = { workspace = true }`` mapping points uv at the in-tree ``deps/verifiers`` for the main ``uv sync`` — but ``uv pip install -e <env>`` does a pip-style fresh resolve that doesn't honour workspace sources for transitive deps. Mirror the established pattern (cf. ``fastokens = false`` for the analogous renderers/fastokens case): add ``verifiers = false`` to ``[tool.uv.exclude-newer-package]``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
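The date arithmetic behind the failed resolve can be sketched as follows. The publish date and the 7-day window are from the commit message; the resolve date ``today`` is an assumption chosen to reproduce the "~5 days before" figure, and ``resolvable`` is a hypothetical helper that models only the cutoff comparison, not uv's actual resolver.

```python
from datetime import date, timedelta


def resolvable(published: date, today: date, window_days: int = 7) -> bool:
    """True if a release is old enough to pass a rolling exclude-newer cutoff."""
    cutoff = today - timedelta(days=window_days)
    return published <= cutoff


published = date(2026, 5, 26)  # verifiers 0.1.15.dev9, per the commit message
today = date(2026, 5, 28)      # assumed resolve date (not stated in the source)

cutoff = today - timedelta(days=7)
gap_days = (published - cutoff).days  # how far the release sits past the cutoff
ok_now = resolvable(published, today)
ok_later = resolvable(published, date(2026, 6, 2))
```

With the assumed date, the release sits 5 days past the cutoff and is excluded; it would only pass once the rolling window catches up, which is the gap the per-package exemption closes.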

…l race) 982468c removes ``pre_run_command`` from every forth-lang and deepdive cell. ``start.sh``'s ``uv pip install -e ...`` runs once on the launcher into the shared beegfs ``.venv``, which propagates to compute nodes — the duplicate slurm-side install was racing across the fan-out and breaking GLM-4.5-Air launches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ydantic-config beb1528's ``--no-install-project`` in the sbatch ``uv sync`` only covers the root project. Workspace members declared in ``[tool.uv.workspace]`` are still uninstalled/reinstalled by every fresh-shell sync, opening the same FileNotFoundError window for anything that imports from them.

The orchestrator subprocess imports ``prime_rl.configs.*`` (from ``prime-rl-configs``), which in turn imports ``pydantic_config`` (from ``prime-pydantic-config``). When the sbatch fan-out has multiple compute nodes hitting the shared beegfs ``.venv`` concurrently, the brief reinstall window on either workspace member kills the orchestrator on whichever rank loses the race.

Why it didn't bite before:
- Most SFT-on-tool-outputs cells had ``pre_run_command`` doing a ``uv pip install -e ...`` of the env packages. That added a couple of seconds between the sbatch ``uv sync`` and the actual orchestrator subprocess spawn, an accidental buffer that let any in-flight workspace-member reinstall settle.
- Until yesterday's main merge (4f1a301), no commit had touched the ``prime-rl-configs`` source in a while, so uv's editable cache stayed warm and the reinstall was effectively a no-op.

Yesterday's merge brought in 083127f ("chore(config): remove unused sampling args") which modified ``packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py``. That bumped the workspace member's source, so uv decided every fresh-shell sync needs to rebuild + reinstall. Combined with the removal of ``pre_run_command`` (commit 982468c on configs/private, which removed the accidental buffer to fix a separate install race), the workspace-member race got newly exposed on the GLM-4.5-Air launches.

Fix: add ``--no-install-package prime-rl-configs --no-install-package prime-pydantic-config`` to the sbatch ``uv sync`` in all five templates. Tells uv: don't touch those packages on this sync; they're already installed via the launcher's earlier sync, propagated to compute nodes via shared beegfs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… prime-pydantic-config" This reverts commit 4bac4cd.

…nv installs) Pulls in:
- 7a9edd8 Revert "forth-lang+deepdive: drop pre_run_command — start.sh's install is sufficient"
- a031363 forth-lang+deepdive: switch back to pre_run_command for env installs

Compute nodes install the four env packages via pre_run_command; start.sh no longer does ``uv pip install -e ...`` on the launcher. Companion to the prime-rl revert of the sbatch template --no-install-package fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… GB) cbfa17c — typo fix on the GLM-4.5-Air ``kv_cache_offload.cpu_bytes``. 960 GB was a likely extra-digit slip (7.5× the GLM-5.1 cells' 128 GB on a smaller model). Suspected cause of the recent launch's 2h40m step-0 + 94% ``sampling_args=None`` rollout failure rate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fec539c — doubles GLM-4.5-Air inference capacity (2 → 4 infer nodes, 4 replicas with tp=8). Combined with the KV-offload typo fix in the previous bump, step-0 should be substantially faster. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…g 4) 3f3c5c1 — halves training batch, raises oversampling factor to 4 (matches qwen fl_* pattern). Same total in-flight rollouts as before (256 × 4 = 1024) but more rollouts per example for the filters to reject. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
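The in-flight arithmetic above can be checked with a short sketch. The new values (batch 256, oversampling 4) are from the commit message; the previous values (batch 512, oversampling 2) are inferred from "halves training batch" and "same total", and ``in_flight_rollouts`` is a hypothetical helper.

```python
def in_flight_rollouts(batch_size: int, oversampling_factor: int) -> int:
    """Total rollouts in flight, using the batch x oversampling product from the message."""
    return batch_size * oversampling_factor


after = in_flight_rollouts(256, 4)   # values stated in the commit message
before = in_flight_rollouts(512, 2)  # inferred: batch was halved, total unchanged
```

Both products come to the same total, so the change shifts the ratio between the target batch and the oversampled pool without changing the total inference load.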

…g 8, per-tier evals)

Single env-wide temperature per rollout. Assume the config surface (one [orchestrator.train.env.sampling] block per env) and drop the output -> trajectories round-trip:
- Drop `sampling_args` from REQUIRED_STATE_COLUMNS. The orchestrator already knows each env's temperature from its TrainEnv.sampling_args, so there's no reason to require envs to mirror it back through state. This also unblocks v1 envs, which route sampling_args through state["runtime"] and don't surface it at the top level.
- interleave_rollout no longer reads output["sampling_args"]["temperature"] or fills `completion_temperatures`; it leaves the field empty.
- The orchestrator fans the env's scalar temperature out across each sample's completion tokens in the existing post-process loop (where advantage / reward / env_name / training_mode are already stamped), before constructing the TrainingBatch.

TrainingSample / TrainingBatch wire format is unchanged. Trainer-side per-token temperature scaling (scaled_logits = logits / temps) keeps working as-is.

Tests: update tests/unit/orchestrator/test_trajectories.py to assert `completion_temperatures == []` post-interleave (the fan-out happens in the orchestrator, not exercised by these tests).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 39c8d29)
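The fan-out described in this commit can be sketched in plain Python. ``Sample`` is a hypothetical stand-in for ``TrainingSample`` carrying only the two fields involved, and ``fan_out_temperature`` / ``scale_logits`` are illustrative helpers, not functions from the codebase; the real trainer operates on tensors rather than lists.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for TrainingSample; only the fields used here."""
    completion_ids: list
    completion_temperatures: list = field(default_factory=list)


def fan_out_temperature(samples: list, temperature: float) -> None:
    """Stamp one env-wide scalar temperature onto every completion token."""
    for sample in samples:
        sample.completion_temperatures = [temperature] * len(sample.completion_ids)


def scale_logits(logits: list, temps: list) -> list:
    """Per-token scaling, mirroring scaled_logits = logits / temps."""
    return [logit / temp for logit, temp in zip(logits, temps)]


# interleave leaves the field empty; the post-process loop fills it in.
samples = [Sample(completion_ids=[11, 12, 13]), Sample(completion_ids=[21])]
empty_before = [s.completion_temperatures for s in samples]
fan_out_temperature(samples, temperature=0.5)
scaled = scale_logits([1.0, 2.0, 3.0], samples[0].completion_temperatures)
```

Because the per-token list is still populated before the batch is built, the downstream scaling sees the same shape it did when the temperature came from rollout state.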

…6→8)

snimu and others added 30 commits May 19, 2026 18:22

chore(configs): bump configs/private (rpx=16 for ECHO parity)

04d995f

chore(configs): bump configs/private (start.sh-style sweep wrapper)

1bd5ff3

chore(configs): bump configs/private (PATH fix for non-login SSH)

565ea6b

Revert "fix(sft): clear sft_mask without sft_alpha + missing-attr handling"

9cde634

This reverts commit 6d831cd.

Revert "fix(loss): split SFT and RL loss paths to remove double normalization"

acc3365

This reverts commit bc01f94.

chore(configs): bump configs/private (none-only sweep launcher)

8b7ab3d

chore(configs): bump configs/private (3 nodes/run, 2 infer replicas)

767df36

chore: remove accidentally committed .DS_Store

4a72045

snimu and others added 30 commits May 26, 2026 19:19

chore(configs): bump configs/private for nfpt forth-lang variants

59d45c4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

uv lock

386ed5c

uv.lock

36c705b

Revert "fix(slurm): extend --no-install-project to prime-rl-configs + prime-pydantic-config"

3628fe6

This reverts commit 4bac4cd.

chore(configs): bump configs/private (glm45air batch 128, oversampling 8, per-tier evals)

f2ff7b6

chore(configs): bump configs/private (glm45air rollouts_per_example 16→8)

7789b3c


Feat/sft on tool outputs #2625
snimu wants to merge 78 commits into main from feat/sft-on-tool-outputs

snimu commented May 25, 2026


2 participants
