Feat/sft on tool outputs #2625

Draft
snimu wants to merge 78 commits into main from feat/sft-on-tool-outputs

Conversation

@snimu
Collaborator

@snimu snimu commented May 25, 2026

No description provided.

snimu and others added 30 commits May 19, 2026 18:22
Adds the per-env ``SFTConfig`` (on_tool_outputs / alpha / tool_names)
under ``TrainEnvConfig.sft`` and a global ``disable_echo`` knob on
``DefaultLossConfig``. Purely additive — both default to off, so this
commit has no runtime effect on existing configs.

The follow-up commits wire the actual SFT-on-tool-body objective:
per-token mask construction in the orchestrator, advantage overlay
in the trainer, and IS-ratio / DPPO / KL gating in the loss.

Two additive fields on TrainingSample (sft_mask: list[bool] | None,
sft_alpha: float | None) and one on MicroBatch (sft_mask).

Distinct from the existing ``sft_loss: bool`` flag, which switches
the whole sample to the standalone ``sft_loss_fn``. ``sft_mask``
co-exists with the RL loss: the trainer overlays alpha/n on
sft_mask positions of the advantages tensor while leaving assistant
tokens under their normal RL advantage. Both fields default to
None — zero serialization cost for samples whose env doesn't use SFT.

Follow-up commits wire mask construction (orchestrator side) and
advantage overlay + loss gating (trainer side).

Adds a pure helper ``_step_sft_mask`` that, given the renderer's
``prompt_attribution`` + verifiers' ``prompt_message_tool_names`` for
one trajectory step and the env's ``SFTConfig``, returns a per-token
bool mask: True iff the token is the body (``is_content=True``) of a
tool-role message whose function name is in the env's allowlist
(``tool_names``, or any tool when None). Completion-side entries
uniformly False; all-False fallback when SFT is off / attribution is
missing.

``interleave_rollout`` gains an ``sft_config`` parameter. Each step's
mask is computed once in ``prepare_step_tokens`` and carried into
``TrainingSample.sft_mask`` via ``make_sample`` / ``extend_sample``
(the merge path extends the mask in lockstep with completion_mask:
the new prompt tail can carry tool-body tokens, the new completion
is uniformly False since the model never samples tool tokens).

The orchestrator looks up the per-env SFTConfig via
``train_envs.get(env_name).config.sft`` at the call site, passes it
into ``interleave_rollout``, and attaches the per-env ``alpha`` onto
``sample.sft_alpha`` at sample finalization (right next to where
``advantage`` and ``env_name`` are set).

Trainer-side overlay + loss gating come in follow-up commits.
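
As an illustration of the mask rule described in this commit, here is a minimal sketch. It assumes the attribution is a plain dict of parallel per-token lists (``message_indices``, ``is_content``) plus per-message ``message_roles`` and ``message_tool_names`` lists; that dict shape, the function name ``step_sft_mask_sketch``, and its signature are hypothetical and not the real ``_step_sft_mask``.

```python
from typing import Optional, Sequence


def step_sft_mask_sketch(
    num_prompt_tokens: int,
    num_completion_tokens: int,
    attribution: Optional[dict],
    tool_names: Optional[Sequence[str]] = None,
    enabled: bool = True,
) -> list:
    """Hypothetical stand-in for the per-step SFT mask rule."""
    total = num_prompt_tokens + num_completion_tokens
    # All-False fallback when SFT is off or attribution is missing.
    if not enabled or attribution is None:
        return [False] * total

    roles = attribution["message_roles"]        # one entry per message
    names = attribution["message_tool_names"]   # None for non-tool messages
    allow = None if tool_names is None else set(tool_names)

    mask = []
    for msg_idx, content in zip(
        attribution["message_indices"], attribution["is_content"]
    ):
        is_tool = roles[msg_idx] == "tool"
        name_ok = allow is None or names[msg_idx] in allow
        # True only for body tokens of an allowlisted tool message.
        mask.append(bool(content and is_tool and name_ok))

    # Completion tokens are sampled by the model, so they are never tool body.
    return mask + [False] * num_completion_tokens
```

With ``tool_names=None`` every tool-body token is selected; passing an allowlist narrows the mask to the named tools while scaffold tokens (``is_content=False``) stay False either way.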
Threads ``sft_mask`` + ``sft_alpha`` through the per-sample prepare /
packing / padding path:

- ``prepare_sample`` rewrites the per-token advantage to
  ``alpha / n_sft_tokens`` (or ``alpha`` if ``disable_echo``) on mask
  positions and flips them into ``loss_mask=True`` so they contribute
  to the loss. The constant per-rollout normalization keeps long
  rollouts from dominating short ones on the SFT term.
- Truncation / packing / padding extend ``sft_mask`` in lockstep with
  ``loss_mask``. A bin that previously had no SFT signal materializes
  the all-False prefix lazily when the first SFT sample joins.
- ``_make_dummy_batch`` zeros ``sft_mask`` alongside ``advantages``
  and ``loss_mask`` so distribution padding never accidentally pulls
  SFT gradient.
- ``prepare_batch`` gains a ``disable_echo`` kwarg; train.py wires it
  from ``trainer.loss.disable_echo`` in the next commit.

Length-equality assertion extended to cover ``sft_mask``. Loss-side
gating (force IS ratio = 1, skip DPPO trust-region clipping, zero KL
on SFT positions) lands in the next commit.
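
The overlay step at this point in the history can be sketched as follows. This is a simplified illustration on plain Python lists, assuming the semantics stated above (``alpha / n_sft_tokens``, or ``alpha`` under ``disable_echo``, with mask positions flipped into ``loss_mask``); the function name ``overlay_sft_advantage`` is hypothetical, and later commits on this branch change both the weight and the ``loss_mask`` handling.

```python
def overlay_sft_advantage(advantages, loss_mask, sft_mask, sft_alpha,
                          disable_echo=False):
    """Return (advantages, loss_mask) with the SFT overlay applied."""
    # No-op when the env has SFT off or no alpha was attached.
    if sft_mask is None or sft_alpha is None:
        return list(advantages), list(loss_mask)

    n_sft = sum(sft_mask)
    if n_sft == 0:
        return list(advantages), list(loss_mask)

    # Constant per-rollout total (alpha) unless normalization is disabled.
    weight = sft_alpha if disable_echo else sft_alpha / n_sft

    new_adv = [weight if m else a for a, m in zip(advantages, sft_mask)]
    # SFT positions join the loss; RL positions keep their original flag.
    new_loss_mask = [lm or m for lm, m in zip(loss_mask, sft_mask)]
    return new_adv, new_loss_mask
```

For a rollout with two tool-body prompt tokens and two completion tokens, ``alpha=0.5`` gives each SFT token a weight of 0.25 while the completion tokens keep their RL advantage.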
Three surgical edits inside ``default_loss_fn``, gated on the new
``LossInputs.sft_mask`` (None when the rollout's env has SFT off):

- Force ``log_importance_ratio = 0`` on SFT positions → importance_ratio=1
  and mismatch_kl=0. The downstream ``kl_loss = loss_mask *
  log_importance_ratio**2`` term is then zero on these tokens by
  construction; no separate KL exclusion needed.
- Force ``dppo_invalid_mask`` to False on SFT positions. The trust-region
  check compares trainer vs inference logprobs, but the inference logprob
  for prompt tokens is the placeholder 0.0 → ``probs_diff ≈
  exp(trainer_logprob) - 1`` and would silently mask the SFT gradient out
  every step. Excluding SFT positions from the check is what makes the
  overlay actually train.
- New metrics ``sft_nll_mean`` / ``sft_nll_max`` / ``sft_token_count``
  computed over ``loss_mask & sft_mask``. These serve as the world-model
  loss curve and a per-batch spike detector.

No changes for rollouts whose env has SFT off — sft_mask=None makes the
new branches no-ops and the metrics absent from the output dict.
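
To see why the trust-region check would reject SFT positions, the arithmetic can be worked through in a few lines. This sketch assumes the check compares probabilities as ``exp(trainer_logprob) - exp(inference_logprob)`` against a fixed bound; the bound of 0.2 and the helper names are illustrative assumptions, not values taken from the trainer.

```python
import math

# Hypothetical trust-region bound, chosen only for illustration.
ASSUMED_BOUND = 0.2


def probs_diff(trainer_logprob: float, inference_logprob: float) -> float:
    """Difference between trainer and inference token probabilities."""
    return math.exp(trainer_logprob) - math.exp(inference_logprob)


def is_invalid(trainer_logprob: float, inference_logprob: float) -> bool:
    """True when the token would be masked out by the assumed check."""
    return abs(probs_diff(trainer_logprob, inference_logprob)) > ASSUMED_BOUND


# Prompt token: the inference logprob is the placeholder 0.0, i.e. prob 1.
placeholder_diff = probs_diff(-2.0, 0.0)   # exp(-2) - 1
# Sampled token: both sides report a real logprob, so the gap is small.
sampled_diff = probs_diff(-2.0, -2.05)
```

Under these assumptions any prompt token with a trainer probability noticeably below 1 lands far outside the bound, which is why SFT positions are excluded from the check.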
End-to-end plumbing so the SFT-on-tool-body overlay actually reaches
the GPU loss:

- ``compute_loss`` accepts a parallel ``sft_mask`` list and threads
  each rollout's mask into the corresponding ``LossInputs``. Default
  ``None`` matches the old call shape, so non-SFT runs are byte-identical.
- ``BasePacker`` (and ``SinglePacker`` / ``MultiPacker`` / ``setup_packer``)
  carry ``disable_echo`` so ``prepare_batch`` knows whether to apply the
  ECHO length normalization. ``DataLoader.__init__`` plumbs it through
  to the packer factory; ``train.py`` reads it off ``config.loss``
  (defensive ``getattr`` since SFTLossConfig / CustomLossConfig don't
  carry the flag).
- ``TensorMicroBatch`` gains an ``sft_mask`` field; the
  ``MicroBatch → TensorMicroBatch`` conversion materializes it onto a
  bool tensor (with the standard ``unsqueeze(0)`` batch dim) when present.
- ``train.py:compute_loss`` call site reads the per-batch mask, moves
  it to CUDA, ``split``s it on response_lengths, and passes through.

Conservative test additions (per AGENTS.md: only pure data
transformations):

- ``test_step_sft_mask_*`` (8 cases) exercising the orchestrator-side
  mask construction: SFTConfig None / disabled / missing attribution
  / missing names → all-False; tools=None → all body of all tools;
  tools=[...] → filtered; non-content tokens stay False; completion
  uniformly False.
- ``test_prepare_sample_overlays_sft_advantage_*`` (4 cases) on the
  advantage overlay: length-normalized weight, disable_echo constant
  weight, no-op when alpha is missing, truncation slices the mask in
  lockstep.
…anches

Pins ``deps/renderers`` at the tip of ``sebastian/content-mask-2026-05-19``
(adds ``is_content`` to RenderedTokens + surfaces ``prompt_attribution``
on generate()).

Pins ``deps/verifiers`` at the tip of
``sebastian/prompt-tool-names-2026-05-19`` (stacked on
``sebastian/renderers-pass-through-info-2026-05-19``): carries
``prompt_attribution`` through ResponseTokens → TrajectoryStepTokens
and adds per-message ``prompt_message_tool_names`` lookup.

The orchestrator + trainer changes on this branch consume both APIs.

NOT FOR MERGE — these branches aren't pushed to origin yet, so the
submodule SHAs are only reachable via the local working trees here.
For training runs that clone fresh, push the upstream branches first.

Records ``branch = sebastian/<feature>`` on both submodules so a fresh
clone followed by ``git submodule update --init --remote`` tracks the
SFT-on-tool-body upstream work end-to-end.

The branches are now on origin (no longer local-only):
- renderers: sebastian/content-mask-2026-05-19
- verifiers: sebastian/prompt-tool-names-2026-05-19  (stacked on
             sebastian/renderers-pass-through-info-2026-05-19)

NOT FOR MERGE — when the upstream PRs land on main, drop these
``branch`` directives and re-pin both submodules to main.

The SFT-on-tool-body overlay computed weight = alpha / n_sft_tokens.
That's not the ECHO objective — ECHO normalizes by the total rollout
length, so the per-rollout SFT loss contribution scales as
alpha × (n_sft_tokens / total_rollout_length), proportional to how
much of the rollout was tool body. The old formulation gave every
rollout a constant alpha total SFT contribution regardless of how
much tool material it actually had, biasing updates toward
sparse-SFT rollouts (each token getting a larger share).

In practice the old per-token weight is too large by a factor of
(total_length / n_sft). For a rollout that's 30% tool body, the old impl
applied ~3.3x the intended SFT weight; for 5% tool body, ~20x.

Fix: change ``weight = sft_alpha / n_sft`` →
``weight = sft_alpha / len(input_ids)``. Same shape on the loss_mask
flip, IS=1 forcing, KL exclusion, etc — only the per-token magnitude
changes. ``disable_echo=True`` is unaffected (still constant alpha).

Updates to keep contracts consistent:
- src/prime_rl/trainer/batch.py: the actual logic + docstring.
- src/prime_rl/transport/types.py: comment on TrainingSample.sft_alpha.
- src/prime_rl/trainer/rl/packer.py: comment on disable_echo plumbing.
- packages/.../orchestrator.py: SFTConfig docstring.
- packages/.../trainer.py: DefaultLossConfig.disable_echo description.
- tests/unit/orchestrator/test_batch.py: test fixture expectations
  (n_sft=2, alpha=0.5 → was 0.25, now 0.125 since total_length=4).

Implication for in-flight Forth runs: they were started under the old
(too-hot) normalization with alpha=0.5. They keep their original
behavior — those runs are now "high-alpha runs" by accident, and the
correctly-normalized variant becomes a separate experimental cell
once we re-tune alpha against the fixed formulation.

Smoke-verified locally against three fixtures (4-token + 10-token
inputs, ECHO vs disable_echo) — total SFT contribution scales as
alpha × (n_sft / total_length) as expected.
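
The numbers quoted in this commit can be checked with a short calculation. The helper names below are hypothetical; only the two formulas (``alpha / n_sft`` before, ``alpha / len(input_ids)`` after) come from the commit message.

```python
def old_weight(alpha: float, n_sft: int) -> float:
    """Per-token SFT weight before the fix."""
    return alpha / n_sft


def new_weight(alpha: float, total_length: int) -> float:
    """Per-token SFT weight after the fix."""
    return alpha / total_length


def overweight_factor(n_sft: int, total_length: int) -> float:
    """How many times larger the old weight was than the new one."""
    return old_weight(1.0, n_sft) / new_weight(1.0, total_length)


def total_contribution(alpha: float, n_sft: int, total_length: int) -> float:
    """Summed SFT weight over one rollout under the fixed formula."""
    return n_sft * new_weight(alpha, total_length)
```

For the test fixture (``n_sft=2``, ``alpha=0.5``, ``total_length=4``) the per-token weight drops from 0.25 to 0.125, and the rollout's total SFT contribution becomes ``0.5 * 2 / 4 = 0.25``.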
Replace the trainer-level boolean ``disable_echo`` flag with a per-env
``SFTConfig.normalization`` Literal so each env can pick how the
SFT-on-tool-body advantage weight scales with rollout shape. Four modes
(notation: α = SFTConfig.alpha, T = total rollout length,
S = n_sft_tokens, R = n_rl_tokens = sum(completion_mask)):

  all_tokens (default, ECHO):  weight = α / T   → total = α × (S/T)
  sft_tokens:                  weight = α / S   → total = α (constant)
  ratio:                       weight = α × R/S → total = α × R
  none:                        weight = α        → total = α × S

Motivation: experimental evidence shows the right α depends on the
ratio of environment-tokens to model-tokens. ``ratio`` calibrates the
SFT contribution against the RL signal magnitude — when SFT is rare
(sparse tool outputs), per-SFT-token weight goes up; when SFT is dense
(most of the rollout is tool body), it goes down.

All counts are taken pre-truncation from the rollout's TrainingSample
so the weight reflects the actual rollout shape, not the artifact of
seq_len truncation.

Touch points:

- SFTConfig: add normalization field with full docstring spelling out
  the four formulas.
- DefaultLossConfig: drop disable_echo entirely (was a global flag;
  per-env normalization replaces it).
- TrainingSample: add sft_normalization transport field; the
  orchestrator populates it from per-env SFTConfig alongside sft_alpha.
- prepare_sample: dispatch on training_example.sft_normalization;
  defaults to "all_tokens" when None (belt-and-braces against older
  rollouts). Unknown modes raise ValueError loudly.
- packer.py / data.py / train.py: remove all disable_echo plumbing
  (no longer global).
- Tests: replace the two existing SFT-overlay tests with six: one
  per mode (all_tokens, sft_tokens, ratio, none), one for the
  None→all_tokens fallback, and one pinning the unknown-mode
  ValueError path.
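
The four formulas and the fallback behavior listed above can be sketched as one dispatch function. This is an illustration only: the name ``sft_weight`` and its argument names are hypothetical, and the sketch assumes ``n_sft > 0`` rather than reproducing how the real ``prepare_sample`` handles degenerate rollouts.

```python
from typing import Optional


def sft_weight(mode: Optional[str], alpha: float, total_len: int,
               n_sft: int, n_rl: int) -> float:
    """Per-token SFT weight for one rollout (counts are pre-truncation)."""
    if mode is None:
        mode = "all_tokens"  # fallback for samples without the field
    if mode == "all_tokens":
        return alpha / total_len         # total = alpha * S / T
    if mode == "sft_tokens":
        return alpha / n_sft             # total = alpha
    if mode == "ratio":
        return alpha * n_rl / n_sft      # total = alpha * R
    if mode == "none":
        return alpha                     # total = alpha * S
    raise ValueError(f"unknown SFT normalization mode: {mode!r}")
```

Multiplying each returned weight by ``n_sft`` recovers the "total" column of the table, which is a quick way to sanity-check a config before launching a run.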

Migration footgun (for the live Forth runs): they're operating on the
pre-fix code (effectively "sft_tokens" normalization) and their
configs don't set the new field. Restarting from the new branch will
default to "all_tokens" — different normalization, different effective
α. Either update the configs explicitly to "sft_tokens" before
restart, or treat any restart as a new experimental cell. The
in-progress runs themselves are unaffected (they're on the old
branch).

Smoke-verified locally against all four modes plus the unknown-mode
error path and the ratio R=0 degenerate case.
)

PR #53 (per-token is_content mask for body/scaffold attribution) was
merged to renderers/main at 1691f87. Update the submodule pointer
from the now-unneeded sebastian/content-mask-2026-05-19 feature branch
to main, and drop the `branch = sebastian/...` directive from
.gitmodules so a fresh `git submodule update --init --remote` walks to
main's tip by default.

deps/renderers: 281d89b (sebastian/content-mask-2026-05-19)
             →  1691f87 (origin/main, includes the merged PR plus
                         #54 routed-experts, #55 kimi tool schema,
                         #56 idna bump, #57 fastokens bump)

Verified is_content infrastructure is present in base.py, qwen3.py,
qwen3_vl.py, qwen35.py, kimi_k25.py — matches what was on the feature
branch. No code change needed elsewhere; the renderers API surface
is the same.

Note: deps/verifiers is still pinned to
sebastian/prompt-tool-names-2026-05-19 — the verifiers PRs haven't
landed yet. Same treatment whenever they do.

* fix(wandb): handle shared mode breakage from wandb v0.26.1

wandb v0.26.1 (PR #11759) changed `inform_init` to propagate
duplicate stream ID errors as `ServerResponseError` instead of
silently swallowing them. This breaks shared mode where multiple
processes intentionally call `wandb.init(id=same_run_id)`.

- Add `resume="allow"` when a run ID is provided so wandb attaches
  to the existing run instead of rejecting it
- Catch `ServerResponseError` (alongside `CommError`) for non-primary
  shared-mode processes to handle transient init races

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(wandb): teardown wandb-core between init retries

A failed wandb.init leaves the run_id registered in the local
wandb-core StreamMux. The next retry hits the same wandb-core,
finds the stream still there, and raises ServerResponseError
("run ID ... is in use") regardless of resume="allow" — the check
fires before reaching the server.

Tearing down the service before sleeping clears the StreamMux so
the retry starts fresh. Also catch ServerResponseError on the
primary side, since the race is symmetric: either side can win
the upsertBucket and the other can hit this path.
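
The retry-with-teardown pattern described here can be sketched generically. The callables are injected so the example runs without wandb installed; in the real code ``init_fn`` would presumably wrap something like ``wandb.init(id=run_id, resume="allow")`` and ``teardown_fn`` something like ``wandb.teardown()``, but that mapping, the function name ``init_with_retries``, and the retry parameters are assumptions.

```python
import time


def init_with_retries(init_fn, teardown_fn, retry_on, max_attempts=3,
                      backoff_s=0.0):
    """Call init_fn, tearing down local service state between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_err = None
    for attempt in range(max_attempts):
        try:
            return init_fn()
        except retry_on as err:
            last_err = err
            # Clear the stale run registration before the next attempt.
            teardown_fn()
            if attempt < max_attempts - 1:
                time.sleep(backoff_s)
    raise last_err
```

Tearing down before sleeping mirrors the ordering in the commit message: the stale registration is gone by the time the next attempt starts.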

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…2589)

* feat(orchestrator): per-env state_columns for extra rollout fields

Adds `state_columns: list[str] = []` to `EnvConfig` so each env can
persist additional `State` fields into the saved JSONL rollouts on top
of the always-saved `trajectory` and `sampling_args`. Merged at the
call site (required first, deduped).
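
A minimal sketch of the "required first, deduped" merge, assuming an order-preserving dedup is what the call site needs. The function name ``merge_state_columns`` is hypothetical.

```python
def merge_state_columns(required, extra):
    """Required columns first, then extras, dropping repeats in order."""
    # dict preserves insertion order, so the first occurrence wins.
    return list(dict.fromkeys([*required, *extra]))


columns = merge_state_columns(
    ["trajectory", "sampling_args"],
    ["answer", "trajectory", "answer"],
)
```

Because ``dict.fromkeys`` keeps the first occurrence of each key, an env that lists ``trajectory`` again in ``state_columns`` does not change the column order.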

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop seen set from state_columns dedup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion)

Picks up the forth-lang configs/private branch (feat/sft-on-tool-outputs-forth-lang)
with the disable_echo → SFTConfig.normalization migration. Pairs the
qwen + glm45 cmb-*-ct.toml cells with `normalization = "sft_tokens"` and
the qwen *-no-norm.toml ablations with `normalization = "none"` —
byte-identical mapping to the disable_echo era. Required for the next
batch of forth-lang experiments (the `all_tokens` and `ratio` cells
will be added in follow-up commits on the same configs/private branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
default_loss_fn previously routed SFT positions through importance_ratio
which was forced to 1 via:

  log_importance_ratio = torch.where(sft_mask, torch.zeros_like(...), ...)
  importance_ratio = torch.exp(log_importance_ratio)
  pg_loss = keep_mask * advantages * importance_ratio

torch.where(cond, zeros_like(x), x) blocks the gradient through ``x`` on
the True branch — so on SFT positions importance_ratio became a constant
1 with no gradient w.r.t. trainer_logprobs, and pg_loss became a constant
``advantages``. The SFT gradient was silently zero. Combined runs we
believed were SFT-on-tool-body were effectively pure RL.

Fix routes SFT positions through ``advantages * trainer_logprobs``
directly so the gradient w.r.t. trainer_logprobs equals the per-token
SFT weight (overlaid in prepare_sample by the chosen normalization
mode). After the negation in ``loss = -pg_loss.sum()``, this gives
parameter updates in the +∇log p direction on SFT positions — pure SFT.

RL positions are unchanged: still use ``keep_mask * advantages *
importance_ratio`` with the trust-region keep_mask. The log_importance_ratio
zeroing on SFT positions stays in place for the KL term (kl_loss = 0 on
SFT positions is still correct).

Regression test in tests/unit/train/rl/test_loss.py:
``test_default_loss_fn_sft_mask_gradient_flows`` asserts the gradient on
SFT positions is non-zero and equals -advantage[s].
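
The effect can be reproduced without a tensor library by differentiating both formulations numerically. This is a scalar sketch under simplifying assumptions (one token, no ``keep_mask``, made-up values for the advantage and inference logprob); the function names are hypothetical and it stands in for, rather than reproduces, ``default_loss_fn``.

```python
import math

ADV = 0.5          # illustrative per-token SFT weight
INFER_LP = -1.2    # illustrative inference logprob


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def loss_buggy(trainer_lp, is_sft=True):
    # On SFT positions the log-ratio is replaced by the constant 0,
    # so the loss no longer depends on trainer_lp at all.
    log_ratio = 0.0 if is_sft else trainer_lp - INFER_LP
    return -(ADV * math.exp(log_ratio))


def loss_fixed(trainer_lp):
    # SFT positions use advantages * trainer_logprobs directly.
    return -(ADV * trainer_lp)


grad_buggy = numeric_grad(loss_buggy, -0.7)
grad_fixed = numeric_grad(loss_fixed, -0.7)
```

The buggy form has zero derivative with respect to the trainer logprob on SFT positions, while the fixed form gives ``-ADV``, matching what the regression test asserts.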

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up configs/private 03fd21f — drops the old forth-lang TOMLs (which
were running against the silent SFT-gradient bug fixed in the previous
commit), and adds:

- forth-lang: qwen-rl + 4 cmb-code cells (one per normalizer) + sweep/
- deepdive:   qwen-rl + 4 cmb cells (no tool_names filter) + sweep/
- alpha_sweep_launcher.sh: 7 αs × 4 norms × 2 envs = 56 runs, paired
  with this branch's loss fix so the runs actually exercise SFT.

Topology: 2 nodes / bs=256 throughout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t test

``.cuda()`` after ``requires_grad=True`` makes the device tensor a
non-leaf, so ``.grad`` never populates after backward(). Switch to
``device="cuda"`` in the constructor so it's a leaf on the GPU.

Caught when running the test on the cluster against the gradient fix.

prompt_attribution arrives at the orchestrator as a plain dict through
the verifiers env-server JSON boundary, even though the renderer emits
a RenderedTokens object. The previous code assumed attribute access
(``prompt_attribution.message_indices``), which crashed on the cluster:

    AttributeError: 'dict' object has no attribute 'message_indices'

Handle both forms — getattr fallback for the object case, dict.get for
the wire-serialized case. Per the function's existing precondition (no
runtime type check on ``prompt_attribution: Any``), this is the right
contract: the SFT mask should be computable from whatever shape the
caller passed in as long as the data is present.

Caught when the cluster smoke test crashed at step 0 rollout generation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prompt_attribution always arrives as a plain dict through the verifiers
env-server JSON boundary — the renderer's RenderedTokens object loses
its identity in serialization. Drop the prior defensive isinstance/
getattr branching and just use dict access. Updates the type annotation
on the parameter to ``dict | None`` to match.

The previous design flipped SFT positions into ``loss_mask`` and routed
everything through ``default_loss_fn``. That had two problems:

1. Double normalization on SFT. SFT positions were divided by the per-
   rollout mode weight (α/X, set in ``prepare_sample``) AND by the
   batch-level ``loss_scale`` (= total loss_mask tokens, now including
   SFT). The per-SFT-token gradient ended up at α/(X × loss_scale), an
   ~S-times smaller magnitude than the legacy ``sft_loss_weight / N_rl``
   regime the production α=0.5 was tuned against.

2. Gradient flow on SFT positions went through ``torch.where(sft_mask,
   zeros, log_importance_ratio)``, which blocks the gradient through
   ``trainer_logprobs`` on the True branch. The intent (force IS-ratio
   to 1 for off-policy correction skip) silently disabled the SFT
   signal entirely.

Restructure to match the shape ECHO's algorithm actually has:
- RL term: batch-level normalization by N_rl_batch (the existing
  ``loss_scale``, now RL-only again).
- SFT term: per-rollout normalization by the chosen mode (α, α/S, α/T,
  α×R/S — set in ``prepare_sample`` on the advantage tensor). No batch
  normalization on top.

Implementation:
- ``prepare_sample``: stop flipping SFT into ``loss_mask``. Keep setting
  ``advantage[sft_pos] = mode_weight``.
- ``LossInputs.sft_mask`` removed — no longer needed since SFT and RL
  flow through separate ``compute_loss`` calls.
- ``default_loss_fn``: RL-only. All ``sft_mask``-conditional branches
  dropped.
- New ``sft_pg_loss_fn``: ``loss = -(advantages * trainer_logprobs *
  loss_mask).sum()``. Caller passes the SFT mask as ``loss_mask`` and
  ``loss_scale=1`` so no batch division.
- ``train.py``: second ``compute_loss`` call after the RL one, sharing
  the same ``out["logprobs"]`` tensor — single forward + single backward
  covers both terms.
- Regression test: replaced the broken ``test_default_loss_fn_sft_mask_
  gradient_flows`` with ``test_sft_pg_loss_fn_gradient_magnitude`` that
  exercises the new function directly.
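
The split into two differently normalized terms can be sketched on plain lists. The SFT function follows the formula quoted above; the RL function is deliberately reduced to ``advantage * importance_ratio`` and omits the trust-region ``keep_mask`` and KL term, so it is an assumption-laden stand-in rather than ``default_loss_fn``. Both function names are hypothetical.

```python
import math


def sft_pg_loss_sketch(advantages, trainer_logprobs, loss_mask,
                       loss_scale=1.0):
    """-(advantages * trainer_logprobs * loss_mask).sum() / loss_scale."""
    total = sum(a * lp for a, lp, m
                in zip(advantages, trainer_logprobs, loss_mask) if m)
    return -total / loss_scale


def rl_pg_loss_sketch(advantages, trainer_logprobs, inference_logprobs,
                      loss_mask, loss_scale):
    """Simplified RL term: advantage times importance ratio, batch-scaled."""
    total = sum(a * math.exp(lp - ilp) for a, lp, ilp, m
                in zip(advantages, trainer_logprobs, inference_logprobs,
                       loss_mask) if m)
    return -total / loss_scale


advantages = [0.05, 0.05, 1.0, 0.5]      # first two are SFT weights
logprobs = [-0.5, -1.5, -0.2, -0.3]
sft_mask = [True, True, False, False]
rl_mask = [False, False, True, True]

# RL term is divided by the number of RL tokens; SFT term is not.
rl_loss = rl_pg_loss_sketch(advantages, logprobs, logprobs, rl_mask,
                            loss_scale=sum(rl_mask))
sft_loss = sft_pg_loss_sketch(advantages, logprobs, sft_mask, loss_scale=1.0)
total_loss = rl_loss + sft_loss
```

Keeping ``loss_scale=1`` on the SFT call is what avoids the double normalization: the per-rollout weight already sits in the advantage values.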

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from review of the SFT/RL split:

1. prepare_sample now clears sft_mask to None when sft_alpha is unset.
   Previously sft_mask propagated downstream regardless, and train.py
   would run sft_pg_loss_fn over the rollout's raw RL advantage on
   tool tokens — reward-shaped, not SFT-direction. With the mask
   cleared, the trainer skips the SFT compute_loss call as intended.

2. _step_sft_mask now bails to all-False when prompt_attribution lacks
   "message_indices" or "is_content" keys, matching the docstring's
   "missing/partial attribution returns all-False" contract. The
   DefaultRenderer leaves these unpopulated; callers shouldn't have
   to special-case the no-attribution path.

3. test_batch.py fixtures updated for the new design: SFT prompt
   tokens stay out of loss_mask (= [False, False, True, True] for the
   2-prompt + 2-completion fixture). Added an assertion that
   prepare_sample clears sft_mask to None when sft_alpha is missing.

Plus a docstring update in batch.py clarifying that "all_tokens" mode
(α/T) is NOT exactly ECHO: the paper uses Z=|𝒪|, which matches
"sft_tokens" when no tool_names filter is set. "all_tokens" divides by T
(prompt + completion), a larger count whenever the rollout contains any
non-SFT token, so its per-token weight is smaller.

Per team decision: SFT-on-tool-body is RL credit assignment with a flat
positive advantage on tool body positions. Drops the four-mode
normalization machinery entirely:

- SFTConfig: removed ``normalization`` field. Keep on_tool_outputs,
  alpha (default 1.0), tool_names.
- TrainingSample: removed sft_normalization field.
- prepare_sample: collapsed the 4-mode dispatch to just
  ``advantages[k] = sft_alpha`` on SFT-mask positions.
- orchestrator.py: drop sft_normalization assignment.
- train.py: drop the stale doc comment about per-rollout mode dispatch.
- tests: replaced 4-mode parametric tests with one ``advantage = alpha``
  assertion.

default_loss_fn keeps the torch.where SFT branch (advantage * trainer_logp
on SFT positions, log_IR forced to 0 for KL = 0 on SFT positions) — the
mechanism is correct under the RL-credit-assignment framing, just no
longer dispatched over four modes.

Submodule bump picks up the matching configs/private cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
snimu and others added 30 commits May 26, 2026 19:19
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Submodule conflicts in configs/private and deps/verifiers resolved by
keeping the feature branch's pointers (ours). The submodule contents
themselves will be updated to incorporate main's submodule-side
changes in a follow-up commit.
Submodule now points at 43016ea6 (verifiers main tip — "Set routed
experts replay start for bridged prompts (#1466)" plus the rest of
the post-merge work that landed since the SFT-on-tool-body feature
branch was forked).

The merged-into-main version of the renderer-attribution PR (#1414,
the squash of the earlier feature commit ``5171ae9b``) provides the
existing ``TrajectoryStepTokens.prompt_attribution`` sidecar. The
new ``RenderedTokens.message_tool_names`` field added in the
renderers PR will surface through that sidecar automatically once
the renderers pin is bumped; no further verifiers-side changes are
needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins deps/renderers at d7c160b on `sebastian/tool-names-2026-05-28`,
which adds ``RenderedTokens.message_tool_names`` — a per-message
sidecar parallel to ``message_roles`` carrying the tool function
name for each tool-role message in the rendered slice. Filled from
``msg["name"]`` when set, otherwise via the
``tool_call_id → assistant.tool_calls[i].function.name`` join.
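
The name-resolution rule can be sketched as follows, assuming OpenAI-style chat message dicts (``role``, ``name``, ``tool_call_id``, ``tool_calls[i]["id"]`` and ``["function"]["name"]``). The function name ``resolve_message_tool_names`` is hypothetical and the real renderer code may differ in details such as how unmatched ids are handled.

```python
def resolve_message_tool_names(messages):
    """Return one entry per message: the tool name, or None if not a tool."""
    call_names = {}  # tool_call_id -> function name from assistant turns
    names = []
    for msg in messages:
        if msg.get("role") == "assistant":
            for call in msg.get("tool_calls") or []:
                call_names[call["id"]] = call["function"]["name"]
        if msg.get("role") == "tool":
            # Prefer an explicit name; otherwise join through tool_call_id.
            names.append(
                msg.get("name") or call_names.get(msg.get("tool_call_id"))
            )
        else:
            names.append(None)
    return names
```

The result is parallel to the message list, so index ``i`` lines up with ``message_roles[i]`` when building the SFT mask.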

This is the metadata source the trainer-side SFT-on-tool-body loss
reads (via ``prompt_attribution["message_tool_names"]`` on each
``TrajectoryStepTokens``). Repoint to renderers main once the
upstream PR lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h-lang

Switches the submodule to the feature-branch tip (25fc6a61) that
carries the forth-lang + deepdive sweep work for this branch:
clean-slate alpha-sweep matrix, deepdive prod cells, qwen ac_offloading
fixes, launcher polling slurm + start.sh wrapping, num_workers=32 on
forth-lang-test eval blocks, and the migration from four-mode SFT
normalization to the single-mode (RL credit assignment) formulation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…05-11

Switches deps/research-environments from sebastian/forth-env (the
prior synth-heavy general-agent branch, 995e1a6f) to the
forth-lang v1-migrated branch (6fe7c665), which includes:

- forth_lang migrated to verifiers v1 Taskset/Harness (#406)
- forth_lang bumped to v0.2.0 — T0-T5 tier scheme, 419 rows
- Sandbox image v3 (root-owned /opt/forth-lang, docs prebuilt)
- run_filter wired to the real word_to_call task filter
- The just-completed merge of main into the branch — preserving
  the v1-aware test_envs.py probe path (3-way exit code: 0 =
  SingleTurnEnv, 2 = v1 vf.Env, 1 = v0 multi-turn) on top of main's
  ``_load_environment_python_code`` helper + ``shlex.quote``
  refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b6961cf adds two prod cells under
``sft-on-tool-outputs-forth-lang/``:

  - ``glm51_rl``: pure RL baseline on GLM-5.1
    (8 train + 4 infer nodes, sign_sgd, bs=128).
  - ``glm51_cmb_code_a0p05``: SFT-on-tool-body with α=0.05 on the
    code-execution tools (``run_code``, ``submit_code``); excludes
    ``lookup_docs``.

Env args use the v1-nested form
(``args.config.taskset`` / ``args.config.harness``), required after
the forth-lang v1 migration that landed in research-environments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The metadata source for the SFT-on-tool-body mask moved out of a
sibling ``TrajectoryStepTokens.prompt_message_tool_names`` field
(that was going to land via the dropped verifiers ``cbc191d2``) and
onto the existing ``prompt_attribution`` sidecar as
``message_tool_names`` — auto-populated by every renderer's
``RenderedTokens`` and serialised through the verifiers env-server
via the existing ``dataclasses.asdict`` pass.

``_step_sft_mask`` now reads
``prompt_attribution["message_tool_names"]`` directly, dropping the
sibling-field parameter. Behaviour is unchanged when the field is
populated; falls through to all-False when the renderer pin
predates the new field (older trajectories, replay).

Also repairs the ``test_step_sft_mask_*`` unit tests, which were
silently broken since the production helper switched to dict
access on ``prompt_attribution`` — the ``_StubAttribution`` class
they used only supported attribute access. Replaced with a plain
dict-shaped ``_attribution`` factory that matches the serialised
production payload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8c67f5f4b brings in:
  - forth_lang: hash-based holdout filter — three new fields
    (``holdout_fraction``, ``holdout_seed``, ``holdout_side``) on
    ``ForthLangTasksetConfig`` for deterministic, dataset-stable
    train/test splits keyed by ``word_to_call``.
  - forth_lang: ``dataset_repo`` default read at config-construction
    time (via ``Field(default_factory=...)``) so
    ``FORTH_LANG_TASKS_REPO`` works for TOML/CLI flows.
  - forth_lang: ``scripts/run_filter.py`` nests env-args under
    ``config.taskset`` / ``config.harness`` for v1 load_environment.
  - ruff formatter pass.
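
One way a hash-based holdout keyed by ``word_to_call`` could work is sketched below. The hashing scheme (SHA-256 over ``seed:word``), the ``"train"`` / ``"test"`` values for ``holdout_side``, and the function names are all assumptions made for illustration; the actual ``ForthLangTasksetConfig`` filter may be implemented differently.

```python
import hashlib


def in_holdout(word_to_call: str, holdout_fraction: float,
               holdout_seed: int) -> bool:
    """Deterministically assign a word to the holdout set."""
    key = f"{holdout_seed}:{word_to_call}".encode("utf-8")
    value = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    # Integer threshold avoids float rounding at the 0.0 and 1.0 edges.
    return value < int(holdout_fraction * 2**64)


def keep_row(word_to_call: str, holdout_fraction: float, holdout_seed: int,
             holdout_side: str) -> bool:
    """Keep held-out words on the test side, the rest on the train side."""
    held = in_holdout(word_to_call, holdout_fraction, holdout_seed)
    if holdout_side == "test":
        return held
    if holdout_side == "train":
        return not held
    raise ValueError(f"unknown holdout_side: {holdout_side!r}")
```

Because the assignment depends only on the seed and the word, adding or removing dataset rows does not move existing words between splits, which is the dataset-stability property the commit describes.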

Additive change — existing configs that still use
``word_to_call`` / ``exclude_word_to_call`` are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f12f6f1 brings in:
  - forth-lang: swap the 85-word hardcoded train/test holdout for
    the new hash-based ``holdout_*`` filter. 17 configs migrated;
    -542 net lines.
  - forth-lang: nest env-args under ``config.taskset`` /
    ``config.harness`` for v1 ``load_environment`` (qwen cells —
    GLM-5.1 cells already shipped nested).
  - Restores the dropped ``tiers = [N]`` filter on the per-tier
    eval entries of the ``_nfpt`` variants.

Requires deps/research-environments at 8c67f5f4b or newer (the
commit that added ``holdout_fraction`` / ``holdout_seed`` /
``holdout_side`` to ``ForthLangTasksetConfig``) — already pinned
in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  - deps/research-environments → 58e98d58e: adds
    ``sandbox_labels: list[str]`` to ``ForthLangTasksetConfig``,
    merged into the default sandbox's ``labels`` at config load
    time. Lets callers tag sandboxes without overriding the full
    ``sandbox`` block.
  - configs/private → 4b9d48648: sets
    ``sandbox_labels = ["forth-lang", "<cell-name>"]`` on every
    forth-lang env (52 sections across 17 configs); adds
    ``start.sh`` to the two new GLM-5.1 cells.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d cleanup)

ae62ccc66 brings in:
  - forth-lang+deepdive: ``uv pip install`` line in every launch
    script now includes ``-e "${DEEPDIVE_ENV_DIR}"`` (was defined
    but skipped on the launcher node, so any rollout that imports
    the ``deepdive`` env package crashed before the slurm
    pre_run_command ran). 17 scripts fixed.
  - forth-lang: drop the out-of-tree ``~/research-prod`` checkout
    from every start.sh + TOML ``pre_run_command``. The
    ``forth_lang`` env now installs from
    ``${PRIME_RL_ROOT}/deps/research-environments/`` like the
    other three envs (38 files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
927668b drops ``max_async_level = 1`` from every config — the
field was removed from ``RLConfig`` upstream (prime-rl #2631
"hardcode async-barrier semantics") and now raises an "extra
inputs not permitted" validation error. Pure no-op (every config
already set the value to 1, which is the new hardcoded default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repoints deps/renderers from the feature-branch tip
``d7c160b`` to the squash-merged commit on main
``a11f8d8`` ("feat(base): add ``message_tool_names`` field for
per-message tool attribution (#74)"). Same change, now upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eepdive)

b3e7d9e brings in:
  - ``fl_sanity`` cell: 10-step Qwen3-4B combined-code sanity
    config for validating the post-merge pipeline (holdout filter,
    sandbox_labels, SFT-on-tool-body wiring, deepdive install,
    ckpt + eval lifecycle) before the GLM-5.1 prod runs. Fresh
    ``output_dir``, tiny evals (5 examples each).
  - ``deepdive`` added to ``pre_run_command`` in 18 configs that
    were missing it — same root cause as the recent start.sh fix
    but on the slurm side. Without this, every rollout hitting the
    deepdive eval env would crash on import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2823cda adds three production cells under
``sft-on-tool-outputs-forth-lang/``:

  - ``glm45air_rl``: pure RL baseline.
  - ``glm45air_cmb_code_a0p05``: SFT-on-tool-body α=0.05 restricted
    to ``[run_code, submit_code]``.
  - ``glm45air_cmb_all_a0p05``: SFT-on-tool-body α=0.05 across
    every tool (lookup_docs included via the env's default).

Based on mika's tau2-synth-glm45-air template with seq_len bumped
to 32k (fits with cp=2) and 960 GB CPU KV cache offload per
inference node. 2 train + 2 infer nodes, tp=8, muon optimizer.
Same forth-lang env shape as the existing GLM-5.1 / qwen cells.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… cells)

01880a6 sets ``[orchestrator.renderer] name = "glm-4.5"`` on the
three GLM-4.5-Air cells. ``PrimeIntellect/GLM-4.5-Air`` isn't in
``MODEL_RENDERER_MAP``, so ``"auto"`` would silently fall back to
``DefaultRenderer`` and fail preflight. Mirror the GLM-5.1 pattern.
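
The fallback described above can be sketched as follows. This is a minimal illustration, not the renderers package's actual code: only ``MODEL_RENDERER_MAP``, ``DefaultRenderer``, ``"auto"`` and ``"glm-4.5"`` come from the commit message, while the map contents and the ``resolve_renderer_name`` helper are hypothetical.

```python
# Hypothetical stand-in for the real MODEL_RENDERER_MAP; the entry is illustrative.
MODEL_RENDERER_MAP = {"example-org/known-model": "known-renderer"}
DEFAULT_RENDERER = "DefaultRenderer"


def resolve_renderer_name(model_name: str, configured: str = "auto") -> str:
    """Hypothetical helper: an explicit name wins; "auto" looks the model up
    and silently falls back to the default renderer on a miss."""
    if configured != "auto":
        return configured
    return MODEL_RENDERER_MAP.get(model_name, DEFAULT_RENDERER)


model = "PrimeIntellect/GLM-4.5-Air"
# "auto" misses the map, so the default renderer is chosen without any warning.
auto_choice = resolve_renderer_name(model)
# Pinning the name in the config bypasses the lookup entirely.
pinned_choice = resolve_renderer_name(model, configured="glm-4.5")
```

Pinning the renderer name in each cell avoids depending on the map being kept in sync with every model id.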

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``deps/research-environments/environments/forth_lang/pyproject.toml``
pins ``verifiers>=0.1.15.dev9`` (published 2026-05-26). The rolling
``[tool.uv] exclude-newer = "7 days"`` window resolves to ~5 days
before that, so ``uv pip install -e ${FORTH_ENV_DIR}`` (in each
forth-lang cell's ``start.sh``) fails to resolve verifiers.

The workspace ``[tool.uv.sources] verifiers = { workspace = true }``
mapping points uv at the in-tree ``deps/verifiers`` for the main
``uv sync`` — but ``uv pip install -e <env>`` does a pip-style fresh
resolve that doesn't honour workspace sources for transitive deps.

Mirror the established pattern (cf. ``fastokens = false`` for the
analogous renderers/fastokens case): add ``verifiers = false`` to
``[tool.uv.exclude-newer-package]``.
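
The date arithmetic behind the failure can be sketched like this. It models only the effect of a rolling cutoff and of a per-package exemption, not uv's resolver. The publish date is from the commit message; ``assumed_today`` is an assumption chosen so that the cutoff lands about 5 days before the publish date, and ``is_visible`` is a hypothetical helper.

```python
from datetime import date, timedelta


def is_visible(published: date, today: date, window_days: int = 7,
               exempt: bool = False) -> bool:
    """Hypothetical model: a release is resolvable if the package is exempt
    from the cutoff, or if it was published on or before the cutoff."""
    if exempt:
        return True
    cutoff = today - timedelta(days=window_days)
    return published <= cutoff


verifiers_published = date(2026, 5, 26)  # from the commit message
assumed_today = date(2026, 5, 28)        # assumption, see the lead-in

cutoff = assumed_today - timedelta(days=7)
days_short = (verifiers_published - cutoff).days

# Without the exemption the pinned release is newer than the cutoff.
visible_by_default = is_visible(verifiers_published, assumed_today)
# With ``verifiers = false`` the package is no longer subject to the cutoff.
visible_when_exempt = is_visible(verifiers_published, assumed_today, exempt=True)
```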

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l race)

982468c removes ``pre_run_command`` from every forth-lang and
deepdive cell. ``start.sh``'s ``uv pip install -e ...`` runs once
on the launcher into the shared beegfs ``.venv``, which propagates
to compute nodes — the duplicate slurm-side install was racing
across the fan-out and breaking GLM-4.5-Air launches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ydantic-config

beb1528's ``--no-install-project`` in the sbatch ``uv sync`` only
covers the root project. Workspace members declared in
``[tool.uv.workspace]`` are still uninstalled/reinstalled by every
fresh-shell sync, opening the same FileNotFoundError window for
anything that imports from them.

The orchestrator subprocess imports ``prime_rl.configs.*`` (from
``prime-rl-configs``), which in turn imports ``pydantic_config``
(from ``prime-pydantic-config``). When the sbatch fan-out has
multiple compute nodes hitting the shared beegfs ``.venv``
concurrently, the brief reinstall window on either workspace
member kills the orchestrator on whichever rank loses the race.

Why it didn't bite before:

  - Most SFT-on-tool-outputs cells had ``pre_run_command`` doing an
    ``uv pip install -e ...`` of the env packages. That added a
    couple of seconds between the sbatch ``uv sync`` and the actual
    orchestrator subprocess spawn — an accidental buffer that let
    any in-flight workspace-member reinstall settle.
  - Until yesterday's main merge (4f1a301), no commit had touched
    the ``prime-rl-configs`` source in a while, so uv's editable
    cache stayed warm and the reinstall was effectively a no-op.

Yesterday's merge brought in 083127f ("chore(config): remove
unused sampling args") which modified
``packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py``.
That bumped the workspace member's source, so uv decided every
fresh-shell sync needs to rebuild + reinstall. Combined with the
removal of ``pre_run_command`` (commit 982468c on configs/private,
which removed the accidental buffer to fix a separate install
race), the workspace-member race got newly exposed on the GLM-4.5-Air
launches.

Fix: add ``--no-install-package prime-rl-configs --no-install-package
prime-pydantic-config`` to the sbatch ``uv sync`` in all five
templates. Tells uv: don't touch those packages on this sync —
they're already installed via the launcher's earlier sync, propagated
to compute nodes via shared beegfs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nv installs)

Pulls in:
  - 7a9edd8 Revert "forth-lang+deepdive: drop pre_run_command — start.sh's install is sufficient"
  - a031363 forth-lang+deepdive: switch back to pre_run_command for env installs

Compute nodes install the four env packages via pre_run_command;
start.sh no longer does ``uv pip install -e ...`` on the launcher.
Companion to the prime-rl revert of the sbatch template
--no-install-package fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… GB)

cbfa17c: typo fix on the GLM-4.5-Air ``kv_cache_offload.cpu_bytes``.
960 GB was likely an extra-digit slip (7.5× the GLM-5.1 cells'
128 GB, on a smaller model). It is the suspected cause of the recent
launch's 2h40m step 0 and its 94% ``sampling_args=None`` rollout
failure rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fec539c — doubles GLM-4.5-Air inference capacity (2 → 4 infer
nodes, 4 replicas with tp=8). Combined with the KV-offload typo
fix in the previous bump, step-0 should be substantially faster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g 4)

3f3c5c1 — halves training batch, raises oversampling factor to 4
(matches qwen fl_* pattern). Same total in-flight rollouts as
before (256 × 4 = 1024) but more rollouts per example for the
filters to reject.
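
As a quick check of the arithmetic above: the new values (256 and 4) are from the commit message, while the previous values (512 and 2) are inferred from "halves" and "same total" rather than stated, and ``in_flight_rollouts`` is a hypothetical helper.

```python
def in_flight_rollouts(batch_size: int, oversampling_factor: int) -> int:
    """Hypothetical helper: total rollouts in flight for one training step."""
    return batch_size * oversampling_factor


new_total = in_flight_rollouts(batch_size=256, oversampling_factor=4)
# Inferred previous setting: double the batch, half the oversampling factor.
inferred_old_total = in_flight_rollouts(batch_size=512, oversampling_factor=2)
```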

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use a single env-wide temperature per rollout. Rely on the config
surface (one [orchestrator.train.env.sampling] block per env) and drop
the output -> trajectories round-trip:

- Drop `sampling_args` from REQUIRED_STATE_COLUMNS. The orchestrator
  already knows each env's temperature from its TrainEnv.sampling_args,
  so there's no reason to require envs to mirror it back through state.
  This also unblocks v1 envs, which route sampling_args through
  state["runtime"] and don't surface it at the top level.
- interleave_rollout no longer reads output["sampling_args"]["temperature"]
  or fills `completion_temperatures`; it leaves the field empty.
- The orchestrator fans the env's scalar temperature out across each
  sample's completion tokens in the existing post-process loop (where
  advantage / reward / env_name / training_mode are already stamped),
  before constructing the TrainingBatch.

TrainingSample / TrainingBatch wire format is unchanged. Trainer-side
per-token temperature scaling (scaled_logits = logits / temps) keeps
working as-is.
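
The fan-out and the trainer-side scaling described above can be sketched in plain Python. This is an illustration under assumptions, not the orchestrator's code: ``Sample`` is a hypothetical stand-in for ``TrainingSample`` with only two fields, and ``fan_out_temperature`` and ``scale_logits`` are hypothetical helpers.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for TrainingSample (only the fields used here)."""
    completion_ids: list[int]
    # Left empty by interleave_rollout; filled later by the orchestrator.
    completion_temperatures: list[float] = field(default_factory=list)


def fan_out_temperature(samples: list[Sample], temperature: float) -> None:
    """Stamp the env's scalar temperature onto every completion token."""
    for sample in samples:
        sample.completion_temperatures = [temperature] * len(sample.completion_ids)


def scale_logits(logits: list[list[float]], temps: list[float]) -> list[list[float]]:
    """Per-token scaling: each token's logit row is divided by its temperature."""
    return [[x / t for x in row] for row, t in zip(logits, temps)]


samples = [Sample([11, 12, 13]), Sample([21])]
fan_out_temperature(samples, temperature=0.5)
scaled = scale_logits(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    samples[0].completion_temperatures,
)
```

Because the per-token list is still populated before the batch is built, the trainer sees the same shape as before; only the place where it is filled has moved.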

Tests: update tests/unit/orchestrator/test_trajectories.py to assert
`completion_temperatures == []` post-interleave (the fan-out happens
in the orchestrator, not exercised by these tests).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 39c8d29)