
exp: verifiers v1 smoke configs #2637

Draft
mikasenghaas wants to merge 3 commits into main from exp/verifiers-v1

mikasenghaas commented May 26, 2026

Summary

Experiment branch for the verifiers v1 path on prime-rl. Rebased onto main after #2635 (typed RendererConfig) landed, so the diff is now just the bits unique to this branch.

Submodule pins

  • deps/verifiers → a64e5f90 (v0.1.15.dev11 release tag). Sits before #1414, which stuffs a non-JSON-serializable RenderedTokens into trajectory state and breaks v1's state.assert_serializable(). Still contains #1462 (TasksetConfig rework) and #1467 (typed RendererConfig).
  • deps/research-environments → 6f2bfeded (head of PrimeIntellect-ai/research-environments#360, feat/wikispeedia-v1 merged with latest main). Adds wikispeedia to the envs extra and uv workspace.
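
The serialization failure that motivates the verifiers pin can be illustrated with a minimal sketch. The `RenderedTokens` class and `assert_serializable` function below are hypothetical stand-ins written for this example, not the verifiers implementation; they only show why a plain Python object inside state fails a JSON round-trip check.

```python
import json


class RenderedTokens:
    """Hypothetical stand-in for a non-JSON-serializable object."""

    def __init__(self, token_ids):
        self.token_ids = token_ids


def assert_serializable(state: dict) -> None:
    """Illustrative check: raise if the state cannot be dumped to JSON."""
    try:
        json.dumps(state)
    except TypeError as exc:
        raise AssertionError(f"state is not JSON-serializable: {exc}") from exc


ok_state = {"trajectory": [{"tokens": [1, 2, 3]}]}
bad_state = {"trajectory": [{"tokens": RenderedTokens([1, 2, 3])}]}

assert_serializable(ok_state)  # plain lists and dicts pass silently

try:
    assert_serializable(bad_state)
    failed = False
except AssertionError:
    failed = True  # the custom object makes json.dumps raise TypeError
```

Either converting the object to plain lists before it reaches state, or filtering it out before the check, would make the second case pass.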

Smoke configs (2-GPU, Qwen3-4B-Instruct-2507 unless noted)

| config | env id | harness | tools | notes |
| --- | --- | --- | --- | --- |
| configs/reverse_text/v1.toml | reverse-text (v1=true) | vf.Harness | | |
| configs/alphabet_sort/v1.toml | alphabet-sort (v1=true) | vf.Harness | | multi-turn (min/max_turns=2) |
| configs/wikispeedia/rl_qwen3_4b.toml | wikispeedia | vf.Harness | click_link, go_back | |
| configs/wikispeedia/rl_qwen3_4b_rlm.toml | wikispeedia | vf.RLM (via harness.id) | same | matches the general-agent v1 harness dispatch in research-environments#395 |
| configs/general_agent/v1.toml | general-agent | vf.Harness | | depends on research-environments#395 landing (will load as v0 until then) |

Single env id, single load_environment; harness selection is config-driven.
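
As a rough sketch of what config-driven harness selection could look like, the snippet below dispatches on a `harness.id` value inside a single `load_environment`. The `Harness` and `RLM` classes, the `HARNESSES` registry, and the `"default"` key are assumptions made for illustration; only the RLM id string is taken from the commit message on this branch.

```python
class Harness:
    """Stand-in for the default in-process harness."""


class RLM(Harness):
    """Stand-in for the RLM harness."""


# Hypothetical registry keyed by harness id.
HARNESSES = {
    "default": Harness,
    "verifiers.v1.packages.harnesses.rlm": RLM,
}


def load_environment(config: dict) -> Harness:
    """Pick the harness from config["harness"]["id"], falling back to the default."""
    harness_id = config.get("harness", {}).get("id", "default")
    try:
        return HARNESSES[harness_id]()
    except KeyError:
        raise ValueError(f"unknown harness id: {harness_id!r}") from None


# Same entry point, different harness depending only on config.
default_env = load_environment({})
rlm_env = load_environment({"harness": {"id": "verifiers.v1.packages.harnesses.rlm"}})
```

The point of the pattern is that the env id stays the same across both wikispeedia configs; only the harness table in the config changes.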

Temperature plumbing refactor

Single env-wide temperature per rollout — assume the config surface (one [orchestrator.train.env.sampling] block per env), drop the env → orchestrator → trainer round-trip via output["sampling_args"]:

  • Drop sampling_args from REQUIRED_STATE_COLUMNS. The orchestrator already knows each env's temperature from its TrainEnv.sampling_args, so there's no reason to require envs to mirror it back through state. This also unblocks v1 envs, which route sampling_args through state["runtime"] and don't surface it at the top level.
  • interleave_rollout no longer reads output["sampling_args"]["temperature"] or fills completion_temperatures; it leaves the field empty.
  • The orchestrator fans the env's scalar temperature out across each sample's completion tokens in the existing post-process loop (where advantage / reward / env_name / training_mode are already stamped), before constructing the TrainingBatch.
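
The fan-out step in the last bullet can be sketched as follows. `Sample` is a hypothetical stand-in carrying only the two fields that matter here, and `fan_out_temperature` is an illustrative name, not the function used in prime-rl.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for a training sample; only the fields used here."""

    completion_ids: list[int]
    completion_temperatures: list[float] = field(default_factory=list)


def fan_out_temperature(samples: list[Sample], temperature: float) -> None:
    """Stamp one env-wide temperature onto every completion token of each sample."""
    for sample in samples:
        sample.completion_temperatures = [temperature] * len(sample.completion_ids)


# As left by interleaving: completion_temperatures is still empty.
samples = [Sample([11, 12, 13]), Sample([21])]

# The orchestrator knows the env's temperature from its own config.
fan_out_temperature(samples, temperature=0.7)
```

After this step each sample has one temperature per completion token, which is the shape the downstream batch already expects.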

TrainingSample / TrainingBatch wire format is unchanged. Trainer-side per-token temperature scaling (scaled_logits = logits / temps) keeps working as-is.
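
For readers unfamiliar with per-token temperature scaling, here is a small dependency-free sketch of `scaled_logits = logits / temps` followed by a log-softmax. It uses plain Python lists rather than tensors, and the function name is an assumption for this example.

```python
import math


def scaled_log_probs(logits: list[list[float]], temps: list[float]) -> list[list[float]]:
    """Divide each token's logits by that token's temperature, then log-softmax."""
    out = []
    for row, temp in zip(logits, temps):
        scaled = [x / temp for x in row]
        # Subtract the max before exponentiating for numerical stability.
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
        out.append([x - log_z for x in scaled])
    return out


logits = [[2.0, 0.0], [2.0, 0.0]]  # two token positions, two-word vocabulary
temps = [1.0, 2.0]  # second token sampled at a higher temperature
log_probs = scaled_log_probs(logits, temps)
```

With identical logits, the position with the higher temperature ends up with a flatter distribution, which is why the trainer needs the temperature that was actually used at sampling time.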

Verification

End-to-end RL smokes on a 2× RTX PRO 6000 box (single-node, 1 inference + 1 trainer GPU):

| config | model | steps | result |
| --- | --- | --- | --- |
| examples/reverse_text/rl.toml (v0 baseline) | Qwen3-0.6B-Reverse-Text-SFT | 20 | Step 0 reward 0.1427 → Step 19 reward 0.7546 |
| examples/reverse_text/rl.toml w/ args = { v1 = true } | same | 20 | Step 0 0.1411 → Step 19 0.7748 (v0 ≈ v1) |
| configs/alphabet_sort/v1.toml | Qwen3-4B-Instruct-2507 | 2 | Step 0 reward 0.7093, Step 1 0.6649, Orchestrator finished |
| configs/general_agent/v1.toml | Qwen3-4B-Instruct-2507 | 2 | Runs cleanly (needs research-environments#395 checked out locally to exercise the v1 harness) |
  • uv run pytest tests/unit/orchestrator tests/unit/train -q — 205 passed, 4 pre-existing GPU-env failures (tests/unit/train/models/test_qwen3_5_moe* need CUDA visible to pytest; fail identically on origin/main).
  • uv run pytest tests/unit/orchestrator/test_trajectories.py -q — 20 passed (assertions updated to reflect that completion_temperatures is filled post-interleave by the orchestrator, not by interleave_rollout).
  • uv run rl @ configs/<each>.toml --dry-run — all 5 smoke configs resolve cleanly.

Notes

  • reverse-text-rlm is intentionally omitted: the upstream v1 reverse_text_v1.load_environment hardcodes vf.Harness (no harness.id dispatch yet), so swapping in vf.RLM would require an upstream patch.
  • The verifiers pin sits before #1414. When upstream either makes RenderedTokens JSON-serializable or filters it out before state.assert_serializable(), we can bump to latest main again.
  • wiki-search v1 is not included as a smoke config: its v1 entry calls state.get_endpoint_config(api="chat") and hands the result to a plain AsyncOpenAI chat client, but prime-rl's rollout endpoint is the renderer-client interception server which expects per-request rollout IDs. Plain chat calls 404 with "Rollout not found". That's a separate v1 chat-endpoint plumbing gap for a follow-up.
  • Depends on PrimeIntellect-ai/research-environments#360 landing (or being re-pinned) for the wikispeedia configs. Submodule pointer is on the PR head until then.
  • The general-agent v1 config depends on PrimeIntellect-ai/research-environments#395 landing for the v1 port. Until then, id = "general-agent" resolves to the v0 entry point.

@mikasenghaas mikasenghaas changed the title exp: verifiers v1 smoke configs (reverse-text + wikispeedia, default + RLM) exp: verifiers v1 smoke configs (reverse-text + wikispeedia, default + RLM via harness.id) May 26, 2026
@mikasenghaas mikasenghaas changed the title exp: verifiers v1 smoke configs (reverse-text + wikispeedia, default + RLM via harness.id) exp: verifiers v1 smoke configs May 26, 2026
mikasenghaas and others added 3 commits May 26, 2026 20:06
… + add wikispeedia

- deps/verifiers -> a64e5f90 (v0.1.15.dev11 release tag). Sits *before*
  #1414 (per-token prompt attribution to TrajectoryStep), which stuffs a
  non-JSON-serializable RenderedTokens into trajectory state and breaks
  v1's state.assert_serializable(). The pin still contains #1462
  (TasksetConfig rework) and #1467 (typed RendererConfig).
- deps/research-environments -> 6f2bfeded (head of
  PrimeIntellect-ai/research-environments#360, feat/wikispeedia-v1 +
  origin/main merged): pulls in the wikispeedia v1 port with a CLI-
  configurable harness.
- Add `wikispeedia` to the `envs` extra and the uv workspace so the env
  resolves through `uv run`.

Co-authored-by: Cursor <cursoragent@cursor.com>
Five minimal 2-GPU smoke configs covering the v1 envs we want to test
on this branch. All use `args = { v1 = true }` or v1-shaped EnvConfig.

- configs/reverse_text/v1.toml — `reverse-text` env, in-process
  vf.Harness, no tools. Qwen3-4B-Instruct-2507.
- configs/alphabet_sort/v1.toml — `alphabet-sort` env, multi-turn
  (min_turns=max_turns=2), in-process vf.Harness. Qwen3-4B.
- configs/wikispeedia/rl_qwen3_4b.toml — `wikispeedia` env, in-process
  vf.Harness with click_link/go_back tools.
- configs/wikispeedia/rl_qwen3_4b_rlm.toml — same `wikispeedia` env id,
  harness swapped to RLM via `[orchestrator.train.env.args.config.harness]
  id = "verifiers.v1.packages.harnesses.rlm"`. Matches the harness-
  dispatch pattern from research-environments#395.
- configs/general_agent/v1.toml — `general-agent` env (depends on
  research-environments#395 — currently lands as v0 until #395 ships).
  In-process vf.Harness, ac freq=1 to keep the 4B trainer in memory.

Co-authored-by: Cursor <cursoragent@cursor.com>
Single env-wide temperature per rollout — assume the config surface
(one [orchestrator.train.env.sampling] block per env), drop the
output -> trajectories round-trip:

- Drop `sampling_args` from REQUIRED_STATE_COLUMNS. The orchestrator
  already knows each env's temperature from its TrainEnv.sampling_args,
  so there's no reason to require envs to mirror it back through state.
  This also unblocks v1 envs, which route sampling_args through
  state["runtime"] and don't surface it at the top level.
- interleave_rollout no longer reads output["sampling_args"]["temperature"]
  or fills `completion_temperatures`; it leaves the field empty.
- The orchestrator fans the env's scalar temperature out across each
  sample's completion tokens in the existing post-process loop (where
  advantage / reward / env_name / training_mode are already stamped),
  before constructing the TrainingBatch.

TrainingSample / TrainingBatch wire format is unchanged. Trainer-side
per-token temperature scaling (scaled_logits = logits / temps) keeps
working as-is.

Tests: update tests/unit/orchestrator/test_trajectories.py to assert
`completion_temperatures == []` post-interleave (the fan-out happens
in the orchestrator, not exercised by these tests).

Co-authored-by: Cursor <cursoragent@cursor.com>