exp: verifiers v1 smoke configs by mikasenghaas · Pull Request #2637 · PrimeIntellect-ai/prime-rl

mikasenghaas · 2026-05-26T00:07:23Z

Summary

Experiment branch for the verifiers v1 path on prime-rl. Rebased onto main after #2635 (typed RendererConfig) landed, so the diff is now just the bits unique to this branch.

Submodule pins

deps/verifiers → a64e5f90 (v0.1.15.dev11 release tag). Sits before #1414, which stuffs a non-JSON-serializable RenderedTokens into trajectory state and breaks v1's state.assert_serializable(). Still contains #1462 (TasksetConfig rework) and #1467 (typed RendererConfig).
deps/research-environments → 6f2bfeded (head of PrimeIntellect-ai/research-environments#360, feat/wikispeedia-v1 merged with latest main). Adds wikispeedia to the envs extra and uv workspace.
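
The serialization failure that motivates the verifiers pin can be illustrated with a minimal sketch. `RenderedTokensStandIn` and this `assert_serializable` helper are hypothetical stand-ins, not the verifiers implementation. They only assume that the check amounts to a JSON round-trip, which is enough to show why a single non-serializable object anywhere in trajectory state fails it.

```python
import json
from dataclasses import dataclass


@dataclass
class RenderedTokensStandIn:
    """Hypothetical stand-in for a non-JSON-serializable object kept in state."""

    token_ids: list[int]


def assert_serializable(state: dict) -> None:
    """Hypothetical check: fail if the state cannot be encoded as JSON."""
    try:
        json.dumps(state)
    except TypeError as exc:
        raise AssertionError(f"state is not JSON-serializable: {exc}") from exc


# Only dicts, lists, strings and ints: the check passes silently.
plain_state = {"prompt": "abc", "trajectory": [{"tokens": [1, 2, 3]}]}
assert_serializable(plain_state)

# A custom object nested in the trajectory makes json.dumps raise TypeError.
bad_state = {"trajectory": [{"tokens": RenderedTokensStandIn([1, 2, 3])}]}
try:
    assert_serializable(bad_state)
    failed = False
except AssertionError:
    failed = True
```

Either fix mentioned in the notes (making the object serializable, or filtering it out before the check) would make the second call pass in this sketch as well.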

Smoke configs (2-GPU, Qwen3-4B-Instruct-2507 unless noted)

| config | env id | harness | tools | notes |
| --- | --- | --- | --- | --- |
| `configs/reverse_text/v1.toml` | `reverse-text` (`v1=true`) | `vf.Harness` | none | |
| `configs/alphabet_sort/v1.toml` | `alphabet-sort` (`v1=true`) | `vf.Harness` | none | multi-turn (min/max_turns=2) |
| `configs/wikispeedia/rl_qwen3_4b.toml` | `wikispeedia` | `vf.Harness` | `click_link`, `go_back` | |
| `configs/wikispeedia/rl_qwen3_4b_rlm.toml` | `wikispeedia` | `vf.RLM` (via `harness.id`) | same | matches the general-agent v1 harness dispatch in research-environments#395 |
| `configs/general_agent/v1.toml` | `general-agent` | `vf.Harness` | none | depends on research-environments#395 landing (will load as v0 until then) |

Single env id, single load_environment; harness selection is config-driven.

Temperature plumbing refactor

Each rollout uses a single env-wide temperature. This change assumes the config surface (one [orchestrator.train.env.sampling] block per env) and drops the env → orchestrator → trainer round-trip via output["sampling_args"]:

Drop sampling_args from REQUIRED_STATE_COLUMNS. The orchestrator already knows each env's temperature from its TrainEnv.sampling_args, so there's no reason to require envs to mirror it back through state. This also unblocks v1 envs, which route sampling_args through state["runtime"] and don't surface it at the top level.
interleave_rollout no longer reads output["sampling_args"]["temperature"] or fills completion_temperatures; it leaves the field empty.
The orchestrator fans the env's scalar temperature out across each sample's completion tokens in the existing post-process loop (where advantage / reward / env_name / training_mode are already stamped), before constructing the TrainingBatch.
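
The fan-out step described above can be sketched roughly as follows. `Sample` is a hypothetical stand-in that models only the two fields needed here, not the real `TrainingSample`, and `stamp_temperature` is an illustrative name rather than a function in prime-rl.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for a training sample (two fields only)."""

    completion_ids: list[int]
    # Left empty by interleaving; filled later by the orchestrator.
    completion_temperatures: list[float] = field(default_factory=list)


def stamp_temperature(samples: list[Sample], temperature: float) -> list[Sample]:
    """Repeat the env's scalar temperature once per completion token."""
    for sample in samples:
        sample.completion_temperatures = [temperature] * len(sample.completion_ids)
    return samples


samples = [Sample([11, 12, 13]), Sample([21])]
stamp_temperature(samples, 0.7)
```

Because the per-token list is rebuilt from a scalar the orchestrator already holds, the env never has to echo its sampling arguments back through state.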

TrainingSample / TrainingBatch wire format is unchanged. Trainer-side per-token temperature scaling (scaled_logits = logits / temps) keeps working as-is.
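
For reference, per-token temperature scaling of the form `scaled_logits = logits / temps` can be sketched in plain Python. This is an illustrative version on nested lists under the assumption of one temperature per token position. The trainer presumably does the same on tensors, and `scale_logits` is a hypothetical name.

```python
def scale_logits(logits: list[list[float]], temps: list[float]) -> list[list[float]]:
    """Divide each token position's logit row by that position's temperature."""
    if len(logits) != len(temps):
        raise ValueError("need exactly one temperature per token position")
    return [[value / temp for value in row] for row, temp in zip(logits, temps)]


# Two token positions, vocabulary of size two.
logits = [[2.0, 4.0], [1.0, 3.0]]
temps = [2.0, 0.5]
scaled = scale_logits(logits, temps)
```

When every position in a sample carries the same temperature, as after the orchestrator fan-out, this reduces to dividing the whole sample's logits by one scalar.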

Verification

End-to-end RL smokes on a 2× RTX PRO 6000 box (single-node, 1 inference + 1 trainer GPU):

| config | model | steps | result |
| --- | --- | --- | --- |
| `examples/reverse_text/rl.toml` (v0 baseline) | Qwen3-0.6B-Reverse-Text-SFT | 20 | Step 0 reward 0.1427 → Step 19 reward 0.7546 |
| `examples/reverse_text/rl.toml` w/ `args = { v1 = true }` | same | 20 | Step 0 reward 0.1411 → Step 19 reward 0.7748 (v0 ≈ v1) |
| `configs/alphabet_sort/v1.toml` | Qwen3-4B-Instruct-2507 | 2 | Step 0 reward 0.7093, Step 1 reward 0.6649, `Orchestrator finished` |
| `configs/general_agent/v1.toml` | Qwen3-4B-Instruct-2507 | 2 | Runs cleanly (needs research-environments#395 checked out locally to exercise the v1 harness) |

uv run pytest tests/unit/orchestrator tests/unit/train -q — 205 passed, 4 pre-existing GPU-env failures (tests/unit/train/models/test_qwen3_5_moe* need CUDA visible to pytest; fail identically on origin/main).
uv run pytest tests/unit/orchestrator/test_trajectories.py -q — 20 passed (assertions updated to reflect that completion_temperatures is filled post-interleave by the orchestrator, not by interleave_rollout).
uv run rl @ configs/<each>.toml --dry-run — all 5 smoke configs resolve cleanly.

Notes

reverse-text-rlm is intentionally omitted: the upstream v1 reverse_text_v1.load_environment hardcodes vf.Harness (no harness.id dispatch yet), so swapping in vf.RLM would require an upstream patch.
The verifiers pin sits before #1414. When upstream either makes RenderedTokens JSON-serializable or filters it out before state.assert_serializable(), we can bump to latest main again.
wiki-search v1 is not included as a smoke config: its v1 entry calls state.get_endpoint_config(api="chat") and hands the result to a plain AsyncOpenAI chat client, but prime-rl's rollout endpoint is the renderer-client interception server which expects per-request rollout IDs. Plain chat calls 404 with "Rollout not found". That's a separate v1 chat-endpoint plumbing gap for a follow-up.
Depends on PrimeIntellect-ai/research-environments#360 landing (or being re-pinned) for the wikispeedia configs. Submodule pointer is on the PR head until then.
The general-agent v1 config depends on PrimeIntellect-ai/research-environments#395 landing for the v1 port. Until then, id = "general-agent" resolves to the v0 entry point.

… + add wikispeedia

- deps/verifiers -> a64e5f90 (v0.1.15.dev11 release tag). Sits *before* #1414 (per-token prompt attribution to TrajectoryStep), which stuffs a non-JSON-serializable RenderedTokens into trajectory state and breaks v1's state.assert_serializable(). The pin still contains #1462 (TasksetConfig rework) and #1467 (typed RendererConfig).
- deps/research-environments -> 6f2bfeded (head of PrimeIntellect-ai/research-environments#360, feat/wikispeedia-v1 + origin/main merged): pulls in the wikispeedia v1 port with a CLI-configurable harness.
- Add `wikispeedia` to the `envs` extra and the uv workspace so the env resolves through `uv run`.

Co-authored-by: Cursor <cursoragent@cursor.com>

Five minimal 2-GPU smoke configs covering the v1 envs we want to test on this branch. All use `args = { v1 = true }` or v1-shaped EnvConfig.

- configs/reverse_text/v1.toml: `reverse-text` env, in-process vf.Harness, no tools. Qwen3-4B-Instruct-2507.
- configs/alphabet_sort/v1.toml: `alphabet-sort` env, multi-turn (min_turns=max_turns=2), in-process vf.Harness. Qwen3-4B.
- configs/wikispeedia/rl_qwen3_4b.toml: `wikispeedia` env, in-process vf.Harness with click_link/go_back tools.
- configs/wikispeedia/rl_qwen3_4b_rlm.toml: same `wikispeedia` env id, harness swapped to RLM via `[orchestrator.train.env.args.config.harness] id = "verifiers.v1.packages.harnesses.rlm"`. Matches the harness-dispatch pattern from research-environments#395.
- configs/general_agent/v1.toml: `general-agent` env (depends on research-environments#395; currently lands as v0 until #395 ships). In-process vf.Harness, ac freq=1 to keep the 4B trainer in memory.

Co-authored-by: Cursor <cursoragent@cursor.com>

Single env-wide temperature per rollout: assume the config surface (one [orchestrator.train.env.sampling] block per env), drop the output -> trajectories round-trip:

- Drop `sampling_args` from REQUIRED_STATE_COLUMNS. The orchestrator already knows each env's temperature from its TrainEnv.sampling_args, so there's no reason to require envs to mirror it back through state. This also unblocks v1 envs, which route sampling_args through state["runtime"] and don't surface it at the top level.
- interleave_rollout no longer reads output["sampling_args"]["temperature"] or fills `completion_temperatures`; it leaves the field empty.
- The orchestrator fans the env's scalar temperature out across each sample's completion tokens in the existing post-process loop (where advantage / reward / env_name / training_mode are already stamped), before constructing the TrainingBatch.

TrainingSample / TrainingBatch wire format is unchanged. Trainer-side per-token temperature scaling (scaled_logits = logits / temps) keeps working as-is.

Tests: update tests/unit/orchestrator/test_trajectories.py to assert `completion_temperatures == []` post-interleave (the fan-out happens in the orchestrator, not exercised by these tests).

Co-authored-by: Cursor <cursoragent@cursor.com>

mikasenghaas force-pushed the exp/verifiers-v1 branch from 376c87c to 95b3203 Compare May 26, 2026 00:23

mikasenghaas changed the title ~~exp: verifiers v1 smoke configs (reverse-text + wikispeedia, default + RLM)~~ exp: verifiers v1 smoke configs (reverse-text + wikispeedia, default + RLM via harness.id) May 26, 2026

mikasenghaas changed the title ~~exp: verifiers v1 smoke configs (reverse-text + wikispeedia, default + RLM via harness.id)~~ exp: verifiers v1 smoke configs May 26, 2026

mikasenghaas and others added 3 commits May 26, 2026 20:06

mikasenghaas force-pushed the exp/verifiers-v1 branch from b521150 to 39c8d29 Compare May 26, 2026 20:07

exp: verifiers v1 smoke configs #2637
mikasenghaas wants to merge 3 commits into main from exp/verifiers-v1
