exp: verifiers v1 smoke configs #2637
Draft
mikasenghaas wants to merge 3 commits into
… + add wikispeedia

- deps/verifiers -> a64e5f90 (v0.1.15.dev11 release tag). Sits *before* #1414 (per-token prompt attribution to TrajectoryStep), which stuffs a non-JSON-serializable RenderedTokens into trajectory state and breaks v1's state.assert_serializable(). The pin still contains #1462 (TasksetConfig rework) and #1467 (typed RendererConfig).
- deps/research-environments -> 6f2bfeded (head of PrimeIntellect-ai/research-environments#360, feat/wikispeedia-v1 + origin/main merged): pulls in the wikispeedia v1 port with a CLI-configurable harness.
- Add `wikispeedia` to the `envs` extra and the uv workspace so the env resolves through `uv run`.

Co-authored-by: Cursor <cursoragent@cursor.com>
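The serializability failure that motivates this pin can be reproduced in miniature. The sketch below is not verifiers code: `RenderedTokens` is a hypothetical stand-in dataclass, and `assert_serializable` is a hypothetical helper written on the assumption that a check like `state.assert_serializable()` amounts to a JSON dump of the state.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RenderedTokens:
    """Hypothetical stand-in for a non-JSON-serializable object in state."""
    token_ids: list[int] = field(default_factory=list)


def assert_serializable(state: dict) -> None:
    """Hypothetical check: raise if `state` cannot be dumped to JSON."""
    try:
        json.dumps(state)
    except TypeError as exc:
        raise AssertionError(f"state is not JSON-serializable: {exc}") from exc


def is_serializable(state: dict) -> bool:
    """Convenience wrapper that reports the check result as a bool."""
    try:
        assert_serializable(state)
    except AssertionError:
        return False
    return True


# Plain lists and dicts pass; a dataclass instance inside the state does not.
plain_state = {"trajectory": [{"prompt_ids": [1, 2, 3]}]}
broken_state = {"trajectory": [{"prompt_ids": RenderedTokens([1, 2, 3])}]}

print(is_serializable(plain_state))   # True
print(is_serializable(broken_state))  # False
```

Under that assumption, any trajectory step that carries such an object fails the check, which is why the pin sits before #1414.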
Five minimal 2-GPU smoke configs covering the v1 envs we want to test
on this branch. All use `args = { v1 = true }` or v1-shaped EnvConfig.
- configs/reverse_text/v1.toml — `reverse-text` env, in-process
vf.Harness, no tools. Qwen3-4B-Instruct-2507.
- configs/alphabet_sort/v1.toml — `alphabet-sort` env, multi-turn
(min_turns=max_turns=2), in-process vf.Harness. Qwen3-4B.
- configs/wikispeedia/rl_qwen3_4b.toml — `wikispeedia` env, in-process
vf.Harness with click_link/go_back tools.
- configs/wikispeedia/rl_qwen3_4b_rlm.toml — same `wikispeedia` env id,
harness swapped to RLM via `[orchestrator.train.env.args.config.harness]
id = "verifiers.v1.packages.harnesses.rlm"`. Matches the harness-
dispatch pattern from research-environments#395.
- configs/general_agent/v1.toml — `general-agent` env (depends on
research-environments#395 — currently lands as v0 until #395 ships).
In-process vf.Harness, ac freq=1 to keep the 4B trainer in memory.
Co-authored-by: Cursor <cursoragent@cursor.com>
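The harness swap in the two wikispeedia configs can be pictured as a lookup keyed by the `id` string. This is only a sketch of the idea: the `Harness` and `RLM` classes, the `HARNESSES` registry, and this `load_environment` signature are all hypothetical stand-ins, not the verifiers or research-environments implementation.

```python
class Harness:
    """Hypothetical default harness (stand-in for vf.Harness)."""
    name = "harness"


class RLM(Harness):
    """Hypothetical RLM harness (stand-in for the RLM harness package)."""
    name = "rlm"


# Hypothetical registry keyed by the `id` string taken from the TOML config.
HARNESSES = {
    "default": Harness,
    "verifiers.v1.packages.harnesses.rlm": RLM,
}


def load_environment(config: dict) -> Harness:
    """Single entry point: pick the harness from config, else the default."""
    harness_id = config.get("harness", {}).get("id", "default")
    try:
        return HARNESSES[harness_id]()
    except KeyError:
        raise ValueError(f"unknown harness id: {harness_id!r}") from None


default_env = load_environment({})
rlm_env = load_environment({"harness": {"id": "verifiers.v1.packages.harnesses.rlm"}})
print(default_env.name, rlm_env.name)  # harness rlm
```

The point of the pattern is that both configs share one env id and one entry point, and only the config block differs.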
Single env-wide temperature per rollout: assume the config surface (one [orchestrator.train.env.sampling] block per env), drop the output -> trajectories round-trip:

- Drop `sampling_args` from REQUIRED_STATE_COLUMNS. The orchestrator already knows each env's temperature from its TrainEnv.sampling_args, so there's no reason to require envs to mirror it back through state. This also unblocks v1 envs, which route sampling_args through state["runtime"] and don't surface it at the top level.
- interleave_rollout no longer reads output["sampling_args"]["temperature"] or fills `completion_temperatures`; it leaves the field empty.
- The orchestrator fans the env's scalar temperature out across each sample's completion tokens in the existing post-process loop (where advantage / reward / env_name / training_mode are already stamped), before constructing the TrainingBatch.

TrainingSample / TrainingBatch wire format is unchanged. Trainer-side per-token temperature scaling (scaled_logits = logits / temps) keeps working as-is.

Tests: update tests/unit/orchestrator/test_trajectories.py to assert `completion_temperatures == []` post-interleave (the fan-out happens in the orchestrator, not exercised by these tests).

Co-authored-by: Cursor <cursoragent@cursor.com>
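The fan-out and the trainer-side scaling described in this commit can be sketched in a few lines. The `Sample` class and both helper functions are hypothetical and use plain Python lists instead of tensors; only the field name `completion_temperatures` and the formula `scaled_logits = logits / temps` come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical, trimmed-down stand-in for a TrainingSample."""
    completion_ids: list[int]
    # Left empty after interleaving; filled later by the orchestrator.
    completion_temperatures: list[float] = field(default_factory=list)


def fan_out_temperature(samples: list[Sample], temperature: float) -> None:
    """Stamp the env-wide scalar temperature onto every completion token."""
    for sample in samples:
        sample.completion_temperatures = [temperature] * len(sample.completion_ids)


def scale_logits(logits: list[list[float]], temps: list[float]) -> list[list[float]]:
    """Per-token temperature scaling: scaled_logits = logits / temps."""
    return [[x / t for x in row] for row, t in zip(logits, temps)]


samples = [Sample(completion_ids=[11, 12, 13]), Sample(completion_ids=[21])]
fan_out_temperature(samples, temperature=0.5)
print(samples[0].completion_temperatures)  # [0.5, 0.5, 0.5]
print(scale_logits([[1.0, 2.0]], samples[1].completion_temperatures))  # [[2.0, 4.0]]
```

Because the per-token list is still populated before the batch is built, the trainer sees the same shape of data as before.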
Summary
Experiment branch for the verifiers v1 path on prime-rl. Rebased onto `main` after #2635 (typed `RendererConfig`) landed, so the diff is now just the bits unique to this branch.

Submodule pins
- `deps/verifiers` → `a64e5f90` (v0.1.15.dev11 release tag). Sits before #1414, which stuffs a non-JSON-serializable `RenderedTokens` into trajectory state and breaks v1's `state.assert_serializable()`. Still contains #1462 (TasksetConfig rework) and #1467 (typed RendererConfig).
- `deps/research-environments` → `6f2bfeded` (head of PrimeIntellect-ai/research-environments#360, `feat/wikispeedia-v1` merged with latest `main`). Adds `wikispeedia` to the `envs` extra and uv workspace.

Smoke configs (2-GPU, Qwen3-4B-Instruct-2507 unless noted)
| Config | Env | Harness | Tools |
|---|---|---|---|
| `configs/reverse_text/v1.toml` | `reverse-text` (`v1=true`) | `vf.Harness` | |
| `configs/alphabet_sort/v1.toml` | `alphabet-sort` (`v1=true`) | `vf.Harness` | |
| `configs/wikispeedia/rl_qwen3_4b.toml` | `wikispeedia` | `vf.Harness` | `click_link`, `go_back` |
| `configs/wikispeedia/rl_qwen3_4b_rlm.toml` | `wikispeedia` | `vf.RLM` (via `harness.id`) | |
| `configs/general_agent/v1.toml` | `general-agent` | `vf.Harness` | |

Single env id, single `load_environment`; harness selection is config-driven.

Temperature plumbing refactor
Single env-wide temperature per rollout: assume the config surface (one `[orchestrator.train.env.sampling]` block per env), drop the env → orchestrator → trainer round-trip via `output["sampling_args"]`:

- Drop `sampling_args` from `REQUIRED_STATE_COLUMNS`. The orchestrator already knows each env's temperature from its `TrainEnv.sampling_args`, so there's no reason to require envs to mirror it back through state. This also unblocks v1 envs, which route `sampling_args` through `state["runtime"]` and don't surface it at the top level.
- `interleave_rollout` no longer reads `output["sampling_args"]["temperature"]` or fills `completion_temperatures`; it leaves the field empty.
- The orchestrator fans the env's scalar temperature out across each sample's completion tokens in the existing post-process loop (where `advantage` / `reward` / `env_name` / `training_mode` are already stamped), before constructing the `TrainingBatch`.

`TrainingSample` / `TrainingBatch` wire format is unchanged. Trainer-side per-token temperature scaling (`scaled_logits = logits / temps`) keeps working as-is.

Verification
End-to-end RL smokes on a 2× RTX PRO 6000 box (single-node, 1 inference + 1 trainer GPU):

- `examples/reverse_text/rl.toml` (v0 baseline)
- `examples/reverse_text/rl.toml` w/ `args = { v1 = true }`
- `configs/alphabet_sort/v1.toml`: Orchestrator finished
- `configs/general_agent/v1.toml`

- `uv run pytest tests/unit/orchestrator tests/unit/train -q`: 205 passed, 4 pre-existing GPU-env failures (`tests/unit/train/models/test_qwen3_5_moe*` need CUDA visible to pytest; fail identically on `origin/main`).
- `uv run pytest tests/unit/orchestrator/test_trajectories.py -q`: 20 passed (assertions updated to reflect that `completion_temperatures` is filled post-interleave by the orchestrator, not by `interleave_rollout`).
- `uv run rl @ configs/<each>.toml --dry-run`: all 5 smoke configs resolve cleanly.

Notes
- `reverse-text-rlm` is intentionally omitted: the upstream v1 `reverse_text_v1.load_environment` hardcodes `vf.Harness` (no `harness.id` dispatch yet), so swapping in `vf.RLM` would require an upstream patch.
- Once verifiers makes `RenderedTokens` JSON-serializable or filters it out before `state.assert_serializable()`, we can bump to latest main again.
- … `state.get_endpoint_config(api="chat")` and hands the result to a plain `AsyncOpenAI` chat client, but prime-rl's rollout endpoint is the renderer-client interception server, which expects per-request rollout IDs. Plain chat calls 404 with `"Rollout not found"`. That's a separate v1 chat-endpoint plumbing gap for a follow-up.
- Until research-environments#395 ships, `id = "general-agent"` resolves to the v0 entry point.