Train against raw policy logprobs by samsja · Pull Request #2604 · PrimeIntellect-ai/prime-rl

samsja · 2026-05-23T01:02:44Z

Summary

force vLLM to return raw_logprobs
stop replaying sampling temperature in trainer policy logprob/entropy computation
keep fused LM-head signatures compatible while computing raw policy logprobs, and update fused/vanilla tests accordingly
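
The difference between raw and temperature-replayed logprobs can be sketched in plain Python. This is only an illustration, not the trainer's or vLLM's code. It assumes "raw" means a log-softmax over the unscaled logits and that replaying the temperature means dividing the logits by it first. The helper names (`raw_logprobs`, `temperature_logprobs`, `entropy`) are hypothetical and do not come from this PR.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def raw_logprobs(logits):
    """Assumed meaning of 'raw': log-softmax of the unscaled logits."""
    return log_softmax(logits)


def temperature_logprobs(logits, temperature):
    """Assumed meaning of 'replaying' temperature: scale logits, then log-softmax."""
    return log_softmax([x / temperature for x in logits])


def entropy(logprobs):
    """Shannon entropy (in nats) of the distribution given by logprobs."""
    return -sum(math.exp(lp) * lp for lp in logprobs)


logits = [2.0, 1.0, 0.0]                                # toy next-token logits
raw = raw_logprobs(logits)                              # unscaled policy distribution
scaled = temperature_logprobs(logits, temperature=1.5)  # what sampling at 1.5 draws from
```

With a temperature above 1 the scaled distribution is flatter, so its top-token logprob is lower and its entropy is higher than the raw one. If inference reports one kind of logprob and the trainer computes the other, the two sides disagree on every token even when the weights are identical.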

Validation

uv run ruff check packages/prime-rl-configs/src/prime_rl/configs/inference.py src/prime_rl/trainer/rl/train.py src/prime_rl/trainer/models/layers/lm_head.py src/prime_rl/trainer/models/layers/lm_head_gemma.py tests/unit/train/rl/test_fused_lm_head.py tests/unit/train/models/test_nemotron_h_kl.py
uv run pytest tests/unit/train/rl/test_fused_lm_head.py -q -m 'not gpu'
uv run pytest tests/unit/train/rl/test_fused_lm_head.py tests/unit/train/models/test_nemotron_h_kl.py -q -m gpu
uv run python - <<'PY' ... InferenceConfig().to_vllm().logprobs_mode returned raw_logprobs

A reverse-text smoke run at temp=1.5 was attempted, but it was blocked on this workstation before a valid run completed. The pinned vLLM wheel requires libcudart.so.13, and with a temporary CUDA 13 runtime installed, the local NVIDIA 535.171.04 driver failed with "CUDA driver version is insufficient for CUDA runtime version".

fix: train against raw policy logprobs

a1ae77a

samsja wants to merge 1 commit into main from fix/raw-logprobs-policy
