Add strict DPO objective plumbing and preference-pair groundwork by taivu1998 · Pull Request #477 · rllm-org/rllm

taivu1998 · 2026-04-02T22:53:28Z

Summary

This PR adds the first strict DPO implementation slice for the experimental trainer.

It does not claim end-to-end DPO training is fully supported yet. Instead, it adds the trainer-side objective plumbing, strict pair construction, and explicit backend gating needed to support issue #329 without overcomplicating the current architecture.

What Changed

- added rllm.algorithm.objective with rl | dpo
- added a minimal rllm.algorithm.dpo config block
- added strict PreferencePair construction from filtered TrajectoryGroups
- limited pair construction to the safest v1 case:
  - single-step trajectories only
  - best-vs-worst pairing only
  - identical prompt_ids required
  - ties and small reward gaps can be dropped
- threaded preference_pairs through TrainerState
- branched the unified trainer so objective=dpo skips the RL advantage path
- added explicit backend guards:
  - Tinker fails fast for DPO
  - Verl validates DPO prerequisites and fails fast until a verified actor-loss path is available
- made rllm.experimental.verl exports lazy so lightweight imports do not eagerly require the optional verl dependency
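
For reviewers, the strict pair-construction rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Candidate`, the fields on `PreferencePair`, `build_pair`, and `min_reward_gap` are hypothetical names, and returning `None` for a rejected group is an assumption about how strictness is enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one trajectory in a group."""
    prompt_ids: tuple[int, ...]
    response_ids: tuple[int, ...]
    reward: float
    num_steps: int = 1


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical pair record; the PR's PreferencePair may differ."""
    prompt_ids: tuple[int, ...]
    chosen_ids: tuple[int, ...]
    rejected_ids: tuple[int, ...]
    reward_gap: float


def build_pair(group: list[Candidate], min_reward_gap: float = 0.0) -> Optional[PreferencePair]:
    """Return one best-vs-worst pair, or None if any strict check fails."""
    if len(group) < 2:
        return None
    # Single-step trajectories only.
    if any(c.num_steps != 1 for c in group):
        return None
    # Every candidate must share the exact same prompt tokens.
    if any(c.prompt_ids != group[0].prompt_ids for c in group):
        return None
    best = max(group, key=lambda c: c.reward)
    worst = min(group, key=lambda c: c.reward)
    gap = best.reward - worst.reward
    # Drop ties and gaps below the configured threshold.
    if gap <= 0 or gap < min_reward_gap:
        return None
    return PreferencePair(best.prompt_ids, best.response_ids, worst.response_ids, gap)


group = [
    Candidate((1, 2), (3,), reward=1.0),
    Candidate((1, 2), (4,), reward=0.0),
    Candidate((1, 2), (5,), reward=0.5),
]
pair = build_pair(group)
```

In this sketch a group yields at most one pair, so the middle-reward candidate is ignored; that matches the best-vs-worst restriction but discards some preference signal.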

Why This Shape

The current trainer is built around scalar advantages and single-trajectory backend batches. DPO is pairwise and reference-relative, so reusing step.advantage would be misleading and fragile.
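
To make "pairwise and reference-relative" concrete, here is a minimal sketch of the standard per-pair DPO loss from the DPO paper, written in plain Python. It is not taken from this PR; the function name, argument names, and the `beta=0.1` default are illustrative assumptions. Each input is the summed log-probability of a full response under the policy or the frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not the PR's config value
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy favors the chosen response more than the reference does: loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends on two responses and two models at once, which is why a single scalar stored in step.advantage cannot represent it.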

This PR keeps the implementation intentionally small and strict:

- no changes to canonical core data structures
- no BackendProtocol redesign
- no attempt to solve arbitrary multi-step preference alignment in v1

That gives us a clean foundation for issue #329 while keeping the current RL and distillation paths intact.

Tests

Ran:

PYTHONPATH=/tmp/rllm-issue-329 /Users/vuductai/Documents/Projects/rllm/.venv/bin/pytest tests/unified_trainer/test_preference.py tests/test_verl_policy_loss.py -k 'not TestVerlActorPatch and not TestVerlKnownLosses' tests/cli/test_train_command.py::TestBuildTrainConfig -q
/Users/vuductai/Documents/Projects/rllm/.venv/bin/python -m py_compile rllm/experimental/verl/__init__.py rllm/experimental/common/config.py rllm/experimental/common/preference.py rllm/experimental/unified_trainer.py rllm/experimental/verl/verl_backend.py rllm/trainer/tinker/tinker_backend.py tests/unified_trainer/test_preference.py
git diff --check

The full Verl-dependent test subset could not be run in this environment because the optional verl package is not installed locally.
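
A missing optional dependency is the situation the lazy rllm.experimental.verl exports are meant to tolerate. One common way to get that behavior is a module-level `__getattr__` (PEP 562). The sketch below shows the pattern only; it may not match how the PR implements it, and the export table uses standard-library targets as stand-ins for verl-backed names so that it runs without verl installed.

```python
import importlib
import types

# Hypothetical export table: public name -> (module path, attribute name).
# Stdlib targets stand in for verl-backed classes.
_LAZY_EXPORTS = {
    "JSONDecoder": ("json", "JSONDecoder"),
    "Fraction": ("fractions", "Fraction"),
}


def _lazy_getattr(name: str):
    """Resolve an export on first access instead of at package import time."""
    if name not in _LAZY_EXPORTS:
        raise AttributeError(f"module has no attribute {name!r}")
    module_path, attr = _LAZY_EXPORTS[name]
    return getattr(importlib.import_module(module_path), attr)


# In a real package this function would be defined as __getattr__ in __init__.py.
# Here it is attached to a throwaway module object to keep the sketch runnable.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr

decoder_cls = demo_pkg.JSONDecoder  # the underlying import happens on this line
```

With this pattern, importing the package succeeds even when the heavy dependency is absent; the ImportError surfaces only when a verl-backed name is actually accessed.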

Issue

Partially addresses #329.

This PR adds the strict DPO trainer groundwork and explicit support boundaries; a follow-up is still needed to wire a verified Verl-side DPO actor-loss path before end-to-end DPO training can be claimed as supported.
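
The "explicit support boundaries" amount to a fail-fast check before training starts. A minimal sketch of that idea, assuming a simple support table: `UnsupportedObjectiveError`, `validate_objective`, and the table contents are hypothetical and only mirror the behavior described in this PR description, not the actual backend code.

```python
class UnsupportedObjectiveError(NotImplementedError):
    """Raised when a backend cannot train the requested objective."""


# Hypothetical support table mirroring the description above:
# neither backend accepts objective=dpo end to end yet.
_SUPPORTED_OBJECTIVES = {"tinker": {"rl"}, "verl": {"rl"}}
_KNOWN_OBJECTIVES = {"rl", "dpo"}


def validate_objective(backend: str, objective: str) -> None:
    """Fail before any rollout or training work if the combination is unsupported."""
    if objective not in _KNOWN_OBJECTIVES:
        raise ValueError(f"unknown objective {objective!r}; expected one of {sorted(_KNOWN_OBJECTIVES)}")
    if objective not in _SUPPORTED_OBJECTIVES.get(backend, set()):
        raise UnsupportedObjectiveError(
            f"backend {backend!r} does not support objective {objective!r} yet"
        )


validate_objective("verl", "rl")  # passes silently
```

Raising at configuration time keeps an unsupported run from silently falling back to the RL advantage path.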

taivu1998 · 2026-04-19T16:11:17Z

Hi @kylemontgomery1, @listar2000, could you help review this PR? Thanks!

listar2000 · 2026-04-20T06:57:11Z

Hi @taivu1998 thx for the PR! Will take a look later today

Might need to think more about where this code lives, e.g. this seems to fit better as a backend (e.g. VerlDPOBackend), since it's not supporting Tinker for now and it's not a usual RL algo.

But overall, having DPO is definitely a great add-on IMO.

taivu1998 · 2026-05-11T02:58:20Z

Thanks @listar2000, I updated the code with this new design. Could you help take another look? Thanks!

taivu1998 force-pushed the tdv/issue-329-dpo branch from 21ef37d to 6add236 on April 19, 2026 16:08

taivu1998 added 4 commits May 9, 2026 21:19

Add strict DPO trainer plumbing

679dea1

Fix Tinker CI workflow and lint follow-ups

424c18e

fixup! Add strict DPO trainer plumbing

c3336da

Move DPO training into Verl backend

99c406e

taivu1998 force-pushed the tdv/issue-329-dpo branch from 6add236 to 99c406e on May 10, 2026 09:42

taivu1998 wants to merge 4 commits into rllm-org:main from taivu1998:tdv/issue-329-dpo

2 participants
