
Add strict DPO objective plumbing and preference-pair groundwork #477

Open
taivu1998 wants to merge 4 commits into rllm-org:main from taivu1998:tdv/issue-329-dpo

Conversation

@taivu1998

Contributor

Summary

This PR adds the first strict DPO implementation slice for the experimental trainer.

It does not claim end-to-end DPO training is fully supported yet. Instead, it adds the trainer-side objective plumbing, strict pair construction, and explicit backend gating needed to support issue #329 without overcomplicating the current architecture.

What Changed

  • added rllm.algorithm.objective with rl | dpo
  • added a minimal rllm.algorithm.dpo config block
  • added strict PreferencePair construction from filtered TrajectoryGroups
  • limited pair construction to the safest v1 case:
    • single-step trajectories only
    • best-vs-worst pairing only
    • identical prompt_ids required
    • ties and small reward gaps can be dropped
  • threaded preference_pairs through TrainerState
  • branched the unified trainer so objective=dpo skips the RL advantage path
  • added explicit backend guards:
    • Tinker fails fast for DPO
    • Verl validates DPO prerequisites and fails fast until a verified actor-loss path is available
  • made rllm.experimental.verl exports lazy so lightweight imports do not eagerly require the optional verl dependency
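
For reviewers, the strict pairing rules listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `Trajectory` and `PreferencePair` dataclasses here, their field names, the `build_pair` helper, and its `min_reward_gap` parameter are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical stand-in for one trajectory inside a filtered group.
    prompt_ids: tuple
    response_ids: tuple
    reward: float
    num_steps: int = 1


@dataclass(frozen=True)
class PreferencePair:
    # Hypothetical shape; the real PreferencePair may carry more fields.
    prompt_ids: tuple
    chosen_ids: tuple
    rejected_ids: tuple
    reward_gap: float


def build_pair(
    group: Sequence[Trajectory], min_reward_gap: float = 0.0
) -> Optional[PreferencePair]:
    """Return one best-vs-worst pair, or None if the group is not eligible."""
    if len(group) < 2:
        return None
    # Strict v1 rule: single-step trajectories only.
    if any(t.num_steps != 1 for t in group):
        return None
    # Strict v1 rule: every trajectory must share identical prompt_ids.
    if any(t.prompt_ids != group[0].prompt_ids for t in group):
        return None
    best = max(group, key=lambda t: t.reward)
    worst = min(group, key=lambda t: t.reward)
    gap = best.reward - worst.reward
    # Drop ties and pairs whose reward gap is below the threshold.
    if gap <= 0.0 or gap < min_reward_gap:
        return None
    return PreferencePair(best.prompt_ids, best.response_ids, worst.response_ids, gap)
```

Returning `None` instead of raising keeps ineligible groups cheap to skip, which matches the "can be dropped" wording above; whether the real implementation drops or raises is not something this sketch asserts.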

Why This Shape

The current trainer is built around scalar advantages and single-trajectory backend batches. DPO is pairwise and reference-relative, so reusing step.advantage would be misleading and fragile.
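
To make "pairwise and reference-relative" concrete, here is a small sketch of the standard per-pair DPO loss, computed from summed sequence log-probabilities under the policy and a frozen reference model. The function name, argument names, and the `beta=0.1` default are illustrative assumptions, not values taken from this PR.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, for illustration only
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # Each term is measured relative to the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss needs four log-probabilities per example (chosen and rejected, policy and reference) and has no per-trajectory scalar that would map naturally onto `step.advantage`.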

This PR keeps the implementation intentionally small and strict:

  • no changes to canonical core data structures
  • no BackendProtocol redesign
  • no attempt to solve arbitrary multi-step preference alignment in v1

That gives us a clean foundation for issue #329 while keeping the current RL and distillation paths intact.

Tests

Ran:

  • PYTHONPATH=/tmp/rllm-issue-329 /Users/vuductai/Documents/Projects/rllm/.venv/bin/pytest tests/unified_trainer/test_preference.py tests/test_verl_policy_loss.py -k 'not TestVerlActorPatch and not TestVerlKnownLosses' tests/cli/test_train_command.py::TestBuildTrainConfig -q
  • /Users/vuductai/Documents/Projects/rllm/.venv/bin/python -m py_compile rllm/experimental/verl/__init__.py rllm/experimental/common/config.py rllm/experimental/common/preference.py rllm/experimental/unified_trainer.py rllm/experimental/verl/verl_backend.py rllm/trainer/tinker/tinker_backend.py tests/unified_trainer/test_preference.py
  • git diff --check

The full Verl-dependent test subset could not be run in this environment because the optional verl package is not installed locally.

Issue

Partially addresses #329.

This PR adds the strict DPO trainer groundwork and explicit support boundaries; a follow-up is still needed to wire a verified Verl-side DPO actor-loss path before end-to-end DPO training can be claimed as supported.
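
As a rough picture of what "explicit support boundaries" means in practice, a fail-fast check could look like the sketch below. The `UnsupportedObjectiveError` class, the `SUPPORTED_OBJECTIVES` table, and `validate_objective` are hypothetical names used for illustration; the actual guards live in the backend files listed under Tests and may be structured differently.

```python
class UnsupportedObjectiveError(NotImplementedError):
    """Raised when a backend cannot train with the requested objective."""


# Hypothetical support table mirroring the PR description:
# neither backend accepts DPO until a verified loss path exists.
SUPPORTED_OBJECTIVES = {
    "tinker": {"rl"},
    "verl": {"rl"},
}


def validate_objective(backend: str, objective: str) -> None:
    """Fail fast at configuration time, before any rollout or training work."""
    if objective not in ("rl", "dpo"):
        raise ValueError(f"unknown objective {objective!r}; expected 'rl' or 'dpo'")
    if objective not in SUPPORTED_OBJECTIVES.get(backend, set()):
        raise UnsupportedObjectiveError(
            f"backend {backend!r} does not support objective={objective!r} yet"
        )
```

Under this sketch, the follow-up would amount to adding `"dpo"` to the `"verl"` entry once the actor-loss path has been verified.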

@taivu1998

Contributor Author

Hi @kylemontgomery1, @listar2000, could you help review this PR? Thanks!

@listar2000

Collaborator

Hi @taivu1998 thx for the PR! Will take a look later today

Might need to think more about where this code lives, e.g. this seems to fit better as a separate backend (e.g. VerlDPOBackend), since it's not supporting Tinker for now and is not a usual RL algo.

But overall having DPO is definitely a great add-on IMO.

@taivu1998 force-pushed the tdv/issue-329-dpo branch from 6add236 to 99c406e on May 10, 2026 09:42
@taivu1998

Contributor Author

Thanks @listar2000, I updated the code with this new design. Could you help take another look? Thanks!

