Add strict DPO objective plumbing and preference-pair groundwork #477
Open
taivu1998 wants to merge 4 commits into
21ef37d to 6add236
Contributor
Author
Hi @kylemontgomery1, @listar2000, could you help review this PR? Thanks!
Collaborator
Hi @taivu1998, thx for the PR! Will take a look later today. Might need to think more about where this code lives, e.g. this seems to fit in as a better backend (e.g. …). But overall having DPO is definitely a great add-on IMO.
6add236 to 99c406e
Contributor
Author
Thanks @listar2000, I updated the code with this new design. Could you help take another look? Thanks!
Summary
This PR adds the first strict DPO implementation slice for the experimental trainer.
It does not claim end-to-end DPO training is fully supported yet. Instead, it adds the trainer-side objective plumbing, strict pair construction, and explicit backend gating needed to support issue #329 without overcomplicating the current architecture.
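As a rough illustration of what "strict pair construction" could look like, here is a minimal sketch. It is not the code in this PR: the `Trajectory` stand-in, the `PreferencePair` field names, the `build_preference_pair` helper, and the "highest reward vs. lowest reward" selection rule are all assumptions made for the example. Only the `PreferencePair` name and the `prompt_ids` requirement come from the PR description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical stand-in for a trajectory; not the rllm class."""
    prompt_ids: Optional[tuple[int, ...]]
    response_ids: tuple[int, ...]
    reward: float


@dataclass(frozen=True)
class PreferencePair:
    """Field names are assumptions for illustration."""
    prompt_ids: tuple[int, ...]
    chosen_ids: tuple[int, ...]
    rejected_ids: tuple[int, ...]


def build_preference_pair(group: Sequence[Trajectory]) -> PreferencePair:
    """Build one pair from trajectories that share a prompt, failing loudly."""
    if len(group) < 2:
        raise ValueError("need at least two trajectories to form a pair")
    if any(t.prompt_ids is None for t in group):
        raise ValueError("prompt_ids is required on every trajectory")
    prompt = group[0].prompt_ids
    if any(t.prompt_ids != prompt for t in group):
        raise ValueError("all trajectories in a group must share prompt_ids")
    # Assumed selection rule: best reward is chosen, worst is rejected.
    chosen = max(group, key=lambda t: t.reward)
    rejected = min(group, key=lambda t: t.reward)
    if chosen.reward == rejected.reward:
        raise ValueError("no preference signal: rewards are tied")
    return PreferencePair(prompt, chosen.response_ids, rejected.response_ids)
```

The point of the strictness is that a malformed group raises immediately instead of silently producing a pair with a mismatched or missing prompt.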
What Changed
- `rllm.algorithm.objective` with `rl | dpo`
- `rllm.algorithm.dpo` config block
- `PreferencePair` construction from filtered `TrajectoryGroups`
- `prompt_ids` required
- `preference_pairs` through `TrainerState`
- `objective=dpo` skips the RL advantage path
- `rllm.experimental.verl` exports lazy so lightweight imports do not eagerly require the optional `verl` dependency
Why This Shape
The current trainer is built around scalar advantages and single-trajectory backend batches. DPO is pairwise and reference-relative, so reusing `step.advantage` would be misleading and fragile.
This PR keeps the implementation intentionally small and strict:
- […] `BackendProtocol` redesign
That gives us a clean foundation for issue #329 while keeping the current RL and distillation paths intact.
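To make the "pairwise and reference-relative" point concrete, here is a minimal sketch of the standard DPO loss for a single pair, computed from summed sequence log-probabilities. It is not code from this PR: the function name `dpo_pair_loss`, its argument names, and the `beta=0.1` default are illustrative assumptions. Note that the loss needs four numbers per pair (policy and reference, chosen and rejected), which is why a single per-step scalar advantage does not fit.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair."""
    # Log-ratio of policy to reference for each side of the pair.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference on both responses, the margin is zero and the loss is log 2 (about 0.693); it drops below that as the policy favors the chosen response more than the reference does.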
Tests
Ran:
- `PYTHONPATH=/tmp/rllm-issue-329 /Users/vuductai/Documents/Projects/rllm/.venv/bin/pytest tests/unified_trainer/test_preference.py tests/test_verl_policy_loss.py -k 'not TestVerlActorPatch and not TestVerlKnownLosses' tests/cli/test_train_command.py::TestBuildTrainConfig -q`
- `/Users/vuductai/Documents/Projects/rllm/.venv/bin/python -m py_compile rllm/experimental/verl/__init__.py rllm/experimental/common/config.py rllm/experimental/common/preference.py rllm/experimental/unified_trainer.py rllm/experimental/verl/verl_backend.py rllm/trainer/tinker/tinker_backend.py tests/unified_trainer/test_preference.py`
- `git diff --check`
The full Verl-dependent test subset could not be run in this environment because the optional `verl` package is not installed locally.
Issue
Partially addresses #329.
This PR adds the strict DPO trainer groundwork and explicit support boundaries; a follow-up is still needed to wire a verified Verl-side DPO actor-loss path before end-to-end DPO training can be claimed as supported.
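For readers wondering what an "explicit support boundary" might look like in practice, the following is a minimal sketch of objective gating at the backend level. Everything here is hypothetical: the `Objective` enum, the `Backend` class, its `supported_objectives` attribute, and `check_objective_supported` are names invented for the example and are not taken from the rllm codebase.

```python
from enum import Enum


class Objective(str, Enum):
    """Mirrors the `rl | dpo` choice; the enum itself is an assumption."""
    RL = "rl"
    DPO = "dpo"


class Backend:
    """Hypothetical backend that only declares RL support."""
    name = "base"
    supported_objectives = frozenset({Objective.RL})


def check_objective_supported(backend, objective):
    """Fail fast if the backend has not declared support for the objective."""
    objective = Objective(objective)  # raises ValueError on unknown values
    if objective not in backend.supported_objectives:
        raise NotImplementedError(
            f"backend {backend.name!r} does not support "
            f"objective={objective.value!r}"
        )
    return objective
```

Under this kind of gate, selecting `objective=dpo` on a backend without a verified DPO loss path stops at startup with a clear error rather than training with the wrong objective.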