Reference implementation for Preference-based Policy Optimization from Sparse-reward Offline Dataset (ICLR 2026).
- PREFORL core training pipelines for Adroit, Maze2D, AntMaze, MuJoCo, and MetaWorld.
- MetaWorld BC baseline entrypoint for core comparison results.
- Unified JSON result outputs under a configurable `result_dir`.
- Lightweight scripts for local result summarization.
- `train_adroit.py` - Adroit core training entrypoint.
- `train_maze2d.py` - Maze2D core training entrypoint.
- `train_antmaze.py` - AntMaze core training entrypoint.
- `train_mujoco.py` - MuJoCo core training entrypoint.
- `train_metaworld.py` - MetaWorld core training entrypoint.
- `train_metaworld_bc.py` - MetaWorld BC core comparison entrypoint.
- `pnp/` - Core PREFORL trainers and method utilities.
- `envs/` - Environment wrappers and MetaWorld registrations.
- `scripts/summarize_results.py` - Aggregates JSON metrics from `result_dir`.
- `results/` - Default output directory for run metrics.
- Install dependencies:

```bash
pip install -r requirements.txt
```

- Run one short smoke job:

```bash
python train_adroit.py \
  --env_name hammer-expert \
  --algorithm PNP \
  --num_algo_iters 2 \
  --PNP_num_epochs 1 \
  --result_dir results
```

- Summarize local outputs:

```bash
python scripts/summarize_results.py --results_dir results
```

JSON files keep per-iteration metrics under `metrics`, including `iteration`, `max_return`, `episode_returns`, and `success_rate`.
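As a quick sanity check on a finished run, the sketch below loads one such JSON file and prints its per-iteration metrics. The key names follow this README; the filename is a hypothetical placeholder for whatever file your run actually wrote under `result_dir`.

```python
import json
from pathlib import Path

# Hypothetical filename; substitute an actual file from your result_dir.
run_file = Path("results") / "adroit_hammer-expert_PNP.json"

with run_file.open() as f:
    run = json.load(f)

# Per this README, per-iteration records live under "metrics" with keys
# "iteration", "max_return", "episode_returns", and "success_rate".
for record in run["metrics"]:
    returns = record["episode_returns"]
    mean_return = sum(returns) / len(returns)
    print(
        f"iter={record['iteration']:4d}  "
        f"max_return={record['max_return']:8.2f}  "
        f"mean_return={mean_return:8.2f}  "
        f"success_rate={record['success_rate']:.2%}"
    )
```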
Adroit:

```bash
python train_adroit.py --env_name pen-expert --algorithm PNP --result_dir results
```

Maze2D:

```bash
python train_maze2d.py --env_name medium --algorithm PNP --result_dir results
```

AntMaze:

```bash
python train_antmaze.py --env_name medium-play-v1 --algorithm PNP --result_dir results
```

MuJoCo:

```bash
python train_mujoco.py --env_name halfcheetah-medium-expert --algorithm PNP --result_dir results
```

MetaWorld:

```bash
python train_metaworld.py --env_name hammer-v2 --result_dir results
```

MetaWorld BC (core comparison):

```bash
python train_metaworld_bc.py --env_name hammer-v2 --num_demos 50 --result_dir results
```

- Run outputs are written to `result_dir` (default: `results`).
- Each run stores per-iteration metrics under `metrics`, including `iteration`, `max_return`, `episode_returns`, and `success_rate`.
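For a programmatic view across several runs, a minimal aggregation sketch is shown below. It is an independent example of reading these output files, not the actual implementation of `scripts/summarize_results.py`, and it assumes each run wrote one JSON file directly under `result_dir`.

```python
import json
from pathlib import Path

# Independent sketch of aggregating run metrics from result_dir; not the
# implementation of scripts/summarize_results.py. Assumes one JSON file
# per run directly under result_dir.
result_dir = Path("results")

for run_file in sorted(result_dir.glob("*.json")):
    with run_file.open() as f:
        run = json.load(f)
    metrics = run["metrics"]  # per-iteration records, per this README
    best_return = max(m["max_return"] for m in metrics)
    final_success = metrics[-1]["success_rate"]
    print(f"{run_file.name}: best max_return={best_return:.2f}, "
          f"final success_rate={final_success:.2%}")
```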
If you use this code, please cite the ICLR 2026 paper:
```bibtex
@inproceedings{qiu2026preforl,
  title={Preference-based Policy Optimization from Sparse-reward Offline Dataset},
  author={Wenjie Qiu and Guofeng Cui and Shicheng Liu and Yuanlin Duan and He Zhu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```

Apache License 2.0. See LICENSE.
- BC expert demo generation.