PREFORL

Reference implementation for Preference-based Policy Optimization from Sparse-reward Offline Dataset (ICLR 2026).

Highlights

  • PREFORL core training pipelines for Adroit, Maze2D, AntMaze, MuJoCo, and MetaWorld.
  • MetaWorld BC baseline entrypoint for core comparison results.
  • Unified JSON result outputs under a configurable result_dir.
  • Lightweight scripts for local result summarization.

Repository Layout

  • train_adroit.py - Adroit core training entrypoint.
  • train_maze2d.py - Maze2D core training entrypoint.
  • train_antmaze.py - AntMaze core training entrypoint.
  • train_mujoco.py - MuJoCo core training entrypoint.
  • train_metaworld.py - MetaWorld core training entrypoint.
  • train_metaworld_bc.py - MetaWorld BC core comparison entrypoint.
  • pnp/ - Core PREFORL trainers and method utilities.
  • envs/ - Environment wrappers and MetaWorld registrations.
  • scripts/summarize_results.py - Aggregates JSON metrics from result_dir.
  • results/ - Default output directory for run metrics.

Quick Start

  1. Install dependencies:

     pip install -r requirements.txt

  2. Run a short smoke test:

     python train_adroit.py \
       --env_name hammer-expert \
       --algorithm PNP \
       --num_algo_iters 2 \
       --PNP_num_epochs 1 \
       --result_dir results

  3. Summarize local outputs:

     python scripts/summarize_results.py --results_dir results

Each result JSON file stores per-iteration metrics under a metrics key, including iteration, max_return, episode_returns, and success_rate.
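For quick manual inspection without the summarizer, a minimal sketch is below. Only the metrics key and its per-iteration fields are documented above; the file name and the assumption that metrics is a list of per-iteration records are illustrative.

import json
from pathlib import Path

# Hypothetical file name; point this at whatever file a run writes into results/.
result_file = Path("results") / "hammer-expert_PNP.json"
with result_file.open() as f:
    run = json.load(f)

# Assumes run["metrics"] is a list of per-iteration records carrying the fields
# documented above; adjust the access pattern if the actual layout differs.
for record in run["metrics"]:
    print(record["iteration"], record["max_return"], record["success_rate"])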

Core Reproduction Entry Points

Adroit:

python train_adroit.py --env_name pen-expert --algorithm PNP --result_dir results

Maze2D:

python train_maze2d.py --env_name medium --algorithm PNP --result_dir results

AntMaze:

python train_antmaze.py --env_name medium-play-v1 --algorithm PNP --result_dir results

MuJoCo:

python train_mujoco.py --env_name halfcheetah-medium-expert --algorithm PNP --result_dir results

MetaWorld:

python train_metaworld.py --env_name hammer-v2 --result_dir results

MetaWorld BC (core comparison):

python train_metaworld_bc.py --env_name hammer-v2 --num_demos 50 --result_dir results
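To batch several of the commands above, a small driver like the following works. It is a convenience sketch, not part of the repo, and uses only flags shown in this README; both Adroit env names (pen-expert, hammer-expert) appear above.

import subprocess

# Sketch: reproduce the two documented Adroit runs back to back.
for env_name in ["pen-expert", "hammer-expert"]:
    subprocess.run(
        [
            "python", "train_adroit.py",
            "--env_name", env_name,
            "--algorithm", "PNP",
            "--result_dir", "results",
        ],
        check=True,  # abort the sweep if a run exits nonzero
    )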

Notes

  • Run outputs are written to result_dir (default: results).

Citation

If you use this code, please cite the ICLR 2026 paper:

@inproceedings{qiu2026preforl,
  title={Preference-based Policy Optimization from Sparse-reward Offline Dataset},
  author={Wenjie Qiu and Guofeng Cui and Shicheng Liu and Yuanlin Duan and He Zhu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}

License

Apache License 2.0. See LICENSE.

TODO

  • BC expert demo generation.
