This is the official repository for the paper *Language Models Can Learn from Verbal Feedback Without Scalar Rewards*. It provides a training framework that implements Feedback Conditional Policy (FCP) for aligning large language models with verbal feedback.
- verl framework
- Set your `OPENAI_API_KEY` environment variable before training
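For example, export the key in the shell that launches training (the value below is a placeholder, not a real key):

```bash
# Export the OpenAI API key so that training processes
# launched from this shell inherit it.
export OPENAI_API_KEY="sk-..."
```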
For the SFT stage, use LLaMA-Factory's built-in SFT training code with the SFT datasets listed below.
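As a rough sketch, LLaMA-Factory's standard CLI takes a training config file; the config filename below is a hypothetical placeholder, not a file shipped in this repo:

```bash
# Hypothetical invocation of LLaMA-Factory's CLI for SFT;
# replace sft_fcp.yaml with your own config pointing at the SFT datasets below.
llamafactory-cli train sft_fcp.yaml
```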
For the FCP stage, run the verl training script:

```bash
./verl/recipe/fcp/run_fcp.sh
```

Configuration details can be found in `verl/recipe/fcp/config/fcp_trainer.yaml`.
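For orientation, verl trainer configs commonly expose fields like the ones below. This is an illustrative sketch with placeholder values based on verl's standard trainer config layout; the authoritative keys are those in `fcp_trainer.yaml` itself:

```yaml
# Illustrative sketch of common verl-style trainer fields (placeholder values);
# consult verl/recipe/fcp/config/fcp_trainer.yaml for the actual keys.
data:
  train_files: path/to/fcp_train.parquet  # placeholder dataset path
  train_batch_size: 512
actor_rollout_ref:
  model:
    path: path/to/sft_checkpoint          # e.g., the model produced by the SFT stage
trainer:
  n_gpus_per_node: 8
  total_epochs: 1
```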
We use different frameworks and datasets for different training stages:

SFT stage:
- Framework: LLaMA-Factory
- Datasets:

FCP stage:
- Framework: verl
- Datasets:
If you find this code useful, please consider citing our paper:
```bibtex
@article{luo2025languagemodelslearnverbal,
  title={Language Models Can Learn from Verbal Feedback Without Scalar Rewards},
  author={Renjie Luo and Zichen Liu and Xiangyan Liu and Chao Du and Min Lin and Wenhu Chen and Wei Lu and Tianyu Pang},
  journal={arXiv preprint arXiv:2509.22638},
  year={2025}
}
```