Thank you for the excellent work on GenRM. I have a question regarding the training objective and reward mechanism, and I’d appreciate your insights.
According to my understanding, the GRPO training objective for GenRM is to maximize the MetaJudge's reward. This means that when GenRM evaluates responses A and B, not only must the final verdict (e.g., A > B) be correct, but the comparison reasoning itself must also be correct. Under this training objective, GenRM learns to accurately output the relative quality relationship between A and B.
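To state my understanding as a formula (this is my own notation, not the paper's; please correct me if the MetaJudge produces a scalar score rather than a binary check):

$$ r(\text{judgment} \mid A, B) = \mathbb{1}[\text{verdict is correct}] \cdot \mathbb{1}[\text{MetaJudge approves the reasoning}] $$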
My question is: since GRPO requires a scalar reward for each sampled rollout, how is this pairwise quality relationship between A and B converted into per-response reward signals for downstream tasks such as Arena Hard v2?
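To make the question concrete, here is one conversion scheme I could imagine. Everything below, including the function names and the round-robin win-rate idea, is purely my own hypothetical sketch and not something taken from the paper:

```python
import itertools
import random  # stand-in randomness; a real version would call GenRM


def genrm_prefers_first(prompt: str, resp_a: str, resp_b: str) -> bool:
    """Hypothetical stand-in for a GenRM pairwise judgment.

    Returns True if GenRM judges resp_a > resp_b. A real implementation
    would query the generative reward model; here it is a coin flip.
    """
    return random.random() < 0.5


def pairwise_to_scalar_rewards(prompt: str, responses: list[str]) -> list[float]:
    """Turn pairwise preferences within one GRPO group into per-response
    scalar rewards by counting round-robin win rates."""
    n = len(responses)
    wins = [0] * n
    for i, j in itertools.combinations(range(n), 2):
        if genrm_prefers_first(prompt, responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each response faces (n - 1) opponents; its win rate becomes the
    # scalar reward that GRPO then normalizes within the group.
    return [w / (n - 1) for w in wins]


if __name__ == "__main__":
    group = ["rollout 1", "rollout 2", "rollout 3", "rollout 4"]
    print(pairwise_to_scalar_rewards("some Arena Hard v2 prompt", group))
```

Is the conversion roughly like this round-robin scheme within each GRPO group, or is each rollout instead compared against a fixed reference response to obtain its reward?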
Thank you for your time and assistance.