Could you please explain how GenRM's output is converted into rewards for RLHF? #8

@HITlgw

Description

Thank you for the excellent work on GenRM. I have a question regarding the training objective and reward mechanism, and I’d appreciate your insights.

According to my understanding, the GRPO training objective for GenRM in your work is to maximize the MetaJudge reward. This means GenRM's evaluation of responses A and B must not only produce the correct outcome (A > B) but also follow a correct comparison process. Under this objective, GenRM can accurately output the relative quality relationship between A and B.

My question is: How is this relative quality relationship between A and B converted into scalar reward signals for GRPO on downstream tasks such as Arena Hard v2?
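For concreteness, here is my rough guess at one common way such a conversion might work (purely illustrative; the reference-comparison scheme, reward mapping, and toy judge below are my own assumptions, not your implementation): each sampled rollout is compared against a fixed reference response, the pairwise verdict is mapped to a scalar reward, and GRPO then normalizes rewards within the sampled group.

```python
# Hypothetical sketch (NOT the repo's actual code): converting pairwise
# GenRM verdicts into per-response scalar rewards for GRPO.
# Assumes a judge(prompt, response_a, response_b) -> "A" | "B" | "tie".

from statistics import mean, pstdev


def pairwise_reward(judge, prompt, response, reference):
    """Reward one rollout by comparing it against a fixed reference response."""
    verdict = judge(prompt, response, reference)
    return {"A": 1.0, "tie": 0.5, "B": 0.0}[verdict]


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group normalization: advantage = (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy stand-in judge for illustration only: prefers the longer response.
def toy_judge(prompt, a, b):
    if len(a) > len(b):
        return "A"
    if len(a) < len(b):
        return "B"
    return "tie"


prompt = "Explain GRPO."
reference = "GRPO is an RL algorithm."
group = [
    "short",
    "A much longer and more detailed answer.",
    "GRPO is an RL algorithm.",  # same length as the reference -> tie
]
rewards = [pairwise_reward(toy_judge, prompt, r, reference) for r in group]
advantages = grpo_advantages(rewards)
```

Is something along these lines what happens, or does the reward come from a different mechanism (e.g., tournament-style win rates over multiple pairwise comparisons)?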

Thank you for your time and assistance.
