Thank you for the excellent work on GenRM. I have a question regarding the training objective and reward mechanism, and I’d appreciate your insights.
According to my understanding, the GRPO training objective for GenRM is to maximize the MetaJudge's reward. This means that when GenRM evaluates responses A and B, not only must the final verdict (e.g., A > B) be correct, but the comparison reasoning itself must also be correct. Under this training objective, GenRM learns to accurately output the relative quality relationship between A and B.
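To state my understanding as a formula (this is my own notation, not the paper's; please correct me if the MetaJudge produces a scalar score rather than a binary check):

$$ r(\text{judgment} \mid A, B) = \mathbb{1}[\text{verdict is correct}] \cdot \mathbb{1}[\text{MetaJudge approves the reasoning}] $$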
My question is: since GRPO requires a scalar reward for each sampled rollout, how is this pairwise quality relationship between A and B converted into per-response reward signals for downstream tasks such as Arena Hard v2?
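To make the question concrete, here is one conversion scheme I could imagine. Everything below, including the function names and the round-robin win-rate idea, is purely my own hypothetical sketch and not something taken from the paper:

```python
import itertools
import random  # stand-in randomness; a real version would call GenRM


def genrm_prefers_first(prompt: str, resp_a: str, resp_b: str) -> bool:
    """Hypothetical stand-in for a GenRM pairwise judgment.

    Returns True if GenRM judges resp_a > resp_b. A real implementation
    would query the generative reward model; here it is a coin flip.
    """
    return random.random() < 0.5


def pairwise_to_scalar_rewards(prompt: str, responses: list[str]) -> list[float]:
    """Turn pairwise preferences within one GRPO group into per-response
    scalar rewards by counting round-robin win rates."""
    n = len(responses)
    wins = [0] * n
    for i, j in itertools.combinations(range(n), 2):
        if genrm_prefers_first(prompt, responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each response faces (n - 1) opponents; its win rate becomes the
    # scalar reward that GRPO then normalizes within the group.
    return [w / (n - 1) for w in wins]


if __name__ == "__main__":
    group = ["rollout 1", "rollout 2", "rollout 3", "rollout 4"]
    print(pairwise_to_scalar_rewards("some Arena Hard v2 prompt", group))
```

Is the conversion roughly like this round-robin scheme within each GRPO group, or is each rollout instead compared against a fixed reference response to obtain its reward?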
Thank you for your time and assistance.