According to the paper, the simulator (environment) policy is also updated:
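Build the weighted env SFT set $\hat{D}^{t}_{\text{env}}$ from $T^{t}$ with weights $\propto \exp(r_{\text{env}}(\hat{\tau}))$ (equation 3).
Update the environment $\pi_{\text{env}}$ via RWR on $\hat{D}^{t}_{\text{env}}$ to maximize $\mathbb{E}[r_{\text{env}}]$ (equation 3).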
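To make clear what I am looking for, here is a minimal sketch of how I read the two steps above. It is only my interpretation, not code from the paper or this repo: the names `rwr_weights` and `weighted_sft_loss`, the temperature `beta`, and the toy rewards and log-probabilities are all assumptions for illustration.

```python
import math

def rwr_weights(rewards, beta=1.0):
    """Normalized weights proportional to exp(reward / beta).

    `beta` is a hypothetical temperature; beta=1.0 gives plain exp(reward).
    """
    m = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp((r - m) / beta) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def weighted_sft_loss(log_probs, weights):
    """Weighted negative log-likelihood over the env trajectories."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Toy data: one env reward and one env log-probability per trajectory.
env_rewards = [0.0, 1.0, 2.0]
env_log_probs = [-3.0, -2.0, -1.0]

weights = rwr_weights(env_rewards)
loss = weighted_sft_loss(env_log_probs, weights)
```

Minimizing a loss of this form on the env policy is what I would expect the RWR update to look like, with higher-reward trajectories contributing more to the gradient.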
I may have missed it, but could you please confirm whether this environment update is within the scope of this repo?
Thank you.