How is RWR implemented?

According to the paper, the simulator (environment) policy is updated as follows:

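> Build weighted env SFT set $\tilde{D}^{t}_{\text{env}}$ from $T^{t}$ with weights $\propto \exp(\lambda R_{\text{env}}(\hat{p}))$ (equation 3).
> Update environment $\pi_{\text{env}}$ via RWR on $\tilde{D}^{t}_{\text{env}}$ to maximize $\mathbb{E}[R_{\text{env}}]$ (equation 3).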
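
To make clear what I am looking for, here is a minimal sketch of how I read that step. It is my own interpretation, not code from the paper or this repo. The names `rwr_weights`, `rwr_loss`, `lam`, and `log_probs` are hypothetical, and normalizing the weights to sum to 1 is an assumption, since the paper only states that they are proportional to the exponentiated reward.

```python
import math

def rwr_weights(rewards, lam=1.0):
    """Weights proportional to exp(lam * reward), normalized to sum to 1.

    Normalization is an assumption; the paper only states proportionality.
    """
    scaled = [lam * r for r in rewards]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    unnorm = [math.exp(s - m) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def rwr_loss(log_probs, rewards, lam=1.0):
    """Reward-weighted negative log-likelihood over one batch.

    log_probs[i] is assumed to be the environment policy's log-likelihood
    of the i-th sampled output, and rewards[i] its environment reward.
    """
    weights = rwr_weights(rewards, lam)
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

With equal rewards this reduces to the ordinary mean SFT loss, and a larger `lam` concentrates the weight on the highest-reward samples.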

Maybe I'm blind, but I could not find this update in the code. Could you please confirm whether it is in the scope of this repo?

Thank you