Skip to content

Commit f5eb999

Browse files
committed
Update README.md
1 parent f6b8e08 commit f5eb999

File tree

1 file changed

+1
-1
lines changed

1 file changed

+1
-1
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -36,7 +36,7 @@ Aligned is faster in wall‑time (fewer sibling evaluations per node), while cro
3636

3737
### Advantage
3838

39-
Advantages are used to optimize the agents policies, which use a mean baseline without any standard‑deviation normalization to make training unbiased (see [Dr. GRPO](https://arxiv.org/pdf/2503.20783)). We do not apply importance sampling ratios either, since our training is in an on-policy manner (the same policy is used for sampling and training).
39+
Advantages are used to optimize the agents policies, which use a mean baseline without any standard‑deviation normalization to make training unbiased (see [Dr. GRPO](https://arxiv.org/pdf/2503.20783)). We do not apply importance sampling ratios either, since our training is in an on-policy manner (hence also no need for epsilon clipping).
4040

4141
### Number of Samples
4242

0 commit comments

Comments
 (0)