* remove recovery from regression test
Summary:
- we currently do some validation on the training in the regression test
- forcing a recovery on the first step interferes with this because it makes the test non-deterministic: after the recovery, the replica takes a non-deterministic number of steps, which makes the gradients non-deterministic
- to fix this, perform a quorum inside the fake training loop for the regression test before doing any training
- we also need to increase the manager step count by 2, so we call should_commit twice: we have 2 fragments and we're testing numerics as if we started from step 0, and starting from step 2 gives us the same sync schedule for fragments as starting from step 0
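
The step-offset argument above can be illustrated with a small sketch. It assumes a simple round-robin rule in which the fragment synced at a given step is `step % num_fragments`; that rule and the helper name `fragment_schedule` are hypothetical and used only for illustration, and the real schedule may depend on other settings.

```python
def fragment_schedule(start_step, num_steps, num_fragments=2):
    """Return the fragment index synced at each step.

    Assumption (hypothetical): fragments are synced round-robin, so the
    fragment picked at a step is step % num_fragments.
    """
    return [
        step % num_fragments
        for step in range(start_step, start_step + num_steps)
    ]


# With 2 fragments, an offset of 2 steps lands on the same fragment order
# as starting from step 0, while an offset of 1 step does not.
from_zero = fragment_schedule(0, 6)
from_two = fragment_schedule(2, 6)
from_one = fragment_schedule(1, 6)
print(from_zero == from_two)
print(from_zero == from_one)
```

Under this assumption, any offset that is a multiple of the fragment count preserves the schedule, which is why advancing by exactly 2 steps keeps the numerics comparable to a run that starts at step 0.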
* fix compute/communication overlap for gloo
Summary:
- we currently wait for the pg work's future when preparing a fragment
- if we use gloo, this blocks the CPU
- move the wait call to when we perform the actual sync of the fragment
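
The idea of deferring the wait can be sketched with the standard library alone. This is not the code from this change: the class `Fragment`, its methods `prepare_sync` and `perform_sync`, and the simulated allreduce are all hypothetical stand-ins, with a thread pool playing the role of the process group.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class Fragment:
    """Hypothetical fragment that overlaps communication with compute."""

    def __init__(self, executor):
        self._executor = executor
        self._pending = None  # future for the in-flight communication

    @staticmethod
    def _fake_allreduce(values):
        # Stand-in for a collective; sleeps to simulate network latency.
        time.sleep(0.05)
        return sum(values)

    def prepare_sync(self, values):
        # Launch the communication and return immediately. Waiting here
        # would block the caller thread and prevent any overlap.
        self._pending = self._executor.submit(self._fake_allreduce, values)

    def perform_sync(self):
        # Wait only at the point where the result is actually needed.
        result = self._pending.result()
        self._pending = None
        return result


def demo():
    with ThreadPoolExecutor(max_workers=1) as pool:
        fragment = Fragment(pool)
        fragment.prepare_sync([1, 2, 3])
        # Local compute runs while the communication is in flight.
        local = sum(i * i for i in range(1000))
        return local, fragment.perform_sync()
```

Calling `demo()` runs the local computation between `prepare_sync` and `perform_sync`, so the simulated communication latency is hidden behind compute instead of stalling the caller during preparation.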