Commit 22b8fa1

fix compute/communication overlap for gloo (#240)
* remove recovery from regression test

  Summary:
  - we currently do some validation of the training numerics in the regression test
  - the forced recovery on the first step interferes with this because it makes the test non-deterministic; in particular, after the recovery the replica takes a non-deterministic number of steps, which makes the gradients non-deterministic
  - to fix this, perform a quorum inside the fake training loop of the regression test before doing any training
  - we also need to increase the manager step count by 2 (i.e. do 2 should_commit calls), because we have 2 fragments and we are testing numerics as if we started from step 0 -- starting from step 2 gives the fragments the same sync schedule as starting from step 0

* fix compute/communication overlap for gloo

  Summary:
  - we currently wait on the process group work's future when preparing a fragment
  - with gloo, this blocks the CPU
  - move the wait call to where we perform the actual sync of the fragment
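For the first change, the shape of the fix is roughly the sketch below. This is a hedged illustration, not the actual test code: it assumes torchft's `Manager` API (`start_quorum()`, `should_commit()`) and that two `should_commit()` calls are how the step count advances by 2.

```python
from torchft import Manager

def settle_before_training(manager: Manager) -> None:
    # Hypothetical helper, not the real regression test: run a quorum up
    # front so the forced first-step recovery happens before any of the
    # numerics being validated.
    manager.start_quorum()

    # Advance the manager step count by 2 (one should_commit per
    # fragment): starting from step 2 gives the two fragments the same
    # sync schedule as starting from step 0.
    for _ in range(2):
        manager.should_commit()
```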
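For the second change, here is a minimal standalone sketch of why the wait is deferred. It assumes a world-size-1 gloo group, and `prepare()`/`perform_sync()` are hypothetical helpers standing in for LocalSGD's fragment logic:

```python
import tempfile

import torch
import torch.distributed as dist

def prepare(grads):
    # Launch async allreduces but do NOT call work.wait() here: with
    # gloo, wait() blocks the calling CPU thread, which would serialize
    # the communication with the compute that follows.
    return [dist.all_reduce(g, async_op=True) for g in grads]

def perform_sync(pending):
    # Block only when the reduced values are actually needed, so the
    # compute between prepare() and perform_sync() overlaps with the
    # gloo communication.
    for work in pending:
        work.wait()

if __name__ == "__main__":
    store = dist.FileStore(tempfile.mktemp(), 1)
    dist.init_process_group("gloo", store=store, rank=0, world_size=1)
    grads = [torch.ones(4) for _ in range(2)]
    pending = prepare(grads)
    # ... local training compute for other fragments happens here ...
    perform_sync(pending)
    print(grads[0])  # unchanged with world_size=1
    dist.destroy_process_group()
```

With NCCL, `work.wait()` only inserts a dependency on the current CUDA stream, so the old placement was harmless; with gloo it parks the CPU thread, which is the behavior this commit fixes.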
1 parent f121e4a commit 22b8fa1

File tree

1 file changed: +12 −7 lines changed


torchft/local_sgd.py

Lines changed: 12 additions & 7 deletions
```diff
@@ -400,13 +400,6 @@ def prepare_sync(self) -> None:
         ):
             self._average_grads()
 
-        for work in self._allreduce_work:
-            work.wait()
-
-        if self._stream is not None:
-            self._stop_event = torch.cuda.Event()
-            self._stop_event.record()
-
     @torch.profiler.record_function("torchft::local_sgd::perform_sync")
     def perform_sync(self) -> bool:
         """
@@ -416,6 +409,18 @@ def perform_sync(self) -> bool:
         # Waiting for an allreduce before it has been sent is currently not supported.
         assert len(self._allreduce_work) > 0
 
+        with (
+            torch.cuda.stream(self._stream)
+            if self._stream is not None
+            else nullcontext()
+        ):
+            for work in self._allreduce_work:
+                work.wait()
+
+        if self._stream is not None:
+            self._stop_event = torch.cuda.Event()
+            self._stop_event.record()
+
         self.wait()
 
         # save the parameters so they can be used for merging
```
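In isolation, the pattern the added lines use looks like the sketch below. This is a minimal sketch, not the torchft code itself: `stream` and `pending_work` stand in for `self._stream` and `self._allreduce_work`, and `nullcontext` keeps the same code path valid on CPU-only gloo setups where there is no stream at all.

```python
from contextlib import nullcontext

import torch

# A side stream when CUDA is available, else None; with gloo on CPU
# there is no stream and nullcontext makes the `with` block a no-op.
stream = torch.cuda.Stream() if torch.cuda.is_available() else None

pending_work = []  # stand-in for the pending allreduce work handles

# Wait on the side stream so the default stream's compute is not
# serialized behind the communication; with gloo this wait blocks the
# CPU, which is why it belongs in perform_sync, not prepare_sync.
with torch.cuda.stream(stream) if stream is not None else nullcontext():
    for work in pending_work:
        work.wait()

# Record an event on the side stream so later consumers can synchronize
# against it instead of blocking immediately.
if stream is not None:
    stop_event = torch.cuda.Event()
    stop_event.record()
```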
