fix compute/communication overlap for gloo #240
base: main
Conversation
Force-pushed from dbec11e to 5228ee8
Force-pushed from c93ad11 to bfb92ff
torchft/manager.py (outdated)
@@ -411,7 +412,7 @@ def callback(
         fut = fut.then(callback)

         fut = self.wrap_future(fut, tensor)
-        return fut
+        return (work, fut)
This makes me a bit nervous: calling only `work.wait()` means the code in the future callback may not have run yet -- i.e. `tensor /= num_participants` may execute out of order.
The advantage of future objects here is that the callback runs on the completing thread, whereas `.wait()` runs on the waiting thread. I'm wondering if we should wrap these futures back into Work objects so we can set up the stream dependency correctly in `.wait()`.
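A minimal sketch of the ordering concern (not the torchft code; it assumes an initialized process group whose backend supports `Work.get_future()`, which gloo does in recent PyTorch): `work.wait()` only covers the collective itself, while the division lives in a chained future.

```python
import torch
import torch.distributed as dist

def averaged_allreduce(tensor: torch.Tensor, num_participants: int):
    # Assumes dist.init_process_group(...) has already been called.
    work = dist.all_reduce(tensor, async_op=True)
    fut = work.get_future()

    def callback(f: torch.futures.Future) -> torch.Tensor:
        # Runs on the thread that completes the collective.
        t = f.value()[0]
        t /= num_participants
        return t

    chained = fut.then(callback)
    # work.wait() only waits for the collective; the division inside
    # `callback` may not have run yet when it returns. Waiting on `chained`
    # (or wrapping it back into a Work-like object) is what guarantees the
    # callback has finished.
    return work, chained
```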
Force-pushed from 922a415 to fa35a6d
Force-pushed from 2b43bfc to dc7b2e0
Summary: use the HTTP transport instead of the PG transport -- the PG transport fails to resolve the address when running locally
Force-pushed from 4964b72 to 15b0cb0
Force-pushed from 405dc6e to 91207a2
Summary: deep copy the state dict when sending a checkpoint, because if the replica moves on to the next step, the state dict can change before the checkpoint is sent
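A minimal sketch of the deep-copy rationale (the function name is illustrative, not the torchft API): snapshot the state dict before handing it to the checkpoint transfer so that the replica advancing to the next step cannot mutate the tensors still being sent.

```python
import copy
import torch

def snapshot_for_checkpoint(module: torch.nn.Module) -> dict:
    # state_dict() returns references to the live parameter tensors, so a
    # subsequent optimizer step would change the values mid-transfer.
    # deepcopy detaches the snapshot from ongoing training.
    return copy.deepcopy(module.state_dict())
```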
Summary:
- call `future.wait()` in callbacks to make sure the continuation executes after the future has completed
- set the stream correctly to execute the callback scheduled by bucketized allreduce
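A hedged sketch of that callback pattern (illustrative, not the torchft implementation): the continuation calls `fut.wait()` first so it only proceeds once the upstream future has completed, and pins the CUDA stream so any work it enqueues is ordered after the collective.

```python
import torch

def chain_divide(fut: torch.futures.Future, stream: torch.cuda.Stream,
                 num_participants: int) -> torch.futures.Future:
    def callback(f: torch.futures.Future) -> torch.Tensor:
        f.wait()  # ensure the upstream future has actually completed
        with torch.cuda.stream(stream):
            # Work queued here is ordered on `stream`, after the collective.
            t = f.value()[0]
            t /= num_participants
        return t

    return fut.then(callback)
```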
Summary: return the work object so we can be more flexible with how it is used
Summary:
- we currently wait for the PG work's future when preparing a fragment
- if we use gloo, this blocks the CPU
- move the wait call to when we perform the actual sync of the fragment
- since we still call `work.wait()` in the allreduce call itself, this doesn't completely fix the problem
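A hedged sketch of that restructuring (class and method names are hypothetical, not the torchft API): keep the `Work` handle when a fragment is prepared and only call `wait()` at the point where the fragment is actually synced, so gloo's blocking wait no longer stalls the CPU during preparation.

```python
import torch
import torch.distributed as dist

class FragmentSync:
    """Illustrative only: defer blocking waits from prepare() to sync()."""

    def __init__(self) -> None:
        self._pending = []  # pending Work handles

    def prepare(self, tensor: torch.Tensor) -> None:
        # Kick off the collective but do not block the CPU here.
        work = dist.all_reduce(tensor, async_op=True)
        self._pending.append(work)

    def sync(self) -> None:
        # Block only when the reduced results are actually needed.
        for work in self._pending:
            work.wait()
        self._pending.clear()
```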
Stack created with Sapling. Best reviewed with ReviewStack.