setup stream dependencies inside work wrapper #248
Conversation
torchft/manager.py (Outdated)

```python
        return True

    def get_future(self) -> torch.futures.Future[torch.Tensor]:
        self.wait()
```
.wait() should be a blocking call, we probably want to invert this logic and make .wait() call get_future() instead
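A minimal sketch of the suggested inversion, assuming a wrapper around a process-group work object (the class and field names here are illustrative, not torchft's actual code):

```python
import torch
import torch.distributed as dist


class _WorkWrapperSketch:
    """Illustrative only: .wait() becomes the blocking entry point,
    built on top of a non-blocking .get_future()."""

    def __init__(self, work: dist.Work) -> None:
        self._work = work

    def get_future(self) -> torch.futures.Future:
        # Non-blocking: hand back the future for callback chaining.
        return self._work.get_future()

    def wait(self) -> bool:
        # Blocking: wait on the future, rather than having
        # get_future() call wait().
        self.get_future().wait()
        return True
```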
Yep, `.wait()` and `.get_future()` should only be called when we want to block. For DiLoCo, it's called at the sync step. IIUC the manager's allreduce method doesn't get called from HSDP, so it's unrelated there. This also made me realize we call `.get_future()` for bucketized allreduce, though, where we don't want to block. Thinking we can pass the callback to the manager's allreduce for that. I guess it's not nice API-wise to block in this method, so maybe we need option 1, but it seems to have some issues.
`.wait()` already calls `.get_future()`; we want to make sure that when users of this API call `.get_future()`, we've already set up the stream dependencies for the work and the tensor division.
`get_future` now calls `block_current_stream`.
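A hedged sketch of what that change looks like (`self._fut` is an assumed field holding the work's future, not necessarily the actual attribute name):

```python
def get_future(self) -> torch.futures.Future[torch.Tensor]:
    # Set up stream dependencies (and the post-allreduce division)
    # before handing the future to callers, so their chained
    # callbacks are ordered after the collective.
    self.block_current_stream()
    return self._fut  # assumed field holding the work's future
```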
```python
        self._is_set_future_callback_called = False

    def _set_future_callback(
```
Why does this need to live in the Work object? Can't we pass the stream + the future to the _WorkWrapper and have it manage things correctly?
That would be ideal but doesn't work in all cases, I think:
- For NCCL, we need to call `work.wait()` before doing everything in `_set_future_callback()`, otherwise the stream dependency isn't hooked up in the right order, i.e. we could end up calling `future.wait` before `work.wait`.
- For CPU, we can't call `work.wait()` because that'll block.

These two conflict with each other, so this is what I came up with (see the sketch below).
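A rough illustration of the conflict, assuming a CUDA-vs-CPU split on a tensor field (`self._tensor` is a placeholder; this is not the actual torchft implementation):

```python
def _set_future_callback(self) -> None:
    # Illustrative sketch only.
    if self._is_set_future_callback_called:
        return
    if self._tensor.is_cuda:
        # NCCL path: establish the stream ordering first so future
        # callbacks run after the collective. On CUDA this does not
        # block the host thread.
        self._work.wait()
    # CPU (Gloo) path: skip work.wait(), which would block the host;
    # future callbacks already run only after the work completes.
    self._fut = self._work.get_future()
    self._is_set_future_callback_called = True
```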
@tushar00jain you can use `work.synchronize()` to set up the dependency in a guaranteed non-blocking way.
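For reference, a minimal usage sketch (`pg` and `tensor` are placeholders):

```python
# work.synchronize() enqueues the dependency on the current CUDA
# stream without blocking the host thread, unlike work.wait() on a
# CPU backend.
work = pg.allreduce([tensor])
work.synchronize()       # non-blocking stream-dependency setup
fut = work.get_future()  # callbacks now ordered after the collective
```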
@d4l3k that's for NCCL, right? For NCCL, you mentioned `block_current_stream` also just calls `synchronize`, but it also works in a non-blocking way for Gloo. We needed `block_current_stream` for that because I'm guessing `synchronize` doesn't do that for Gloo.
Also, based on our discussion offline, the current APIs work for all cases and have the same semantics as the underlying process group work (a consolidated sketch follows this list):
- In torchft, we only ever use `work.wait()`, and we call it only when we need to synchronize.
  - For NCCL, and Gloo with CUDA, this sets up stream dependencies properly with a custom stream that we synchronize on to wait for the allreduce to finish, along with the future associated with that work.
  - For Gloo with CPU, it just blocks until the work is done. The future callbacks run after the work is done.
- That was a lie: we also call `work.get_future()` in bucketized allreduce.
  - In this case we call `block_current_stream` first to set up the stream dependency for NCCL (just a proxy to `work.synchronize`) and for Gloo with CUDA. We also add a callback to the future chain, but carefully set up its stream dependency after all the other stream dependencies have been set up; that's why we call `block_current_stream` in `get_future` anyway.
  - For Gloo with CPU, it doesn't call anything on the work, because futures run after the work is done anyway.
- We will call `work.block_current_stream` for HSDP in torchtitan; this is pretty much the same as the case above for bucketized allreduce.
- For DDP, we call `get_future` but don't expect users to do anything besides calling `.wait` on that future.

In the future:
- We can consider creating our own future, instead of using `torch.futures.Future`, that sets up stream dependencies the way we want.
- Consider simplifying the implementation of `_ManagedWork` (the above will also help us do that).
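A consolidated sketch of those semantics, assuming the collective ran on a side stream when on CUDA (`_ManagedWorkSketch` is illustrative, not torchft's actual `_ManagedWork`):

```python
from datetime import timedelta
from typing import Optional

import torch
import torch.distributed as dist


class _ManagedWorkSketch:
    """Illustrative only; mirrors the semantics described above."""

    def __init__(self, work: dist.Work, stream: Optional[torch.cuda.Stream]) -> None:
        self._work = work
        self._stream = stream  # side stream the collective ran on, if CUDA

    def block_current_stream(self, timeout: Optional[timedelta] = None) -> None:
        # NCCL / Gloo-with-CUDA: order the caller's current stream after
        # the collective without blocking the host thread. No-op on CPU.
        if self._stream is not None:
            self._work.synchronize()

    def get_future(self) -> torch.futures.Future:
        # Set up stream dependencies before exposing the future so that
        # callbacks chained by callers run after the collective.
        self.block_current_stream()
        return self._work.get_future()

    def wait(self, timeout: Optional[timedelta] = None) -> bool:
        # Gloo on CPU: blocks the host until the work is done.
        # CUDA backends: blocks the stream(s) on the collective.
        self._work.wait()
        return True
```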
```python
        self._is_set_future_callback_called = True

    def wait(self, timeout: Optional[timedelta] = None) -> bool:
```
.wait() should set a dependency between the work and the current stream -- it looks like we're running all operations on self._stream?
accepting to unblock -- this seems like it will work for our current use cases
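For context, a hedged sketch of the dependency the comment above asks for, assuming the collective ran on a side stream `self._stream` (a method fragment, not the actual implementation):

```python
def wait(self, timeout: Optional[timedelta] = None) -> bool:
    # Instead of running everything on self._stream, make the
    # caller's *current* stream wait on it, so subsequent ops the
    # caller enqueues are ordered after the collective.
    if self._stream is not None:
        torch.cuda.current_stream().wait_stream(self._stream)
    else:
        self._work.wait()  # CPU backend: block the host
    return True
```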
```python
        return True

    def block_current_stream(self, timeout: Optional[timedelta] = None) -> None:
```
we probably shouldn't rely on this until we've thought this through more / tested
Yeah, we can test it more before we change the HSDP implementation. I think we can also do some other alternative for bucketized allreduce and DDP without having to use `block_current_stream`.
Summary:
- extend the work wrapper object to also do the division post-allreduce
- add an API to `block_current_stream` on the work wrapper so it can be used for HSDP
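A hedged sketch of folding the post-allreduce division into the wrapper via a future callback (`num_participants` and `work` are placeholders for however the manager tracks the live replica count and the underlying work):

```python
# Illustrative only: average instead of sum by dividing after the
# allreduce completes, chained on the work's future.
def _divide(fut: torch.futures.Future) -> torch.Tensor:
    tensor = fut.value()[0]     # allreduce futures carry a list of tensors
    tensor /= num_participants  # placeholder for the live replica count
    return tensor

fut = work.get_future().then(_divide)
```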
Stack created with Sapling. Best reviewed with ReviewStack.