
fix compute/communication overlap for gloo #240


Merged: 2 commits into pytorch:main from the pr240 branch, Aug 2, 2025

Conversation

@tushar00jain (Contributor) commented Jul 22, 2025

Summary:

  • we currently wait for the pg work's future when preparing a fragment
  • with gloo, this wait blocks the CPU
  • move the wait call to the point where we perform the actual sync of the fragment (sketched below)

Stack created with Sapling. Best reviewed with ReviewStack.
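
A minimal sketch of the change described in the summary, assuming torch.distributed async collectives; `Fragment`, `prepare_fragment`, and `sync_fragment` are hypothetical names for illustration, not the actual torchft API:

```python
from typing import Optional

import torch
import torch.distributed as dist


class Fragment:
    """Hypothetical container for a fragment's tensor and in-flight work."""

    def __init__(self, tensor: torch.Tensor) -> None:
        self.tensor = tensor
        self.pending_work: Optional[dist.Work] = None


def prepare_fragment(frag: Fragment, group: Optional[dist.ProcessGroup] = None) -> None:
    # Previously the future was waited on here; with gloo that blocks the
    # CPU thread. Now we only launch the collective and keep the handle.
    frag.pending_work = dist.all_reduce(frag.tensor, group=group, async_op=True)


def sync_fragment(frag: Fragment) -> None:
    # The wait moves here, where the fragment's result is actually needed,
    # so CPU-side compute can overlap with gloo communication in between.
    if frag.pending_work is not None:
        frag.pending_work.wait()
        frag.pending_work = None
```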

@facebook-github-bot added the CLA Signed label Jul 22, 2025
@tushar00jain marked this pull request as draft July 22, 2025 22:11
@tushar00jain force-pushed the pr240 branch 4 times, most recently from dbec11e to 5228ee8, July 24, 2025 21:53
@tushar00jain changed the title from "use block_current_stream work api" to "wait for futures while syncing fragments" Jul 24, 2025
@tushar00jain changed the title from "wait for futures while syncing fragments" to "use block_current_stream work api" Jul 24, 2025
@tushar00jain changed the title from "use block_current_stream work api" to "wait for futures while syncing fragments" Jul 24, 2025
@tushar00jain force-pushed the pr240 branch 14 times, most recently from c93ad11 to bfb92ff, July 25, 2025 21:20
@tushar00jain marked this pull request as ready for review July 25, 2025 21:21
@tushar00jain requested a review from d4l3k July 25, 2025 21:21
Summary:
- we currently do some validation of the training in the regression test
- the forced recovery on the first step interferes with this because it makes the test non-deterministic: after the recovery, the replica takes a non-deterministic number of steps, which makes the gradients non-deterministic
- to fix this, perform a quorum inside the fake training loop before doing any training (see the sketch after this list)
- we also need to increase the manager step count by 2, i.e. call should_commit twice, because we have 2 fragments and we're testing numerics as if we started from step 0; with 2 fragments, starting from step 2 gives the same fragment sync schedule as starting from step 0
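
A hedged sketch of that test-side change, not the actual torchft regression test; `manager`, `model`, and `fake_data` are assumed stand-ins for the test's objects, and `start_quorum`/`should_commit` mirror a torchft-style Manager:

```python
# Run a quorum before any training so the forced recovery on the first
# step happens up front and cannot perturb the gradients mid-test.
manager.start_quorum()

# Advance the manager step count by 2: one should_commit per fragment.
# With 2 fragments, starting from step 2 reproduces the same fragment
# sync schedule as starting from step 0, keeping the step-0 numeric
# assertions valid.
for _ in range(2):
    manager.should_commit()

# Deterministic fake training loop whose gradients the test validates.
for batch in fake_data:
    loss = model(batch).sum()
    loss.backward()
```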
Summary:
- we currently wait for the pg work's future when preparing a fragment
- with gloo, this blocks the CPU
- move the wait call to the point where we perform the actual sync of the fragment
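
For context (my reading, not text from the PR): with nccl, `Work.wait()` only makes the current CUDA stream wait on the collective, while with gloo it blocks the calling CPU thread, which is why an eager wait during fragment preparation serialized compute and communication. The `block_current_stream` Work API mentioned in the title history blocks the stream without blocking the CPU thread, assuming a PyTorch build that provides it. A runnable single-process gloo sketch:

```python
import os

import torch
import torch.distributed as dist

# Single-process gloo group so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(1024)

# async_op=True returns a Work handle instead of blocking.
work = dist.all_reduce(t, async_op=True)

# ... CPU-side preparation of the next fragment could proceed here ...

# On gloo, wait() blocks this CPU thread until the allreduce finishes;
# deferring it to the fragment's sync point restores the overlap.
work.wait()

dist.destroy_process_group()
```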
@tushar00jain merged commit 22b8fa1 into pytorch:main Aug 2, 2025
14 checks passed
@tushar00jain deleted the pr240 branch August 2, 2025 00:11