workarounds for all2all autograd issues that Ruisi ran into #1604

bdhirsh · 2025-08-20T02:42:51Z

The idea behind this PR is that there appears to be an issue in the way that dynamo is inlining the _A2A.apply used in EP:

(1) @ruisizhang123 / @anijain2305 noticed that the MoE code wasn't getting a proper backward graph generated for it when being run with compile (notes here: https://docs.google.com/document/d/1l1PfSrhJRXRLa1usBsEh4YSidgJiDU_eVNJ01p7PetM/edit?tab=t.0), running on tip-of-main with this PR, and this command: CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --model.name deepseekv3_simple_fsdp --profiling.enable_profiling --parallelism.expert_parallel_degree 2 --training.compile

(2) old tlparse showed that there are no aten._grouped_mm calls in the backward graph, showing that we are not properly differentiating the MoE code (tlparse)

if you look at the dynamo graph in that tlparse, though, you'll see nodes for _c10d_functional.all_to_all_single and _c10d_functional.wait_tensor. This is a problem - the dynamo graph should be capturing the entire autograd.Function as a HOP, since those two ops are not differentiable.

(3) I didn't root cause why dynamo is "ignoring" the autograd.Function. I have some suspicion about the fact that the autograd.Function in question accepts a ProcessGroup.

Instead, I just (a) wrote a slightly different version of the autograd.Function that doesn't accept ProcessGroup, and (b) allow-in-graph'd it for good measure, to guarantee that dynamo won't incorrectly inline it.

As a followup, we should either debug why dynamo is not constructing a HOP for the autograd.Function, or kill the _A2A.apply code entirely as per this comment

Here's a good tlparse with my changes, where you can see proper aten._grouped_mm nodes in the backward graph: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/9b03e768-73a1-4154-b7af-dcd79cfc754d/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Stack from ghstack (oldest at bottom):

-> workarounds for all2all autograd issues that Ruisi ran into #1604

[ghstack-poisoned]

ghstack-source-id: 800eed7 Pull Request resolved: #1604

The idea behind this PR is that there appears to be an issue in the way that dynamo is inlining the `_A2A.apply` used in EP: (1) ruisizhang123 / anijain2305 noticed that the MoE code wasn't getting a proper backward graph generated for it when being run with compile (notes here: https://docs.google.com/document/d/1l1PfSrhJRXRLa1usBsEh4YSidgJiDU_eVNJ01p7PetM/edit?tab=t.0), running on tip-of-main with this [PR](https://github.com/pytorch/torchtitan/pull/1529/files), and this command: `CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --model.name deepseekv3_simple_fsdp --profiling.enable_profiling --parallelism.expert_parallel_degree 2 --training.compile` (2) old tlparse showed that there are no `aten._grouped_mm` calls in the backward graph, showing that we are not properly differentiating the MoE code ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpxMfgwf/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)) if you look at the dynamo graph in that tlparse, though, you'll see nodes for `_c10d_functional.all_to_all_single` and `_c10d_functional.wait_tensor`. This is a problem - the dynamo graph should be capturing the entire autograd.Function as a HOP, since those two ops are not differentiable. (3) I didn't root cause why dynamo is "ignoring" the `autograd.Function`. I have some suspicion about the fact that the autograd.Function in question accepts a `ProcessGroup`. Instead, I just (a) wrote a slightly different version of the autograd.Function that doesn't accept ProcessGroup, and (b) allow-in-graph'd it for good measure, to guarantee that dynamo won't incorrectly inline it. As a followup, we should either debug why dynamo is not constructing a HOP for the autograd.Function, or kill the _A2A.apply code entirely as per this [comment](#1467 (comment)) Here's a good tlparse with my changes, where you can see proper `aten._grouped_mm` nodes in the backward graph: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/9b03e768-73a1-4154-b7af-dcd79cfc754d/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 [ghstack-poisoned]

ghstack-source-id: 035c32c Pull Request resolved: #1604

xmfan · 2025-08-20T14:25:50Z

@bdhirsh I have a fix for the root cause here: pytorch/pytorch#161036

@anijain2305

…1036) Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (requires_grad) to determine whether to HOP or not: https://github.com/pytorch/pytorch/blob/5ee464db5c4293ac09521f9069fa7d2106680a7f/torch/_dynamo/variables/misc.py#L671 Stared at this with @anijain2305 yesterday, `Tensor.__setitem__` can update tensor metadata, and we can just run the fake prop and extract the output metadata from the updated FakeTensor. FIXES #160901 It should also be the root cause behind the issue in pytorch/torchtitan#1604 @bdhirsh @ruisizhang123 Pull Request resolved: #161036 Approved by: https://github.com/anijain2305 ghstack dependencies: #160805

workarounds for all2all autograd issues that Ruisi ran into

221bdaf

[ghstack-poisoned]

bdhirsh requested review from tianyu-l, fegin, wwwjn and wconstab as code owners August 20, 2025 02:42

bdhirsh added a commit that referenced this pull request Aug 20, 2025

workarounds for all2all autograd issues that Ruisi ran into

7854389

ghstack-source-id: 800eed7 Pull Request resolved: #1604

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 20, 2025

bdhirsh added a commit that referenced this pull request Aug 20, 2025

workarounds for all2all autograd issues that Ruisi ran into

8bb6371

ghstack-source-id: 035c32c Pull Request resolved: #1604

tianyu-l mentioned this pull request Aug 20, 2025

Wrap sync + a2a in a custom op #1597

Open

xmfan mentioned this pull request Aug 20, 2025

[dynamo] propagate tensor metadata on Tensor.__setitem__(tensor) pytorch/pytorch#161036

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

workarounds for all2all autograd issues that Ruisi ran into #1604

workarounds for all2all autograd issues that Ruisi ran into #1604

Uh oh!

bdhirsh commented Aug 20, 2025 •

edited

Loading

Uh oh!

xmfan commented Aug 20, 2025 •

edited

Loading

Uh oh!

Uh oh!

workarounds for all2all autograd issues that Ruisi ran into #1604

Are you sure you want to change the base?

workarounds for all2all autograd issues that Ruisi ran into #1604

Uh oh!

Conversation

bdhirsh commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xmfan commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

bdhirsh commented Aug 20, 2025 •

edited

Loading

xmfan commented Aug 20, 2025 •

edited

Loading