bdhirsh commented Aug 20, 2025

The idea behind this PR is that there appears to be an issue in the way that dynamo is inlining the `_A2A.apply` used in expert parallelism (EP):

(1) @ruisizhang123 / @anijain2305 noticed that the MoE code wasn't getting a proper backward graph generated for it when run with compile (notes here: https://docs.google.com/document/d/1l1PfSrhJRXRLa1usBsEh4YSidgJiDU_eVNJ01p7PetM/edit?tab=t.0), running on tip-of-main with this [PR](https://github.com/pytorch/torchtitan/pull/1529/files), and this command: `CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --model.name deepseekv3_simple_fsdp --profiling.enable_profiling --parallelism.expert_parallel_degree 2 --training.compile`

(2) An older tlparse showed that there are no `aten._grouped_mm` calls in the backward graph, showing that we are not properly differentiating the MoE code ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpxMfgwf/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000))

If you look at the dynamo graph in that tlparse, though, you'll see nodes for `_c10d_functional.all_to_all_single` and `_c10d_functional.wait_tensor`. This is a problem: the dynamo graph should be capturing the entire `autograd.Function` as a HOP, since those two ops are not differentiable.
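To see why inlining past the collective is wrong, here is a minimal sketch (all names hypothetical, with a local non-differentiable op standing in for `all_to_all_single`): gradients only flow because the `autograd.Function` supplies a hand-written backward, so a tracer that inlines just the forward would lose the grad path entirely.

```python
import torch

class NonDiffForward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # stand-in for _c10d_functional.all_to_all_single: this op
        # detaches from the autograd tape, just like the collective
        return (x * 2).detach()

    @staticmethod
    def backward(ctx, grad_out):
        # the hand-written backward is the only grad path
        return grad_out * 2

x = torch.randn(3, requires_grad=True)
y = NonDiffForward.apply(x)
y.sum().backward()
assert torch.allclose(x.grad, torch.full_like(x, 2.0))

# the "inlined forward" equivalent has no grad path at all:
assert not (x.detach() * 2).requires_grad
```

This is why the Function must be captured wholesale as a HOP: node-by-node inlining of a forward that contains non-differentiable ops silently drops the backward.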

(3) I didn't root-cause why dynamo is "ignoring" the `autograd.Function`. I have some suspicion it's because the `autograd.Function` in question accepts a `ProcessGroup`.

Instead, I just (a) wrote a slightly different version of the `autograd.Function` that doesn't accept a `ProcessGroup`, and (b) `allow_in_graph`'d it for good measure, to guarantee that dynamo won't incorrectly inline it.
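The shape of that workaround can be sketched as follows (hypothetical names, with a toy op standing in for the all-to-all; the real torchtitan code differs): the Function takes only tensors, and its entry point is marked with `torch._dynamo.allow_in_graph` so dynamo treats the call as a single opaque graph node rather than inlining its body.

```python
import torch
import torch._dynamo

class _A2AStub(torch.autograd.Function):
    # hypothetical stand-in: takes only tensors, no ProcessGroup argument
    @staticmethod
    def forward(ctx, x):
        return x * 2  # stand-in for the forward all-to-all

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2  # stand-in for the reverse all-to-all

@torch._dynamo.allow_in_graph
def a2a_stub(x):
    # allow_in_graph tells dynamo not to trace into this call
    return _A2AStub.apply(x)

x = torch.randn(3, requires_grad=True)
a2a_stub(x).sum().backward()
assert torch.allclose(x.grad, torch.full_like(x, 2.0))
```

In eager mode `allow_in_graph` is a no-op; it only changes how dynamo handles the call under `torch.compile`.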

As a follow-up, we should either debug why dynamo is not constructing a HOP for the `autograd.Function`, or kill the `_A2A.apply` code entirely, as per this [comment](#1467 (comment))

Here's a good tlparse with my changes, where you can see proper `aten._grouped_mm` nodes in the backward graph: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/9b03e768-73a1-4154-b7af-dcd79cfc754d/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Stack from ghstack (oldest at bottom):

bdhirsh added a commit that referenced this pull request Aug 20, 2025
xmfan commented Aug 20, 2025

@bdhirsh I have a fix for the root cause here: pytorch/pytorch#161036

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 22, 2025

Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (`requires_grad`) to determine whether to build a HOP or not: https://github.com/pytorch/pytorch/blob/5ee464db5c4293ac09521f9069fa7d2106680a7f/torch/_dynamo/variables/misc.py#L671

Stared at this with @anijain2305 yesterday: `Tensor.__setitem__` can update tensor metadata, so we can just run fake prop and extract the output metadata from the updated FakeTensor.
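The metadata update in question can be seen in plain eager PyTorch (a minimal sketch, not the actual repro from the linked issue): an in-place `__setitem__` with a grad-requiring source flips `requires_grad` on the destination tensor, so any cached FakeTensor metadata captured before the setitem is stale.

```python
import torch

dst = torch.zeros(4)                      # requires_grad=False
src = torch.ones(2, requires_grad=True)
assert not dst.requires_grad

dst[:2] = src   # Tensor.__setitem__ records a grad path into dst

# metadata changed in place: dst now requires grad
assert dst.requires_grad
```

If dynamo keys its "should this become a HOP?" decision off the pre-setitem `requires_grad=False`, it will skip building the backward, which matches the missing-backward-graph symptom above.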

FIXES #160901

It should also be the root cause behind the issue in pytorch/torchtitan#1604 @bdhirsh  @ruisizhang123

Pull Request resolved: #161036
Approved by: https://github.com/anijain2305
ghstack dependencies: #160805