Initial compile support for llama4 #1365

xmfan · 2025-07-03T16:30:05Z

Status

🚧 NEED pytorch/pytorch changes 🚧
- [c10d] support dynamic shapes for all_to_all_single_autograd pytorch#157521
- Fix grouped MM output strides when compiled but not max-autotuned pytorch#158143
tlparse: https://fburl.com/6g9djppf
- staticmethod on user-defined classes can not be generically supported, I moved those out.
- Remaining graph breaks are due to:
  - MoE experts being wrapped by FSDP.
    - FSDP hooks are explicitly disabled, we are graph breaking when the nn.Module.call invokes the forward pre and post hooks.
    - Since we will not be tracing FSDP wrapper, these graph breaks are considered fundamental until we migrate to SimpleFSDP.
  - Not technically a graph break, but the EP DTensor hooks are being traced in their own graphs. Still needs investigation.

We don't have a good way in compile to specify fullgraph=True except for FSDP hooks at the moment. We can either leave it fullgraph=False or just wrap the experts model code in set_fullgraph(False)/set_fullgraph(True).

Repro

tested on debug model NGPU=2 CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --parallelism.data_parallel_shard_degree=2 --parallelism.expert_parallel_degree=2 --training.compile
logs: https://gist.github.com/xmfan/41b822d9f09eb07fee62d684a061cec1

memory: 2.20GiB -> 1.42GiB
speedup: no big change, need to check with actual model

tianyu-l · 2025-07-16T19:35:57Z

pls try rebase #1403
which has non-persistent buffer for load-balancing fields

This is the original issue I mentioned
https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/model/model.py#L382
where the freqs_cis should also be non-persistent.

tianyu-l

Left some questions.

Could also address #1365 (comment)

rebase and see if the non-persistent buffer tokens_per_expert is causing trouble
manually try change freqs_cis to non-persistent and see if the issue is still there. https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/model/model.py#L388

tianyu-l · 2025-07-24T20:48:46Z

torchtitan/experiments/llama4/infra/parallelize.py

+    Apply torch.compile to each TransformerBlock, which makes compilation efficient due to
+    repeated structure. Alternatively one can compile the whole model (after applying DP).
+    """
+    torch._dynamo.config.fail_on_recompile_limit_hit = True


What is this for?
Other than this, it seems we can just apply the same function llama 3 uses.

this is to loud error if we recompile more than 8 times (default). currently, we would just silently fallback to eager if it happens.

should we do the same to Llama 3? If so we can still reuse this function

tianyu-l · 2025-07-24T20:51:29Z

torchtitan/experiments/llama4/model/moe.py

                self.w1, self.w2, self.w3, x, num_tokens_per_expert
            )

-    # TODO: keeping this for-loop implementation for comparison


staticmethod on user-defined classes can not be generically supported, I moved those out.

Could you explain more? Does it mean if we move them out, then torch.compile can trace them in the same graph as the caller module is in?

tianyu-l · 2025-07-24T20:52:49Z

torchtitan/experiments/llama4/model/moe.py

@@ -28,89 +97,21 @@ def __init__(
        self.w3 = nn.Parameter(torch.empty(num_experts, dim, hidden_dim))
        self.use_grouped_mm = use_grouped_mm

+    @torch._dynamo.set_fullgraph(True)


what is this annotation for?

Compiling the block with fullgraph=False could allow graph breaks to creep in silently with dynamo changes, and we wouldn't know about them until we manually inspect the graph or suspect QPS to have regressed.

This API to more granularly control the fullgraph argument of torch.compile, you can flip it on and off within a compiled region. In this case, we allow graph breaks between GroupedExperts.call and GroupedExperts.forward, i.e. allow graph break on the forward hooks from FSDP

In addition to FSDP comms, EP a2a also happens before & after GroupedExperts.forward. Does it mean it's still not fine-grained enough to capture graphs in EP?

tianyu-l · 2025-07-24T20:54:19Z

torchtitan/experiments/llama4/model/moe.py

@@ -297,7 +298,8 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
        )

        # shape (bs*slen*top_k, dim)
-        routed_output = self.experts(routed_input, num_tokens_per_expert)
+        with torch._dynamo.set_fullgraph(False):


IIUC, this annotation is for the FSDP caused graph break, correct?
Can we possibly incur this in the apply_compile function. Technically this change is model-intrusively, despite being small.

This API can't decorate GroupedExperts.call right now. If it's a problem, we can just compile MoE with fullgraph=False

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 3, 2025

tianyu-l force-pushed the ep branch from 08c1ff1 to b0dffa1 Compare July 8, 2025 04:27

Base automatically changed from ep to main July 8, 2025 16:47

xmfan force-pushed the ep-compile branch from 050291b to 2ef0e26 Compare July 9, 2025 23:48

xmfan mentioned this pull request Jul 11, 2025

[inductor] grouped_mm is autotuning under torch.compile default mode pytorch/pytorch#158042

Closed

compile, with pytorch/pytorch 26807dcf277

06725f1

xmfan force-pushed the ep-compile branch from 3cde52e to 06725f1 Compare July 16, 2025 17:50

xmfan marked this pull request as ready for review July 16, 2025 17:51

xmfan requested a review from tianyu-l July 16, 2025 17:51

xmfan changed the title ~~[WIP] Compile for dp2ep~~ Initial compile support for llama4 Jul 16, 2025

tianyu-l reviewed Jul 24, 2025

View reviewed changes

tianyu-l mentioned this pull request Aug 2, 2025

[llama4] add apply_compile for moe, where fullgraph=False for moe layers #1519

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Initial compile support for llama4 #1365

Initial compile support for llama4 #1365

Uh oh!

xmfan commented Jul 3, 2025 •

edited

Loading

Uh oh!

tianyu-l commented Jul 16, 2025

Uh oh!

tianyu-l left a comment

Uh oh!

tianyu-l Jul 24, 2025

Uh oh!

xmfan Jul 24, 2025

Uh oh!

tianyu-l Aug 2, 2025

Uh oh!

tianyu-l Jul 24, 2025

Uh oh!

xmfan Jul 24, 2025

Uh oh!

tianyu-l Jul 24, 2025

Uh oh!

xmfan Jul 24, 2025

Uh oh!

tianyu-l Aug 2, 2025

Uh oh!

tianyu-l Jul 24, 2025

Uh oh!

xmfan Jul 24, 2025

Uh oh!

Uh oh!

Initial compile support for llama4 #1365

Are you sure you want to change the base?

Initial compile support for llama4 #1365

Uh oh!

Conversation

xmfan commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Status

Repro

Uh oh!

tianyu-l commented Jul 16, 2025

Uh oh!

tianyu-l left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xmfan commented Jul 3, 2025 •

edited

Loading