
Conversation

fegin
Contributor

@fegin fegin commented Aug 26, 2025

NoParallel should not belong in `expert_parallel.py`. This PR moves it to `torchtitan.distributed.__init__.py`.
@meta-cla bot added the CLA Signed label on Aug 26, 2025
Contributor

@tianyu-l tianyu-l left a comment


I thought about it more, and I would still recommend we put NoParallel into tensor_parallel.py. Here's my reasoning:

We use NoParallel in the following cases:

  1. In the MoE layer, moe.router.gate (https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/infra/parallelize.py#L432-L444): together with moe.shared_experts, it sits outside the EP region (moe.experts). The only place we apply NoParallel is when TP is used for non-EP params (including moe.router.gate, moe.shared_experts, and the attention / mlp layers), so it has nothing to do with ExpertParallel (see the sketch below).
  2. In DeepSeek V3 (https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/deepseek_v3/infra/parallelize.py#L208) and Qwen3 (https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/qwen3/infra/parallelize.py#L200), where it is used in the TP plan.

Technically, I wrote this NoParallel class to deal with the mesh mismatch between such modules and other TP'lized modules, so the reason we have it is purely TP.

For EP, the mesh mismatch exists anyway and can't be solved by this NoParallel class. In gradient clipping, we treat the EP mesh and the non-EP mesh separately, where NoParallel helps align all non-EP params on their TP mesh.
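To make the TP-only usage concrete, here is a hedged sketch of where `NoParallel` slots into a TP plan. The entries other than `moe.router.gate` are illustrative rather than the exact contents of the linked `parallelize.py`, and `transformer_block` / `tp_mesh` are assumed to come from the surrounding parallelize function:

```python
# Illustrative sketch only; the real plan lives in the linked parallelize.py.
# NoParallel is applied alongside the TP styles for non-EP params, so that a
# param like moe.router.gate becomes a DTensor on the TP mesh and composes with
# the DTensor inputs/outputs of neighbouring TP'lized modules.
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from torchtitan.distributed import NoParallel  # new location after this PR

tp_plan = {
    "attention.wq": ColwiseParallel(),   # illustrative TP entries
    "attention.wo": RowwiseParallel(),
    "moe.router.gate": NoParallel(),     # outside the EP region, no sharding,
                                         # but still needs to live on the TP mesh
}
parallelize_module(transformer_block, tp_mesh, tp_plan)
```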

@wconstab
Contributor

Just for my understanding: this is only needed because we break in and out of DTensor at the TP boundaries, right?

Remind me why we do this. Was it to avoid missing DTensor operators, or for performance, or ...?

@fegin
Contributor Author

fegin commented Aug 26, 2025

> Remind me why we do this. Was it to avoid missing DTensor operators, or for performance, or ...?

@wconstab We sometimes need to replicate weights, and we either use NoParallel or plain tensors for that. If the input is a DTensor, we have to use NoParallel.
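A minimal sketch of that constraint, assuming a recent PyTorch with the public `torch.distributed.tensor` API and launched under `torchrun` (the shapes and the NCCL backend are illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = init_device_mesh("cuda", (dist.get_world_size(),))  # one flat TP mesh

x = distribute_tensor(torch.randn(8, 16), mesh, [Replicate()])  # DTensor input
w_plain = torch.randn(16, 16, device="cuda")                    # plain-tensor weight

# x @ w_plain raises: DTensor does not implicitly replicate a non-scalar plain
# Tensor. Wrapping the weight as a replicated DTensor on the same mesh (which is
# what NoParallel arranges for module parameters) keeps the op fully in DTensor:
w_replicated = distribute_tensor(w_plain, mesh, [Replicate()])
y = x @ w_replicated
```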

@fegin
Contributor Author

fegin commented Aug 26, 2025

@tianyu-l

NoParallel itself is more like a complement to DTensor, since we don't support mixed DTensor + Tensor operations (implicit replication? cc @wconstab). While it is primarily used by TP, it is semantically closer to a general util. That's why I put it in __init__.py. But I don't have a strong opinion on this. We can move it to TP, and if we get a more general use case, then we can move it back to __init__.py.
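For intuition about why this reads as a general DTensor utility rather than a TP-specific one, here is a rough sketch of the idea behind such a style. This is not the actual torchtitan `NoParallel` implementation; the class name is made up, and it simply replicates params and lifts inputs into DTensor so nothing mixes DTensor with plain Tensor:

```python
import torch
from torch.distributed.tensor import DTensor, Replicate, distribute_module, distribute_tensor
from torch.distributed.tensor.parallel import ParallelStyle


class NoParallelSketch(ParallelStyle):
    """Sketch of a 'no sharding' style: params and inputs become replicated DTensors."""

    def _apply(self, module, device_mesh):
        def replicate_params(name, mod, mesh):
            # Wrap each parameter as a Replicate-placed DTensor on the mesh.
            for pname, param in mod.named_parameters(recurse=False):
                mod.register_parameter(
                    pname,
                    torch.nn.Parameter(distribute_tensor(param, mesh, [Replicate()])),
                )

        def input_fn(mod, inputs, mesh):
            # Lift plain-tensor inputs into replicated DTensors.
            return tuple(
                DTensor.from_local(t, mesh, [Replicate()], run_check=False)
                if isinstance(t, torch.Tensor) and not isinstance(t, DTensor)
                else t
                for t in inputs
            )

        return distribute_module(module, device_mesh, replicate_params, input_fn)
```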

@tianyu-l
Contributor

@fegin
I agree this is general, not tied to TP.
Maybe I just feel grouping it with TP / EP makes slightly more sense than putting it in the root file, together with PP and ParallelDims.

I'm OK either way. Up to you.
