
Conversation

fegin
Contributor

@fegin fegin commented Aug 26, 2025

NoParallel should not belong in `expert_parallel.py`. This PR moves it to `torchtitan.distributed.__init__.py`.
@meta-cla bot added the CLA Signed label on Aug 26, 2025
Contributor

@tianyu-l tianyu-l left a comment


I thought about it more, and I would still recommend we put NoParallel into tensor_parallel.py. Here's my reasoning:

We use NoParallel in the following cases:

  1. In the MoE layer, moe.router.gate (https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/infra/parallelize.py#L432-L444): together with moe.shared_experts, it sits outside the EP region (moe.experts). The only place we apply NoParallel is when TP is used for non-EP params (including moe.router.gate, moe.shared_experts, and the attention / mlp layers), so it has nothing to do with ExpertParallel (see the sketch below).
  2. In DeepSeek V3 (https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/deepseek_v3/infra/parallelize.py#L208) and Qwen3 (https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/qwen3/infra/parallelize.py#L200), where it is used in the TP plan.

Technically, I wrote this NoParallel class to deal with the mesh mismatch between such modules and other TP'lized modules, so the reason we have it is purely TP.

For EP, the mesh mismatch exists anyway and can't be solved by this NoParallel class. In gradient clipping, we treat the EP mesh and the non-EP mesh separately, where NoParallel helps align all non-EP params on their TP mesh.
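To make the TP-only usage concrete, here is a hedged sketch of where `NoParallel` slots into a TP plan. The entries other than `moe.router.gate` are illustrative rather than the exact contents of the linked `parallelize.py`, and `transformer_block` / `tp_mesh` are assumed to come from the surrounding parallelize function:

```python
# Illustrative sketch only; the real plan lives in the linked parallelize.py.
# NoParallel is applied alongside the TP styles for non-EP params, so that a
# param like moe.router.gate becomes a DTensor on the TP mesh and composes with
# the DTensor inputs/outputs of neighbouring TP'lized modules.
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from torchtitan.distributed import NoParallel  # new location after this PR

tp_plan = {
    "attention.wq": ColwiseParallel(),   # illustrative TP entries
    "attention.wo": RowwiseParallel(),
    "moe.router.gate": NoParallel(),     # outside the EP region, no sharding,
                                         # but still needs to live on the TP mesh
}
parallelize_module(transformer_block, tp_mesh, tp_plan)
```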

@wconstab
Contributor

Just for my understanding: this is only needed because we break in and out of DTensor at the TP boundaries, right?

Remind me why we do this. Was it to avoid missing DTensor operators, or for performance, or ...?

@fegin
Contributor Author

fegin commented Aug 26, 2025

> Remind me why we do this. Was it to avoid missing DTensor operators, or for performance, or ...?

@wconstab We sometimes need to replicate weights, and we either use NoParallel or plain tensors for that. If the input is a DTensor, we have to use NoParallel.
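A minimal sketch of that constraint, assuming a recent PyTorch with the public `torch.distributed.tensor` API and launched under `torchrun` (the shapes and the NCCL backend are illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = init_device_mesh("cuda", (dist.get_world_size(),))  # one flat TP mesh

x = distribute_tensor(torch.randn(8, 16), mesh, [Replicate()])  # DTensor input
w_plain = torch.randn(16, 16, device="cuda")                    # plain-tensor weight

# x @ w_plain raises: DTensor does not implicitly replicate a non-scalar plain
# Tensor. Wrapping the weight as a replicated DTensor on the same mesh (which is
# what NoParallel arranges for module parameters) keeps the op fully in DTensor:
w_replicated = distribute_tensor(w_plain, mesh, [Replicate()])
y = x @ w_replicated
```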

@fegin
Contributor Author

fegin commented Aug 26, 2025

@tianyu-l

NoParallel itself is more like a complement to DTensor, since we don't support mixed DTensor + Tensor operations (implicit replication? cc @wconstab). While it is primarily used by TP, it is semantically closer to a general util. That's why I put it in __init__.py. But I don't have a strong opinion on this. We can move it to TP, and if we get a more general use case, then we can move it back to __init__.py.
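For intuition about why this reads as a general DTensor utility rather than a TP-specific one, here is a rough sketch of the idea behind such a style. This is not the actual torchtitan `NoParallel` implementation; the class name is made up, and it simply replicates params and lifts inputs into DTensor so nothing mixes DTensor with plain Tensor:

```python
import torch
from torch.distributed.tensor import DTensor, Replicate, distribute_module, distribute_tensor
from torch.distributed.tensor.parallel import ParallelStyle


class NoParallelSketch(ParallelStyle):
    """Sketch of a 'no sharding' style: params and inputs become replicated DTensors."""

    def _apply(self, module, device_mesh):
        def replicate_params(name, mod, mesh):
            # Wrap each parameter as a Replicate-placed DTensor on the mesh.
            for pname, param in mod.named_parameters(recurse=False):
                mod.register_parameter(
                    pname,
                    torch.nn.Parameter(distribute_tensor(param, mesh, [Replicate()])),
                )

        def input_fn(mod, inputs, mesh):
            # Lift plain-tensor inputs into replicated DTensors.
            return tuple(
                DTensor.from_local(t, mesh, [Replicate()], run_check=False)
                if isinstance(t, torch.Tensor) and not isinstance(t, DTensor)
                else t
                for t in inputs
            )

        return distribute_module(module, device_mesh, replicate_params, input_fn)
```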

@tianyu-l
Contributor

@fegin
I agree this is general, not tied to TP.
Maybe I just feel grouping it with TP / EP makes slightly more sense than putting it in the root file, together with PP and ParallelDims.

I'm OK either way. Up to you.
