Replies: 2 comments
I think the allgather dispatcher first gathers all tokens, i.e. [S * EP] tokens, onto the local rank, and then uses a mask to pick out the tokens needed by the local experts.
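For what it's worth, a minimal sketch of that gather-then-mask idea, written with plain torch.distributed rather than the actual MoEAllGatherTokenDispatcher code; the function and argument names (gather_then_mask, ep_group, local_expert_ids) are illustrative assumptions, and top-1 routing is assumed for simplicity:

```python
import torch
import torch.distributed as dist

def gather_then_mask(local_hidden, local_expert_ids, num_experts, ep_group):
    """local_hidden: [num_local_tokens, H]; local_expert_ids: [num_local_tokens] (top-1 routing)."""
    ep_size = dist.get_world_size(ep_group)
    rank = dist.get_rank(ep_group)

    # 1. All-gather the tokens and their routed expert ids from every EP rank,
    #    so this rank temporarily holds ep_size * num_local_tokens tokens.
    gathered_hidden = [torch.empty_like(local_hidden) for _ in range(ep_size)]
    gathered_ids = [torch.empty_like(local_expert_ids) for _ in range(ep_size)]
    dist.all_gather(gathered_hidden, local_hidden, group=ep_group)
    dist.all_gather(gathered_ids, local_expert_ids, group=ep_group)
    global_hidden = torch.cat(gathered_hidden, dim=0)   # [ep_size * num_local_tokens, H]
    global_ids = torch.cat(gathered_ids, dim=0)

    # 2. Keep only the tokens routed to the experts hosted on this rank.
    experts_per_rank = num_experts // ep_size
    lo, hi = rank * experts_per_rank, (rank + 1) * experts_per_rank
    mask = (global_ids >= lo) & (global_ids < hi)
    return global_hidden[mask]
```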
Same question. The gather generates EP copies of the routing map for the full B * S tokens.
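A rough shape illustration of that duplication, with placeholder values for B, S, EP, and the expert count (not taken from the thread) and a boolean top-1-style routing map assumed:

```python
import torch

B, S, num_experts, EP = 2, 1024, 8, 8

# Routing map computed on one rank for its own B * S tokens (boolean: token -> expert).
local_routing_map = torch.zeros(B * S, num_experts, dtype=torch.bool)

# After the gather over the EP group, every rank holds the routing maps of all EP
# ranks stacked along the token dimension: EP * B * S rows instead of B * S.
gathered_routing_map = torch.cat([local_routing_map] * EP, dim=0)

print(local_routing_map.shape)      # torch.Size([2048, 8])
print(gathered_routing_map.shape)   # torch.Size([16384, 8])
```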
Original question:
I have a node with 8 GPUs. The model has 8 experts and I use TP=1, EP=8, with sequence parallelism on, so I expect each GPU to host one expert. I use the MoEAllGatherTokenDispatcher. The size of the hidden_states passed to token_permutation is [S/TP, B, H], which is actually [S, B, H] because TP=1. Then why do we still need tensor_parallel.gather_from_sequence_parallel_region_to_moe to gather and form a global_hidden_states whose size becomes [S * EP, B, H]? In my view, each rank already has a copy of the [S, B, H] hidden_states, so why is there still a need for the all-gather? There are only B * S tokens to compute, but now each rank holds B * S * EP tokens, because the gather uses get_tensor_and_expert_parallel_group(), whose size is TP * EP = 8.
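To make the sizes concrete, here is the shape arithmetic for this configuration as a sketch; the S, B, H values are placeholders, and the comments restate the reasoning in the question rather than the actual Megatron-LM internals:

```python
# Configuration from the question: TP=1, EP=8, sequence parallelism on.
S, B, H = 1024, 2, 4096
TP, EP = 1, 8

# Input to token_permutation: the sequence-parallel shard of the activations.
local_shape = (S // TP, B, H)             # (1024, 2, 4096), i.e. [S, B, H] since TP=1

# gather_from_sequence_parallel_region_to_moe gathers over the TP x EP group
# (size TP * EP = 8), concatenating along the sequence dimension.
global_shape = (S // TP * TP * EP, B, H)  # (8192, 2, 4096), i.e. [S * EP, B, H]

tokens_to_compute = B * S                 # 2048 tokens the question says are actually needed
tokens_materialized = B * S * EP          # 16384 tokens held per rank after the gather
print(local_shape, global_shape, tokens_to_compute, tokens_materialized)
```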