Replies: 2 comments
I think the allgather dispatcher first gathers all tokens, i.e. [S * EP] tokens, onto the local rank, and then uses a mask to pick out the tokens needed by the local experts.
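For what it's worth, a minimal sketch of that gather-then-mask idea, written with plain torch.distributed rather than the actual MoEAllGatherTokenDispatcher code; the function and argument names (gather_then_mask, ep_group, local_expert_ids) are illustrative assumptions, and top-1 routing is assumed for simplicity:

```python
import torch
import torch.distributed as dist

def gather_then_mask(local_hidden, local_expert_ids, num_experts, ep_group):
    """local_hidden: [num_local_tokens, H]; local_expert_ids: [num_local_tokens] (top-1 routing)."""
    ep_size = dist.get_world_size(ep_group)
    rank = dist.get_rank(ep_group)

    # 1. All-gather the tokens and their routed expert ids from every EP rank,
    #    so this rank temporarily holds ep_size * num_local_tokens tokens.
    gathered_hidden = [torch.empty_like(local_hidden) for _ in range(ep_size)]
    gathered_ids = [torch.empty_like(local_expert_ids) for _ in range(ep_size)]
    dist.all_gather(gathered_hidden, local_hidden, group=ep_group)
    dist.all_gather(gathered_ids, local_expert_ids, group=ep_group)
    global_hidden = torch.cat(gathered_hidden, dim=0)   # [ep_size * num_local_tokens, H]
    global_ids = torch.cat(gathered_ids, dim=0)

    # 2. Keep only the tokens routed to the experts hosted on this rank.
    experts_per_rank = num_experts // ep_size
    lo, hi = rank * experts_per_rank, (rank + 1) * experts_per_rank
    mask = (global_ids >= lo) & (global_ids < hi)
    return global_hidden[mask]
```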
Same question. The gather generates EP copies of the routing map for the full B * S tokens.
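A rough shape illustration of that duplication, with placeholder values for B, S, EP, and the expert count (not taken from the thread) and a boolean top-1-style routing map assumed:

```python
import torch

B, S, num_experts, EP = 2, 1024, 8, 8

# Routing map computed on one rank for its own B * S tokens (boolean: token -> expert).
local_routing_map = torch.zeros(B * S, num_experts, dtype=torch.bool)

# After the gather over the EP group, every rank holds the routing maps of all EP
# ranks stacked along the token dimension: EP * B * S rows instead of B * S.
gathered_routing_map = torch.cat([local_routing_map] * EP, dim=0)

print(local_routing_map.shape)      # torch.Size([2048, 8])
print(gathered_routing_map.shape)   # torch.Size([16384, 8])
```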
Original question:
I have a node with 8 GPUs. The model has 8 experts and I use TP=1, EP=8, with sequence parallelism on, so I expect each GPU to host one expert. I use the MoEAllGatherTokenDispatcher. The size of the hidden_states passed to token_permutation is [S/TP, B, H], which is actually [S, B, H] because TP=1. Then why do we still need tensor_parallel.gather_from_sequence_parallel_region_to_moe to gather and form a global_hidden_states whose size becomes [S * EP, B, H]? In my view, each rank already has a copy of the [S, B, H] hidden_states, so why is there still a need for the all-gather? There are only B * S tokens to compute, but now each rank holds B * S * EP tokens, because the gather uses get_tensor_and_expert_parallel_group(), whose size is TP * EP = 8.
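To make the sizes concrete, here is the shape arithmetic for this configuration as a sketch; the S, B, H values are placeholders, and the comments restate the reasoning in the question rather than the actual Megatron-LM internals:

```python
# Configuration from the question: TP=1, EP=8, sequence parallelism on.
S, B, H = 1024, 2, 4096
TP, EP = 1, 8

# Input to token_permutation: the sequence-parallel shard of the activations.
local_shape = (S // TP, B, H)             # (1024, 2, 4096), i.e. [S, B, H] since TP=1

# gather_from_sequence_parallel_region_to_moe gathers over the TP x EP group
# (size TP * EP = 8), concatenating along the sequence dimension.
global_shape = (S // TP * TP * EP, B, H)  # (8192, 2, 4096), i.e. [S * EP, B, H]

tokens_to_compute = B * S                 # 2048 tokens the question says are actually needed
tokens_materialized = B * S * EP          # 16384 tokens held per rank after the gather
print(local_shape, global_shape, tokens_to_compute, tokens_materialized)
```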