Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Support for a new model from Huawei: https://huggingface.co/IntervitensInc/pangu-pro-moe-model
https://gitcode.com/ascend-tribe/pangu-pro-moe-model
Motivation
It seems to be optimized for balanced multi-device inference:
We proposed a new type of Mixture of Grouped Experts (MoGE), which groups experts in the expert selection stage and constrains tokens to activate an equal number of experts in each group, thereby achieving natural load balancing between devices. Based on the MoGE architecture, we built a Pangu Pro MoE model with a total parameter size of 72B and an activation parameter size of 16B:
- MoGE configuration: 4 shared experts; 64 routed experts divided into 8 groups, with 1 expert activated per group
- Pre-training: 15T tokens
So a consistent 1 expert per device across 8 devices, which is neat.
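
For anyone skimming the routing description above, here is a minimal sketch of how grouped top-1 routing of this kind could look. This is purely illustrative and assumes made-up names and shapes (e.g. `moge_route`, a `(n_tokens, 64)` logit matrix); it is not taken from the Pangu Pro MoE code or from any existing llama.cpp implementation.

```python
# Illustrative sketch of MoGE-style grouped routing: 64 routed experts split
# into 8 groups, exactly 1 expert activated per group, plus 4 shared experts
# that every token always uses. All names/shapes here are assumptions.
import numpy as np

N_EXPERTS = 64      # routed experts
N_GROUPS = 8        # one group per device in the balanced setup
GROUP_SIZE = N_EXPERTS // N_GROUPS
N_SHARED = 4        # shared experts, always active for every token

def moge_route(router_logits: np.ndarray) -> np.ndarray:
    """Pick one routed expert per group for each token.

    router_logits: (n_tokens, N_EXPERTS) scores from the gating network.
    Returns an array of shape (n_tokens, N_GROUPS) holding the chosen
    expert id in each group, so every token activates exactly N_GROUPS
    routed experts -- one per group, hence one per device.
    """
    n_tokens = router_logits.shape[0]
    grouped = router_logits.reshape(n_tokens, N_GROUPS, GROUP_SIZE)
    best_in_group = grouped.argmax(axis=-1)            # (n_tokens, N_GROUPS)
    group_offsets = np.arange(N_GROUPS) * GROUP_SIZE   # map back to global expert ids
    return best_in_group + group_offsets

# Example: route 2 tokens; each row prints 8 expert ids, one per group.
logits = np.random.randn(2, N_EXPERTS)
print(moge_route(logits))
```

Because every token always lands on exactly one expert per group, the per-device load is uniform by construction, which is the load-balancing property the model card highlights.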
New models just won't stop coming!