Hello :)
I’d like to use Tutel as the MoE layer implementation in Nanotron to train a Qwen3-MoE 15B model from scratch with 128 experts and top-k = 8.
- Cluster: SLURM, up to 256 nodes
- GPUs: 4× A100 64 GB per node
- Goal: scale across 32–1,024 GPUs with EP/TP/DP/PP
- Is a similar configuration (perhaps for Qwen3-30B-A3B) supported out of the box, or are patches required to enable a Tutel backend (e.g., a `moe_config.backend: tutel` switch)? A rough sketch of the layer such a backend would need to construct is included after this list.
- What parallelism layout (EP/TP/PP/DP) would you recommend for 32–1,024 GPUs with 128 experts and k = 8? Any guidance on expert placement to minimize inter-node all-to-all? (An illustrative layout-arithmetic sketch also follows this list.)
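To clarify the first question: I'm not assuming Nanotron exposes such a config key today. Below is a hypothetical sketch of the Tutel layer construction a backend patch would presumably need to wrap, based on Tutel's public `moe_layer` API; all sizes and the expert-parallel degree are placeholders, and the Nanotron integration point is entirely my assumption.

```python
# Hypothetical sketch only: the Tutel layer a `backend: tutel` patch might construct
# inside Nanotron's MoE block. Sizes are placeholders for the 128-expert / k=8 setup,
# and torch.distributed is assumed to be initialized across the expert-parallel group.
import torch.nn.functional as F
from tutel import moe as tutel_moe

num_global_experts = 128      # total experts in the model
ep_world_size = 16            # assumed expert-parallel group size (hypothetical)
model_dim = 4096              # placeholder hidden size
ffn_hidden = 1024             # placeholder per-expert FFN size

moe_block = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 8},       # top-k = 8 routing
    model_dim=model_dim,
    experts={
        # Plain 2-layer FFN expert for illustration; a Qwen-style gated (SwiGLU)
        # expert would need a custom expert module instead.
        'type': 'ffn',
        'count_per_node': num_global_experts // ep_world_size,  # experts hosted per EP rank
        'hidden_size_per_expert': ffn_hidden,
        'activation_fn': lambda x: F.silu(x),
    },
    # Mark expert params so the data-parallel all-reduce skips them
    # (they are sharded across EP ranks, not replicated).
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)
```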
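To make the second question concrete, here is purely illustrative arithmetic for one candidate factorization at the top end (1,024 GPUs), assuming a Megatron-style scheme where EP subdivides the data-parallel group. None of these degrees are proposals; the actual recommendation is exactly what I'm asking for.

```python
# Illustrative arithmetic only -- one candidate layout at 1,024 GPUs, assuming
# EP is carved out of the data-parallel dimension. The numbers just show the
# divisibility constraints I'm trying to satisfy, not a recommended config.
world_size = 1024
tp, pp, dp = 4, 2, 128                      # tp * pp * dp == world_size
assert tp * pp * dp == world_size

num_experts, top_k = 128, 8
ep = 16                                     # expert-parallel degree; must divide dp
assert dp % ep == 0 and num_experts % ep == 0
experts_per_ep_rank = num_experts // ep     # 128 / 16 = 8 experts hosted per EP rank

gpus_per_node = 4                           # 4x A100 per node (cluster spec above)
# Even if an EP group's 16 ranks were packed onto consecutive GPUs, they would span
# at least 16 / 4 = 4 nodes, so every top-k dispatch incurs inter-node all-to-all --
# hence the question about expert placement / hierarchical all-to-all.
min_nodes_per_ep_group = -(-ep // gpus_per_node)
print(tp, pp, dp, ep, experts_per_ep_rank, min_nodes_per_ep_group)
```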
Many thanks!