
Tutel as an MoE backend in Nanotron for Qwen3-MoE 15B (128 experts, top-k=8) #310


Description

@hahahaahaa

Hello :)
I’d like to use Tutel as the MoE layer implementation in Nanotron to train a Qwen3-MoE 15B model from scratch with 128 experts and top-k = 8.

Cluster with SLURM: up to 256 nodes

GPUs: 4× A100 64 GB per node.

Goal: scale across 32–1,024 GPUs with EP/TP/DP/PP.
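For context, here is roughly how I imagine the Tutel layer would be constructed inside Nanotron's MoE block, following the `tutel.moe.moe_layer` usage from Tutel's README. This is only a sketch under my own assumptions: the dimensions are placeholders, and I am guessing that `count_per_node` means the number of local experts per rank in the expert-parallel group.

```python
import torch.nn.functional as F
from tutel import moe as tutel_moe

# Placeholder dimensions -- not the real Qwen3-MoE 15B config.
NUM_EXPERTS = 128      # global expert count
TOP_K = 8              # router top-k
MODEL_DIM = 2048       # hidden size (placeholder)
EXPERT_HIDDEN = 768    # per-expert FFN hidden size (placeholder)
EP_WORLD_SIZE = 32     # expert-parallel group size (placeholder)

moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': TOP_K},
    model_dim=MODEL_DIM,
    experts={
        'type': 'ffn',
        # my reading of Tutel's README: number of local experts per rank
        'count_per_node': NUM_EXPERTS // EP_WORLD_SIZE,
        'hidden_size_per_expert': EXPERT_HIDDEN,
        'activation_fn': lambda x: F.silu(x),
    },
    # keep expert weights out of the data-parallel all-reduce
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)

# forward pass: y = moe_layer(x), with x of shape [batch, seq_len, MODEL_DIM]
```

My questions: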

  1. Is a similar configuration (maybe for Qwen3-30B-A3B) supported out of the box, or are patches required to enable a Tutel backend (e.g., a `moe_config.backend: tutel` switch)?

  2. Recommended parallelism layout (EP/TP/PP/DP) for 32–1,024 GPUs with 128 experts and k=8. Any guidance on expert placement to minimize all-to-all across nodes?

Many thanks!
