Hello :)
I’d like to use Tutel as the MoE layer implementation in Nanotron to train a Qwen3-MoE 15B model from scratch with 128 experts and top-k = 8.
- Cluster: SLURM, up to 256 nodes
- GPUs: 4× A100 64 GB per node
- Goal: scale across 32–1,024 GPUs with EP/TP/DP/PP
- Is a similar configuration (perhaps for Qwen3-30B-A3B) supported out of the box, or are patches required to enable a Tutel backend (e.g., a `moe_config.backend: tutel` switch)? A rough sketch of the layer such a backend would need to construct is included after this list.
- What parallelism layout (EP/TP/PP/DP) would you recommend for 32–1,024 GPUs with 128 experts and k = 8? Any guidance on expert placement to minimize inter-node all-to-all? (An illustrative layout-arithmetic sketch also follows this list.)
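To clarify the first question: I'm not assuming Nanotron exposes such a config key today. Below is a hypothetical sketch of the Tutel layer construction a backend patch would presumably need to wrap, based on Tutel's public `moe_layer` API; all sizes and the expert-parallel degree are placeholders, and the Nanotron integration point is entirely my assumption.

```python
# Hypothetical sketch only: the Tutel layer a `backend: tutel` patch might construct
# inside Nanotron's MoE block. Sizes are placeholders for the 128-expert / k=8 setup,
# and torch.distributed is assumed to be initialized across the expert-parallel group.
import torch.nn.functional as F
from tutel import moe as tutel_moe

num_global_experts = 128      # total experts in the model
ep_world_size = 16            # assumed expert-parallel group size (hypothetical)
model_dim = 4096              # placeholder hidden size
ffn_hidden = 1024             # placeholder per-expert FFN size

moe_block = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 8},       # top-k = 8 routing
    model_dim=model_dim,
    experts={
        # Plain 2-layer FFN expert for illustration; a Qwen-style gated (SwiGLU)
        # expert would need a custom expert module instead.
        'type': 'ffn',
        'count_per_node': num_global_experts // ep_world_size,  # experts hosted per EP rank
        'hidden_size_per_expert': ffn_hidden,
        'activation_fn': lambda x: F.silu(x),
    },
    # Mark expert params so the data-parallel all-reduce skips them
    # (they are sharded across EP ranks, not replicated).
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)
```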
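To make the second question concrete, here is purely illustrative arithmetic for one candidate factorization at the top end (1,024 GPUs), assuming a Megatron-style scheme where EP subdivides the data-parallel group. None of these degrees are proposals; the actual recommendation is exactly what I'm asking for.

```python
# Illustrative arithmetic only -- one candidate layout at 1,024 GPUs, assuming
# EP is carved out of the data-parallel dimension. The numbers just show the
# divisibility constraints I'm trying to satisfy, not a recommended config.
world_size = 1024
tp, pp, dp = 4, 2, 128                      # tp * pp * dp == world_size
assert tp * pp * dp == world_size

num_experts, top_k = 128, 8
ep = 16                                     # expert-parallel degree; must divide dp
assert dp % ep == 0 and num_experts % ep == 0
experts_per_ep_rank = num_experts // ep     # 128 / 16 = 8 experts hosted per EP rank

gpus_per_node = 4                           # 4x A100 per node (cluster spec above)
# Even if an EP group's 16 ranks were packed onto consecutive GPUs, they would span
# at least 16 / 4 = 4 nodes, so every top-k dispatch incurs inter-node all-to-all --
# hence the question about expert placement / hierarchical all-to-all.
min_nodes_per_ep_group = -(-ep // gpus_per_node)
print(tp, pp, dp, ep, experts_per_ep_rank, min_nodes_per_ep_group)
```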
Many thanks!