QMoE CUDA EP — FP4/FP8/WFP4AFP8 Quantized Mixture-of-Experts + MoE GEMM Refactor by tianleiwu · Pull Request #28467 · microsoft/onnxruntime

tianleiwu · 2026-05-12T00:16:03Z

Description

Update the QMoE contrib operator for the CUDA EP to support quantized Mixture-of-Experts inference with INT4, INT8, FP4 (MXFP4 e2m1), FP8 (e4m3fn), and WFP4AFP8 (mixed FP4 weight × FP8 activation) quantization formats.
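
To make the MXFP4 (e2m1) format concrete, here is a minimal Python sketch of dequantizing one block of packed FP4 codes that share a single E8M0 power-of-two scale. The e2m1 value table and the E8M0 bias of 127 follow the OCP Microscaling definition; the function names and the nibble order (low nibble first) are assumptions for illustration and may not match the packing this operator uses.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit e2m1 code (0..15) to a float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_mxfp4_block(packed: bytes, scale_e8m0: int) -> list:
    """Dequantize packed FP4 codes that share one E8M0 scale.

    Assumption: two codes per byte, low nibble first. The packing order
    used by the real operator may differ. The NaN scale code 0xFF is not
    handled in this sketch.
    """
    scale = 2.0 ** (scale_e8m0 - 127)  # E8M0 is a biased power-of-two exponent
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values


if __name__ == "__main__":
    # Scale code 128 means a multiplier of 2.0.
    print(dequantize_mxfp4_block(bytes([0x21, 0xF7]), 128))
```

Because the scale is a pure power of two, applying it only shifts the exponent of each decoded value and introduces no additional rounding.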

This also refactors the existing MoE GEMM infrastructure to support TMA warp-specialized grouped GEMM on Hopper (SM90), native MXFP4 on Blackwell (SM120), and block-scaled tensor ops on SM100+, with automatic fallback to dequantization on older architectures.
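
The architecture-based selection described above can be pictured with a small Python sketch. This is a simplified reading of the paragraph, not the dispatch logic in the PR: the enum, the function names, and the exact thresholds are hypothetical, and the real code also depends on data type and build options.

```python
from enum import Enum


class KernelFamily(Enum):
    # Hypothetical names for the three kernel families mentioned in this PR.
    AMPERE_GEMM_GROUPED = "ampere_gemm_grouped"
    TMA_WARP_SPECIALIZED = "tma_warp_specialized"
    BLOCK_SCALED_TENSOR_OP = "block_scaled_tensor_op"


def select_kernel_family(sm: int) -> KernelFamily:
    """Pick a kernel family from the SM version (e.g. 80, 90, 100, 120).

    Assumed thresholds: block-scaled tensor ops on SM100+, TMA
    warp-specialized grouped GEMM on SM90, Ampere-style grouped GEMM
    (with dequantization) on anything older.
    """
    if sm >= 100:
        return KernelFamily.BLOCK_SCALED_TENSOR_OP
    if sm >= 90:
        return KernelFamily.TMA_WARP_SPECIALIZED
    return KernelFamily.AMPERE_GEMM_GROUPED


def has_native_mxfp4(sm: int) -> bool:
    """Assumption: native MXFP4 is used on SM120+, dequantization otherwise."""
    return sm >= 120


if __name__ == "__main__":
    for sm in (75, 80, 90, 100, 120):
        print(sm, select_kernel_family(sm).value, has_native_mxfp4(sm))
```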

Note that this is modified from the TensorRT-LLM MoE implementation. A section in moe_qmoe.md describes the modifications.

Summary of Changes

New QMoE Operator

File	Change
`onnxruntime/core/graph/contrib_ops/contrib_defs.cc`	Register `QMoE` op schema (com.microsoft domain, opset 1)
`onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc/h`	QMoE CUDA kernel implementation with dynamic runner selection
`onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu/h`	Softmax top-k router, sparse mixer, zero-point pre-packing kernels
`onnxruntime/contrib_ops/cuda/moe/moe_base.h`	Shared MoE base class updates for quantization attributes
`docs/contrib_ops/cuda/moe_qmoe.md`	Comprehensive operator documentation (inputs, attributes, quantization formats)
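
For readers unfamiliar with the "softmax top-k router" listed above, the following Python sketch shows the general technique on the logits of a single token. It is a CPU reference illustration only, not the CUDA kernel in `qmoe_kernels.cu`; the function names and the `normalize` flag are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k, normalize=True):
    """Return (expert_indices, routing_weights) for one token.

    Experts are ranked by softmax probability and the top k are kept.
    With normalize=True the kept weights are rescaled to sum to 1,
    which is an assumed option and not necessarily the operator default.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    experts = ranked[:k]
    weights = [probs[i] for i in experts]
    if normalize:
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return experts, weights


if __name__ == "__main__":
    experts, weights = route_top_k([1.0, 3.0, 2.0, 0.0], k=2)
    print(experts, weights)
```

The token's output is then the weighted sum of the selected experts' outputs, using the returned weights.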

MoE GEMM Refactor

File	Change
`onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels.h`	Unified `CutlassMoeFCRunner` template with FP4/FP8/WFP4AFP8 specializations
`onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_template_dispatch.h`	Three-family dispatch: Ampere GemmGrouped, TMA warp-specialized, block-scaled tensor ops
`onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_profiler.cc/h`	MoE-specific GEMM tactic profiler for auto-tuning
`onnxruntime/contrib_ops/cuda/llm/moe_gemm/common.h`	Shared MoE GEMM types and config structs
`onnxruntime/contrib_ops/cuda/llm/moe_gemm/launchers/`	SM80/SM90/SM120 launcher instantiations (including generated .cu files)

CUTLASS Extensions

File	Change
`onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/arch/`	Grid dependency control, TMA copy traits, multi-mem copy operations
`onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/collective/`	Mixed-input and gated GEMM collective builders for SM90
`onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/`	Fused MoE kernel traits/routines, MoE problem visitors, gated GEMM kernels
`onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/epilogue/`	MoE finalize epilogue, per-row/per-col scale epilogues
`onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/system_barrier.h`	System barrier for multi-CTA synchronization

Common CUDA Utilities

onnxruntime/contrib_ops/cuda/llm/common/cuda_fp8_utils.cu/h — FP8 conversion, quantization, dequantization kernels
onnxruntime/contrib_ops/cuda/llm/common/memory_utils.cu/h — Device memory transpose, permute, type conversion utilities
onnxruntime/contrib_ops/cuda/llm/common/cuda_type_utils.cuh — Unified type traits for half/bfloat16/float/fp8/fp4
onnxruntime/contrib_ops/cuda/llm/common/quantization.h — Quantization parameter structs and helpers
onnxruntime/contrib_ops/cuda/llm/common/reduce_kernel_utils.cuh — Warp/block reduction primitives
onnxruntime/contrib_ops/cuda/llm/kernels/quantization.cuh — FP4/FP8 quantization kernels
onnxruntime/contrib_ops/cuda/llm/kernels/pre_quant_scale_kernel.cu/h — Pre-quantization scaling kernel

GEMM Profiler Refactor

File	Change
`onnxruntime/contrib_ops/cuda/llm/gemm_profiler.cc/h`	Refactored GEMM profiler interface for tactic selection
`onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc/h`	Updated heuristics for new kernel families
`onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm_configs.h`	Extended GEMM config enums for TMA warp-specialized and gated configs

Build System

File	Change
`cmake/CMakeLists.txt`	Add `ENABLE_FP4`, `ENABLE_FP8`, `ENABLE_CUDA_FP4_QMOE`, `ORT_QUICK_BUILD`, `PLACEHOLDER_KERNELS` options
`cmake/external/cuda_configuration.cmake`	FP4/FP8 capability detection based on CUDA version and SM arch
`cmake/external/cutlass.cmake`	CUTLASS version bump
`cmake/onnxruntime_providers_cuda.cmake`	Add MoE GEMM source files and conditional FP4/FP8 kernel compilation
`cmake/onnxruntime_python.cmake`	Add `onnxruntime_pybind_quant.cc` for Python quantization bindings

Python Quantization Bindings

File	Change
`onnxruntime/python/onnxruntime_pybind_quant.cc`	C++ pybind module for MoE weight preprocessing (quantize, pack, preprocess)
`onnxruntime/python/tools/quantization/quant_utils.py`	FP4/FP8 quantization utilities
`setup.py`	Include new pybind module in package build

Tests

File	Change
`onnxruntime/test/python/transformers/test_qmoe_cuda.py`	INT4/INT8 QMoE tests (Phi3 topology, SwiGLU, blockwise, asymmetric)
`onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py`	MXFP4 QMoE tests
`onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py`	FP8 QMoE tests
`onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py`	WFP4AFP8 mixed-precision QMoE tests
`onnxruntime/test/python/transformers/test_moe_cuda.py`	Updated existing MoE tests for refactored infrastructure
`onnxruntime/test/contrib_ops/moe_test.cc`	C++ MoE unit tests updated

Existing MoE Refactor

onnxruntime/contrib_ops/cuda/moe/moe.cc/h — Refactored to share base with QMoE
onnxruntime/contrib_ops/cuda/moe/ft_moe/ → onnxruntime/contrib_ops/cuda/llm/moe_gemm/ — Relocated and rewritten MoE GEMM kernels
Removed old cuda/quantization/moe_quantization.cc/h in favor of new cuda/moe/moe_quantization.cc/h

Testing

INT4/INT8 QMoE: python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py -v (requires CUDA GPU, SM75+)
FP4 QMoE: python -m pytest onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py -v (requires SM120+ for native, falls back on older)
FP8 QMoE: python -m pytest onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py -v (requires SM90+ for native)
WFP4AFP8 QMoE: python -m pytest onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py -v (requires SM100+)
Existing MoE: python -m pytest onnxruntime/test/python/transformers/test_moe_cuda.py -v
C++ MoE tests: Build with CUDA EP enabled, run onnxruntime_test_all --gtest_filter=*MoE*
All tests compare QMoE output against PyTorch reference implementations with configurable tolerance
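
As an illustration of what "configurable tolerance" means in such a comparison, here is a minimal pure-Python sketch of an element-wise closeness check combining absolute and relative tolerance. The helper name and the example tolerances are hypothetical; the actual tests may use different utilities and thresholds per quantization format.

```python
def all_close(actual, expected, rtol, atol):
    """True if |a - e| <= atol + rtol * |e| holds for every element pair."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


if __name__ == "__main__":
    reference = [1.0, 2.0, -0.5]
    quantized = [1.004, 1.99, -0.5]
    # Lower-precision formats generally need looser tolerances.
    print(all_close(quantized, reference, rtol=1e-2, atol=1e-3))
```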

Motivation and Context

Modern LLMs increasingly use Mixture-of-Experts architectures (e.g., Mixtral, DeepSeek, Phi-3.5-MoE) for efficient scaling. These models benefit significantly from weight quantization to reduce memory bandwidth and enable larger models on fewer GPUs. This PR:

Adds native low-precision MoE support — FP4 and FP8 quantized weights avoid the dequantization overhead of INT4/INT8 on supported hardware (Hopper, Blackwell).
Introduces WFP4AFP8 — A novel mixed-precision mode where weights are MXFP4 and activations are dynamically quantized to FP8, enabling 2× weight compression with minimal accuracy loss on Blackwell GPUs.
Refactors MoE GEMM infrastructure — The previous FasterTransformer-derived MoE GEMM code is replaced with a modern CUTLASS 4.x-based dispatch system supporting three kernel families across SM75–SM120+.
Adds auto-tuning — The GEMM profiler enables runtime tactic selection for optimal performance across different expert sizes and batch configurations.
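
The dynamic FP8 activation quantization mentioned for WFP4AFP8 can be sketched in Python as follows. The e4m3fn decoding (bias 7, no infinities, largest finite value 448) is the standard format definition; the per-tensor scaling scheme, the nearest-value rounding by table search, and all function names are assumptions for illustration and are not taken from the PR's kernels.

```python
import math


def decode_e4m3fn(code: int) -> float:
    """Decode one FP8 e4m3fn byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if code & 0x80 else 1.0
    exponent = (code >> 3) & 0xF
    mantissa = code & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan  # e4m3fn has NaN but no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


E4M3FN_MAX = 448.0
# Every finite value representable in e4m3fn.
FINITE_VALUES = sorted(
    {v for v in (decode_e4m3fn(c) for c in range(256)) if not math.isnan(v)}
)


def quantize_dynamic_fp8(activations):
    """Quantize with one scale derived from the tensor's max magnitude.

    Returns (fp8_values, scale) such that fp8_values[i] * scale
    approximates activations[i]. Per-tensor scaling is an assumption.
    """
    amax = max(abs(x) for x in activations)
    scale = amax / E4M3FN_MAX if amax > 0 else 1.0
    quantized = [
        min(FINITE_VALUES, key=lambda v: abs(v - x / scale)) for x in activations
    ]
    return quantized, scale


if __name__ == "__main__":
    values, scale = quantize_dynamic_fp8([1.0, -2.0, 0.5])
    print(values, [v * scale for v in values])
```

Computing the scale from the current activations, rather than from calibration data, is what makes the quantization dynamic: the full e4m3fn range is used for every input.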

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

QMoE CUDA FP4/FP8/WFP4AFP8 + MoE Refactor

17bd084

tianleiwu requested a review from Copilot May 12, 2026 00:25

Copilot AI reviewed May 12, 2026

tianleiwu and others added 5 commits May 12, 2026 12:46

refine

3b1c415

skip test if not enabled in build

3256b22

fix build

2fce3c4

update op doc

bf919f2

Merge remote-tracking branch 'origin/main' into tlwu/20260511/qmoe_cuda

6abe932

tianleiwu marked this pull request as ready for review May 15, 2026 01:08

fix build

b3143e0

tianleiwu force-pushed the tlwu/20260511/qmoe_cuda branch from bbc0af2 to b3143e0 Compare May 17, 2026 17:01

tianleiwu added 5 commits May 17, 2026 14:01

refine build

6b762ca

Do not link cuda in pybind for Windows

6b8dfc3

share source filters between cuda and cuda plugin

34a988f

remove commented code; add license header.

8d5aef5

remove fc3

8f9f41b

tianleiwu marked this pull request as draft May 18, 2026 07:42

tianleiwu added 3 commits May 18, 2026 09:23

remove fc3_global_scale; use float8e8m0; wfp4afp8 blackwell support

d77e10c

K=128 tile support, epilogue fusion, expanded tile configs for SM90 W…

dc60990

…-fp4 A-bf16 kernel

update op docs

16ca767

tianleiwu force-pushed the tlwu/20260511/qmoe_cuda branch from 6c72e8c to 16ca767 Compare May 18, 2026 17:53

github-advanced-security AI found potential problems May 18, 2026

Comment thread onnxruntime/test/python/transformers/test_qmoe_cuda.py Fixed

Comment thread onnxruntime/test/python/transformers/test_qmoe_cuda.py Dismissed

allow test cuda plugin

b945a10

github-advanced-security AI found potential problems May 18, 2026

Comment thread onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py Fixed

Comment thread onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py Fixed

Comment thread onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py Fixed

tianleiwu added 2 commits May 18, 2026 21:00

lintrunner

2580eaa

clean up

aadd7b0

tianleiwu marked this pull request as ready for review May 19, 2026 07:05

tianleiwu requested review from apsonawane and kunal-vaishnavi May 19, 2026 07:13

remove unused code

fb92380

change testing cuda plugin default to be False

fc18a8e

kunal-vaishnavi reviewed May 20, 2026

Comment thread cmake/CMakeLists.txt

kunal-vaishnavi reviewed May 20, 2026

Comment thread docs/contrib_ops/cuda/moe_qmoe.md

kunal-vaishnavi reviewed May 20, 2026

Comment thread docs/contrib_ops/cuda/moe_qmoe.md

kunal-vaishnavi reviewed May 20, 2026

Comment thread onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc

kunal-vaishnavi approved these changes May 20, 2026

apsonawane reviewed May 20, 2026

Comment thread onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu

apsonawane reviewed May 20, 2026

Comment thread onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc

apsonawane reviewed May 20, 2026

Comment thread onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc

apsonawane reviewed May 20, 2026

Comment thread onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc

tianleiwu enabled auto-merge (squash) May 20, 2026 05:56

tianleiwu merged commit 548ab6e into main May 20, 2026
91 of 95 checks passed

tianleiwu deleted the tlwu/20260511/qmoe_cuda branch May 20, 2026 05:56

tianleiwu mentioned this pull request May 20, 2026

QMoE CUDA: Rename build options, refactor PrePack, add GPU kernels #28583

Open

tianleiwu merged 20 commits into main from tlwu/20260511/qmoe_cuda