QMoE CUDA EP — FP4/FP8/WFP4AFP8 Quantized Mixture-of-Experts + MoE GEMM Refactor #28467
Merged
20 commits
- 17bd084 QMoE CUDA FP4/FP8/WFP4AFP8 + MoE Refactor (tianleiwu)
- 3b1c415 refine (tianleiwu)
- 3256b22 skip test if not enabled in build (tianleiwu)
- 2fce3c4 fix build (tianleiwu)
- bf919f2 update op doc (tianleiwu)
- 6abe932 Merge remote-tracking branch 'origin/main' into tlwu/20260511/qmoe_cuda (tianleiwu)
- b3143e0 fix build (tianleiwu)
- 6b762ca refine build (tianleiwu)
- 6b8dfc3 Do not link cuda in pybind for Windows (tianleiwu)
- 34a988f share source filters between cuda and cuda plugin (tianleiwu)
- 8d5aef5 remove commented code; add license header. (tianleiwu)
- 8f9f41b remove fc3 (tianleiwu)
- d77e10c remove fc3_global_scale; use float8e8m0; wfp4afp8 blackwell support (tianleiwu)
- dc60990 K=128 tile support, epilogue fusion, expanded tile configs for SM90 W… (tianleiwu)
- 16ca767 update op docs (tianleiwu)
- b945a10 allow test cuda plugin (tianleiwu)
- 2580eaa lintrunner (tianleiwu)
- aadd7b0 clean up (tianleiwu)
- fb92380 remove unused code (tianleiwu)
- fc18a8e change testing cuda plugin default to be False (tianleiwu)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| # Copyright (c) Microsoft Corporation. All rights reserved. | ||
| # Licensed under the MIT License. | ||
| | ||
| # Shared filtering logic for CUDA contrib ops .cu source lists. | ||
| # Both the main CUDA provider and the plugin EP build use identical filtering | ||
| # rules for flash attention (quick build) and MoE GEMM FP4/FP8 kernels. | ||
| # | ||
| # Usage: | ||
| # onnxruntime_filter_cuda_cu_sources(<list_variable_name>) | ||
| # | ||
| # The macro modifies the named list variable in the caller's scope. | ||
| | ||
| macro(onnxruntime_filter_cuda_cu_sources CU_SRC_LIST) | ||
| # Quick build mode: Filter flash attention kernels for faster development iteration. | ||
| # - We keep only hdim128 fp16 flash attention kernels in quick build mode. | ||
| # - All other listed head dimensions are excluded (e.g., 32, 64, 96, 192, 256). | ||
| # If new head dimensions are added or removed, update this list to match the supported set. | ||
| if(onnxruntime_QUICK_BUILD) | ||
| message(STATUS "Quick build mode enabled: Only building hdim128 fp16 flash attention kernels") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "flash_fwd.*hdim(32|64|96|192|256)") | ||
| endif() | ||
| | ||
| if(NOT onnxruntime_ENABLE_CUDA_FP4_QMOE) | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_tma_ws_sm90_fp4_.*\\.generated\\.cu") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_tma_ws_sm120_fp4_.*\\.generated\\.cu") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_tma_ws_sm120_fp8_fp4\\.generated\\.cu") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_kernels_(fp16|bf16)_fp4\\.cu") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_kernels_fp8_fp4\\.cu") | ||
| else() | ||
| # CUDA 13 PTXAS does not complete the FP4 M=128/N=64 pingpong specializations in | ||
| # this build configuration. The dispatcher routes that tile through cooperative | ||
| # mainloop variants instead, so exclude only those unused generated units. | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_tma_ws_sm90_fp4_(fp16|bf16)_m128_n64_k[0-9]+_cm[12]_cn[12]_pp(_finalize)?\\.generated\\.cu") | ||
| endif() | ||
| | ||
| if(NOT onnxruntime_ENABLE_CUDA_FP8_QMOE) | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_tma_ws_sm90_wfp8_.*\\.generated\\.cu") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_tma_ws_sm120_fp4_fp8_.*\\.generated\\.cu") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_tma_ws_sm120_fp8_fp4\\.generated\\.cu") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_kernels_(fp16|bf16)_fp8\\.cu") | ||
| list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "moe_gemm_kernels_fp8_fp4\\.cu") | ||
| endif() | ||
| endmacro() |
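
The macro in the diff takes the *name* of a list variable and filters it in place. Because CMake macros substitute arguments textually, `${CU_SRC_LIST}` expands to that name before `list(FILTER ...)` runs. The script below is a minimal sketch of that mechanism, not the project's build code. `demo_filter_cu_sources`, `demo_cu_srcs`, and every file name in it are hypothetical and chosen only to exercise the quick-build pattern copied from the diff. Assuming it is saved as `filter_demo.cmake` (also a hypothetical name), it can be run with `cmake -P filter_demo.cmake`.

```cmake
cmake_minimum_required(VERSION 3.6)  # list(FILTER ...) was added in CMake 3.6

# Hypothetical stand-in for the macro in the diff, reduced to the quick-build rule.
# CU_SRC_LIST receives a variable *name*; ${CU_SRC_LIST} expands to that name,
# so list(FILTER) edits the caller's list directly.
macro(demo_filter_cu_sources CU_SRC_LIST)
  list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "flash_fwd.*hdim(32|64|96|192|256)")
endmacro()

# Hypothetical source names; they are not taken from the repository.
set(demo_cu_srcs
  flash_fwd_hdim64_fp16_sm80.cu
  flash_fwd_hdim128_fp16_sm80.cu
  flash_fwd_hdim128_bf16_sm80.cu
  flash_fwd_hdim256_bf16_sm80.cu
  some_other_kernel.cu)

# Pass the variable name, not its value.
demo_filter_cu_sources(demo_cu_srcs)

# Only the hdim64 and hdim256 entries are dropped.
foreach(src IN LISTS demo_cu_srcs)
  message(STATUS "kept: ${src}")
endforeach()
```

With these inputs, the three entries that remain are the two `hdim128` files and `some_other_kernel.cu`. The pattern matches on head dimension only, so in this sketch the hypothetical `hdim128` bf16 file is kept alongside the fp16 one.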