[WebGPU] QKV and MLP layer fusions for Qwen3-style models #28280
hariharans29 wants to merge 32 commits into …
Conversation
…xruntime into hari/webgpu_perf_1
Pull request overview
This PR adds WebGPU-focused fused operators and optimizer passes for decoder-style MatMulNBits patterns (MLP gate/up and QKV projections), along with tests and a microbenchmark to evaluate decode performance/correctness.
Changes:
- Introduces new contrib ops MatMulNBitsMlp and MatMulNBitsQkv (schemas + WebGPU kernels + WGSL templates).
- Adds graph transformers MatMulNBitsMlpFusion / MatMulNBitsQkvFusion and corresponding optimizer tests.
- Improves WebGPU runtime support (graph-capture buffer manager activation, queue-idle wait helper, better shader compilation diagnostics) and adds a decode microbenchmark.
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| onnxruntime/test/optimizer/matmul_nbits_qkv_fusion_test.cc | New unit tests validating QKV fusion and output contracts on WebGPU. |
| onnxruntime/test/optimizer/matmul_nbits_mlp_fusion_test.cc | New unit tests validating MLP fusion (simplified/skip + passthrough) on WebGPU. |
| onnxruntime/test/optimizer/graph_transform_utils_test.cc | Minor formatting-only tweak (blank line). |
| onnxruntime/test/onnx/microbenchmark/webgpu_matmul_nbits_decode.cc | New benchmark harness for fused/unfused decode paths on WebGPU. |
| onnxruntime/test/onnx/microbenchmark/main.cc | Adjusts benchmark env logging severity. |
| onnxruntime/core/session/ort_version_check.h | Makes version parsing consteval-friendly with a macro fallback. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.h | Tracks when graph-capture buffer manager is active. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Lazily creates/activates graph buffer manager for capture; allocator uses dynamic buffer manager getter. |
| onnxruntime/core/providers/webgpu/webgpu_context.h | Adds WaitForQueueIdle() declaration. |
| onnxruntime/core/providers/webgpu/webgpu_context.cc | Implements WaitForQueueIdle() using OnSubmittedWorkDone. |
| onnxruntime/core/providers/webgpu/program_manager.cc | Enhances pipeline build failures with shader compilation diagnostics. |
| onnxruntime/core/providers/webgpu/compute_context.h | Adds FlushAndWait() convenience for flushing + waiting on queue idle. |
| onnxruntime/core/providers/webgpu/allocator.h | Adds allocator ctor that accepts a buffer-manager getter function. |
| onnxruntime/core/providers/webgpu/allocator.cc | Implements getter-based allocator to support switching buffer managers. |
| onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.h | New transformer declaration for QKV fusion. |
| onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc | New transformer implementation for QKV fusion. |
| onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.h | New transformer declaration for MLP fusion. |
| onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.cc | New transformer implementation for MLP fusion. |
| onnxruntime/core/optimizer/graph_transformer_utils.cc | Registers the new fusion transformers. |
| onnxruntime/core/graph/contrib_ops/contrib_defs.cc | Adds contrib operator schemas/docs for MatMulNBitsMlp and MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/webgpu_contrib_kernels.cc | Registers WebGPU kernels for the new fused ops. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.wgsl.template | New WGSL template implementing fused QKV decode kernel. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.h | New WebGPU kernel wrapper for MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.cc | New WebGPU kernel implementation for MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp_wide_tile_m1.wgsl.template | New WGSL template for an MLP wide-tile variant. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.wgsl.template | New WGSL template implementing fused MLP (optionally with norm/skip/passthrough). |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.h | New WebGPU kernel wrapper for MatMulNBitsMlp. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.cc | New WebGPU kernel implementation for MatMulNBitsMlp. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_common.h | Adds declarations for “would apply” dispatch-selection helpers and shared constants. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_common.cc | Implements the new dispatch-selection helpers. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc | Refactors path selection to use the new “would apply” helpers. |
| onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_mlp.wgsl.template | Adds WGSL template for DP4A MLP path. |
| cmake/onnxruntime_unittests.cmake | Wires the new WebGPU decode benchmark into the benchmark target sources. |
…shader diagnostics
These changes are kept on hari/webgpu_perf_1_full locally. The lazy buffer-manager fix is being submitted as a separate PR (branch hari/webgpu_graph_capture_buffer_fix) because it is an independent correctness fix for a pre-existing latent bug, exposed but not introduced by these fusions.
This template file was added speculatively but is not referenced by any kernel, include, or build rule. Removing to keep the PR clean.
…_transformer_utils
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
The shared-EP path through TransformerTester triggers a SEH 0xC0000005 in CI when the EP outlives a per-session profiler whose pointer is still cached on the EP. A separate fix to the WebGPU EP's session_profiler_ lifetime is in flight; meanwhile, switch the 8 MatMulNBits MLP and QKV WebGPU fusion-vs-unfused tests to a small RunWebGpuFusionTransformerTest helper that creates a fresh execution provider per session via a factory lambda. Production code is unchanged.
qjia7 left a comment
File: onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.cc, lines 33-127
ApplySimplifiedLayerNorm (lines 33-75) and ApplySkipSimplifiedLayerNorm (lines 77-127) duplicate the dispatch logic from LayerNormProgram in core/providers/webgpu/nn/layer_norm.cc and SkipLayerNormProgram in contrib_ops/webgpu/bert/skip_layer_norm.cc respectively.
The split condition (norm_size % 512 == 0 && norm_count == 1), workgroup sizing (workgroup_size_x = 128), uniform variable layout, and component handling are all replicated. If any of these change in the original kernels (e.g., workgroup size tuning, a new split threshold, or additional uniform variables), the copies here will silently diverge.
Consider extracting these as reusable utility functions in skip_layer_norm.h / layer_norm.h and calling them from here, rather than duplicating the setup logic.
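To make the shared decision points concrete, here is a minimal sketch of the kind of helper such an extraction could expose; the names and scoping below are illustrative, not the actual ORT code:

```cpp
#include <cstdint>

// Illustrative only: captures the two replicated decisions called out above
// (split condition and workgroup sizing), not the real layer_norm.h helpers.
namespace layer_norm_dispatch {

constexpr uint32_t kWorkgroupSizeX = 128;  // the replicated workgroup sizing

// The replicated split condition: take the "split" variant only when there is
// a single norm row whose size is a multiple of 512.
inline bool UseSplitVariant(int64_t norm_size, int64_t norm_count) {
  return norm_size % 512 == 0 && norm_count == 1;
}

}  // namespace layer_norm_dispatch
```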
qjia7 left a comment
For MLP fusion, how do you think we extend the current MatMulNBits to support matmulnbits + activation + mul? Then for prefill, just call two matmulnbits.
And in the future, maybe we can optimize builder.py to directly generate the new fused MatMulNbitsMLP op in the ONNX model.
For QKV fusion, microsoft/onnxruntime-genai#2137 resolves the QKV packing issue in builder.py. Do we still need dynamic fusion in the code after this fix?
The GenAI model builder QKV fusion does: 3 MatMulNBits -> 1 large QKV MatMulNBits + 1 Split. This ORT dynamic fusion does: 1 SLN + 3 MatMulNBits -> 1 QKV MatMulNBits (and produces the skip values). I can update my fusion logic to understand both the "unfused" Qwen QKV (1 SLN + 3 MatMulNBits) and the new fused nodes (1 SLN + 1 MatMulNBits + 1 Split). Would that be okay?
I am fine with the current status. You can add 1 SLN + 1 MatMulNBits + 1 Split support in follow-up PRs. For the MLP one, can you also extend it to support Gemma models (gate+up, GELU-gated)? Maybe we also need to fuse Cast into it (which is used to improve quality, microsoft/onnxruntime-genai#1448), since I see the subgraph in Gemma3 is like below:
                              const Node& sigmoid,
                              const Node& silu_mul,
                              const Node& final_mul) {
  if (!IsMatMulNBitsWithoutZeroPointOrGroupIdx(gate_matmul) || !IsMatMulNBitsWithoutZeroPointOrGroupIdx(up_matmul) ||
fyi, with #28410, you may also need to support QuickGelu.
Addressed
Can I add support for the Gemma3 pattern in a future PR? In this PR, I added "extensibility" support for the activation (we can support multiple gated activations in the future), but this PR will only support SiLU (for the Qwen use-case). In a future PR, we can update the dynamic fusion to support the Gemma3 pattern. If you are okay, I will open 2 issues in the repo as follow-ups for me:
Re: "For MLP fusion, how do you think we extend the current matmulnits to support matmulnbits + activation + mul? Then for prefill, just call two matmulnbits. And in future, maybe we can optimize builder.py to directly generate the new fused MatMulNbitsMLP op in onnx model." Is this comment still valid ? Is this an approach you'd like me to explore ? |
Refactor in response to PR review feedback. The MatMulNBits MLP and QKV
fusion kernels previously each carried their own private copies of the
SimplifiedLayerNormalization and SkipSimplifiedLayerNormalization
program launchers (`GetOverrideShape` + `ApplySimplifiedLayerNorm` +
`ApplySkipSimplifiedLayerNorm`). Extract these into reusable helpers
exposed by the existing LayerNorm / SkipLayerNorm kernel sources so
fused kernels can drop the duplication.
* core/providers/webgpu/nn/layer_norm.{h,cc}:
- Expose `RunLayerNormProgram(...)` so other kernels can launch the
simplified layer-norm program with consistent uniforms / shape
overrides.
* contrib_ops/webgpu/bert/skip_layer_norm.{h,cc}:
- Expose `RunSkipLayerNormProgram(...)` mirroring the same shape for
the SkipSimplifiedLayerNormalization variant.
* contrib_ops/webgpu/quantization/matmul_nbits_qkv.cc:
- Adopt the shared helpers and delete the local copies. No behavior
change; emitted WGSL and dispatch are byte-identical.
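A hedged sketch of roughly what the exposed launchers could look like; the actual parameter lists in layer_norm.h / skip_layer_norm.h are not shown in this PR excerpt, so the signatures below are assumptions (only the function names come from the commit):

```cpp
// Assumed signatures only -- the real declarations live in
// core/providers/webgpu/nn/layer_norm.h and contrib_ops/webgpu/bert/skip_layer_norm.h.
namespace onnxruntime {
class Tensor;                         // forward declarations standing in for the real ORT types
namespace common { class Status; }
namespace webgpu {
class ComputeContext;

// Launches the SimplifiedLayerNormalization program with the shared uniform
// layout and shape overrides (previously copied into the fused MLP/QKV kernels).
common::Status RunLayerNormProgram(ComputeContext& context,
                                   const Tensor& input,
                                   const Tensor& scale,
                                   Tensor& output,
                                   float epsilon);

// Same idea for SkipSimplifiedLayerNormalization, which additionally consumes
// the skip (and optional bias) inputs.
common::Status RunSkipLayerNormProgram(ComputeContext& context,
                                       const Tensor& input,
                                       const Tensor& skip,
                                       const Tensor& scale,
                                       Tensor& output,
                                       float epsilon);

}  // namespace webgpu
}  // namespace onnxruntime
```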
Two coupled cleanups to the MatMulNBitsMlp kernel, kept together because
they touch the same file:
1. Adopt the shared `RunLayerNormProgram` / `RunSkipLayerNormProgram`
helpers introduced in the prior commit. Deletes the local copies of
`GetOverrideShape`, `ApplySimplifiedLayerNorm`, and
`ApplySkipSimplifiedLayerNorm`. No behavior change.
2. Introduce a small `MlpActivationKind` enum so the kernel can later
gain GELU / GELU+Cast support (e.g. for Gemma-style MLPs) without
reshaping the call paths or schema. Today the enum has a single
value, `Silu = 0`, and the emitted WGSL is byte-identical to before.
* matmul_nbits_mlp.h:
- Add `MlpActivationKind` enum and `ParseMlpActivation()`. Kernel
stores the parsed kind in `activation_kind_`.
* matmul_nbits_mlp.cc:
- Thread `MlpActivationKind` through `MatMulNBitsMlpProgram`,
`MatMulNBitsMlpDecodeProgram`, and `ApplyUnfusedMlp`. Include
the kind in each program's CacheHint.
- Add `EmitGateActivationExpr()` so the inline kernel emits the
activation expression via a single helper; today returns the
SiLU expression.
- While here, collapse the four identical
`WGSL_TEMPLATE_APPLY` branches in
`MatMulNBitsMlpDecodeProgram::GenerateShaderCode` into one call.
Roughly 120 lines removed; emitted WGSL unchanged.
* matmul_nbits_mlp.wgsl.template:
- Add `#param activation_kind`. Wrap SiLU emission in
`#if activation_kind == 0 ... #endif` and produce the activated
value through a single `activated_value` binding so additional
activations can be added with a new `#elif` branch.
The schema already declares `activation` as a generic `STRING`, so no
schema change is required.
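A hedged sketch of the extensibility point this commit describes. The attribute spelling accepted by `ParseMlpActivation` and the exact WGSL string emitted for SiLU are assumptions; only the enum, function, and member names come from the commit message:

```cpp
#include <stdexcept>
#include <string>

// Sketch of the activation extensibility point; details may differ from
// matmul_nbits_mlp.{h,cc}.
enum class MlpActivationKind : int {
  Silu = 0,  // the only value supported in this PR (Qwen-style SwiGLU gate)
  // GELU / GELU+Cast variants (e.g. for Gemma-style MLPs) can be added later.
};

// Maps the schema's generic STRING `activation` attribute to the enum.
// The accepted spellings here are assumptions.
inline MlpActivationKind ParseMlpActivation(const std::string& activation) {
  if (activation.empty() || activation == "silu") return MlpActivationKind::Silu;
  throw std::invalid_argument("unsupported MLP activation: " + activation);
}

// Returns the WGSL expression applied to the gate projection; adding a new
// activation means adding a case here and an #elif branch in the template.
inline std::string EmitGateActivationExpr(MlpActivationKind kind, const std::string& gate) {
  switch (kind) {
    case MlpActivationKind::Silu:
      return gate + " * (1.0 / (1.0 + exp(-" + gate + ")))";  // SiLU: x * sigmoid(x)
  }
  return gate;  // unreachable while the enum has a single value
}
```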
After QuickGeluFusion is enabled for the WebGPU EP (upstream PR #28410), the SwiGLU gate subgraph `gate * Sigmoid(gate)` is collapsed into a single `com.microsoft::QuickGelu(gate, alpha=1.0)` node before MatMulNBitsMlpFusion runs. Without this change, the MLP fusion would silently stop firing for Qwen3 / Llama / Phi style WebGPU models.
* core/optimizer/matmul_nbits_mlp_fusion.cc:
  - Recognize the QuickGelu-decomposed shape gate_matmul -> com.microsoft::QuickGelu(alpha=1.0) -> final_mul in addition to the existing Sigmoid+Mul shape. Validates QuickGelu's `alpha == 1.0` (SiLU-equivalent).
  - Factor common pair validation into `ValidateMatMulNBitsPair` and keep shape-specific checks in `IsFuseCandidateSilu` and `IsFuseCandidateQuickGelu`.
  - Restructure the main matching loop to dispatch on which shape was found and track the intermediate nodes to remove in a small vector, so the node-removal block stays uniform across shapes.
* core/providers/webgpu/math/unary_elementwise_ops.h:
  - Fix the `QuickGeluImpl` WGSL shader for fp16 by wrapping `1.0`, `0.0`, and `uniforms.attr` in `x_element_t(...)` casts. Without this, pipeline creation fails on fp16 models with `Invalid ShaderModule "QuickGelu"`. Matches the fix in PR #28410 so the in-tree build can run QuickGelu on fp16 models immediately rather than waiting on that PR to land.
* test/optimizer/matmul_nbits_mlp_fusion_test.cc:
  - Add unit coverage mirroring the existing SiLU tests. Introduces an `ActivationShape` enum and parameterizes the existing test-pattern builder. The graph-shape checkers now also assert zero `com.microsoft.QuickGelu` nodes after fusion. Adds four tests:
    * Fusion only (Simplified-LN anchor)
    * Fusion only (Skip-Simplified-LN anchor)
    * Fused vs unfused correctness on WebGPU (Simplified-LN)
    * Fused vs unfused correctness on WebGPU (Skip-Simplified-LN)
  - Correctness tests use a slightly looser 5e-3 tolerance because the by-cases sigmoid in the QuickGelu shader produces marginally different fp16 rounding than the fused kernel's direct SiLU evaluation; the two are mathematically equivalent.
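For context on the `alpha == 1.0` check: QuickGelu computes `x * sigmoid(alpha * x)`, so alpha = 1 makes it exactly SiLU and the two matched shapes describe the same gate function:

```latex
\mathrm{QuickGelu}(x;\alpha) = x \cdot \sigma(\alpha x), \qquad
\mathrm{QuickGelu}(x;\,\alpha{=}1) = x \cdot \sigma(x) = \mathrm{SiLU}(x)
```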
Widen `QuickGeluFusion`'s compatible-EP set from `cpu_acl_cuda_dml_eps` to `cpu_acl_cuda_dml_js_webgpu_eps` so the `x * Sigmoid(x)` SwiGLU gate pattern is folded into a single `com.microsoft::QuickGelu` node on WebGPU and JSEP models. Without this, the QuickGelu match branch added to `MatMulNBitsMlpFusion` in the prior commit is unreachable on real WebGPU models, and the `QuickGelu` fp16 shader fix in `unary_elementwise_ops.h` cannot be exercised end-to-end. Mirrors upstream PR #28410 (registers `QuickGeluFusion` for WebGPU/JSEP and fixes the `QuickGelu` fp16 shader). This commit is expected to be redundant once #28410 lands; rebase will drop it cleanly.


Description
Summary
Adds two WebGPU-only graph fusions and the contrib ops they target, plus a small refactor of the existing MatMulNBits dispatch logic so the new fused kernels can share its predicates.
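For reference, a minimal sketch of the math the two fused ops cover, per the change list below; the weight/bias symbols are generic names (not identifiers from the PR) and the biases are optional:

```latex
% x_hat is the (Skip)SimplifiedLayerNorm output shared by both fused ops.
\hat{x} = \mathrm{(Skip)SimplifiedLayerNorm}(x)
% MatMulNBitsMlp: SwiGLU-style gate/up projections in a single dispatch.
y_{\mathrm{mlp}} = \mathrm{SiLU}\!\left(\hat{x} W_{\mathrm{gate}} + b_{\mathrm{gate}}\right) \odot \left(\hat{x} W_{\mathrm{up}} + b_{\mathrm{up}}\right)
% MatMulNBitsQkv: Q/K/V projections sharing the same normalized input.
\left[\,q,\; k,\; v\,\right] = \left[\,\hat{x} W_q,\; \hat{x} W_k,\; \hat{x} W_v\,\right]
```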
- MatMulNBitsMlp op + kernel: contrib_ops/webgpu/quantization/matmul_nbits_mlp.{cc,h}, *.wgsl.template (3). Fuses (Skip)SimplifiedLayerNormalization + two MatMulNBits projections (gate, up) + optional biases + Sigmoid/Mul (SiLU) + element-wise Mul. Single dispatch instead of 5–7.
- MatMulNBitsQkv op + kernel: contrib_ops/webgpu/quantization/matmul_nbits_qkv.{cc,h}, *.wgsl.template. Fuses (Skip)SimplifiedLayerNormalization + three MatMulNBits projections (Q, K, V) sharing the same input. Single dispatch instead of 4.
- core/graph/contrib_ops/contrib_defs.cc: MatMulNBitsMlp and MatMulNBitsQkv contrib op schemas (kMSDomain, opset 1).
- core/optimizer/matmul_nbits_{mlp,qkv}_fusion.{cc,h}: new fusion transformers, registered in graph_transformer_utils.cc.
- contrib_ops/webgpu/quantization/matmul_nbits_common.{cc,h} + matmul_nbits.cc: dispatch-selection predicates shared with the existing MatMulNBits path.
- test/optimizer/matmul_nbits_{mlp,qkv}_fusion_test.cc, graph_transform_utils_test.cc: new/updated optimizer tests.
Motivation and Context
~25–30% decode TPS (throughput) improvement on the WebGPU + D3D backend on Windows. GPU used: RTX 5060 Ti, model: Qwen3-1.7B.
BEFORE (95 decode TPS): main branch

AFTER (120+ decode TPS): PR branch
