[RISC-V] Add RVV INT8 GEMM/GEMV, M=1 routing, and activation kernels (follow-up #28261) #28308
Open
qiurui144 wants to merge 8 commits into microsoft:main from
Add onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp, a standard-RVV (baseline V extension, VLEN>=128, dynamic vsetvli) INT8 GEMM kernel using the vwmulu.vv + vwaddu.wv widening pattern. Works for any VLEN without rebuild.

Wired into the existing RISCV64 RVV build block introduced by microsoft#28261:
- cmake/onnxruntime_mlas.cmake: append qgemm_kernel_rvv.cpp to the if(HAS_RISCV64_RVV) source list (additive, no new block).
- qgemm.h: add an MLAS_TARGET_RISCV64 dispatch branch that selects MlasGemmU8S8DispatchRvv for all four (A,B) signedness combinations, matching the inline-extern style used by the ARM64EC / WASM_SIMD / S390X branches above it.

Measured on K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50 stays at 89ms (unchanged from upstream main; no regression).

Signed-off-by: qiurui144 <happyqiurui@163.com>
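The widening pattern named in this commit can be modelled in portable scalar C++. The sketch below is not the RVV kernel: `DotU8WideningModel` is a hypothetical helper that mirrors, per element, what a `vwmulu.vv` (u8 x u8 -> u16) followed by a `vwaddu.wv` (u32 += u16) would compute, assuming unsigned operands after any signedness fixup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the widening multiply-accumulate pattern.
// It is not an MLAS symbol; it only illustrates the arithmetic per lane.
inline uint32_t DotU8WideningModel(const uint8_t* a, const uint8_t* b, size_t k) {
    assert(a != nullptr && b != nullptr);
    uint32_t acc = 0;
    for (size_t i = 0; i < k; ++i) {
        // Widening multiply: 255 * 255 = 65025 fits in 16 bits, so the
        // u16 product cannot overflow (models vwmulu.vv).
        const uint16_t prod =
            static_cast<uint16_t>(static_cast<uint32_t>(a[i]) * b[i]);
        // Widening add: the u16 product is added into a u32 accumulator
        // (models vwaddu.wv).
        acc += prod;
    }
    return acc;
}
```

Keeping the product in 16 bits before widening is what lets the vector version process twice as many lanes per multiply as a direct 32-bit multiply would.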
The existing M=1 fast path in MlasGemmBatch already routes to MlasGemvFloatKernel() for ARM64/WASM when TransB == CblasNoTrans; extend the '#elif' guard in sgemm.cpp to include RISCV64 and the forward declaration in mlasi.h alongside the ARM64/WASM branch. Symbol is provided by the existing scalar/SgemvKernelScalar.cpp, which compiles cleanly under rv-gcc 15.2 with autovectorisation.

Hand-written RVV intrinsics for this kernel were evaluated and discarded: a 4x K-unrolled LMUL=m4 implementation produced 2.42 GFLOPS at K=N=768 vs 6.85 GFLOPS for the rv-gcc 15.2 autovec output of the scalar source on the same hardware (SpacemiT X100, VLEN=256). The dual-issue in-order pipeline cannot hide the dependency chain between vle32.v + vfmacc.vf reusing the same m4 register file; autovec's narrower LMUL=m1 with independent vector regs per iteration sustains higher utilisation.

This matches existing CLAUDE.md/contrib guidance to prefer portable implementations unless hand-written intrinsics show >=1.3x kernel speedup.

Signed-off-by: qiurui144 <happyqiurui@163.com>
Add an RVV-vectorised activation/compute family at onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp covering Erf, Tanh, Logistic, ComputeExpF32, Silu and GeluErf.

Wired into the dispatch framework introduced by microsoft#28261:
- mlasi.h: extend the existing `MLAS_TARGET_RISCV64 && MLAS_USE_RVV` kernel-decl block with the six new symbols (Erf, Logistic, GeluErf, Silu, Tanh, ComputeExpF32), and add four MLAS_PLATFORM dispatch fields (GeluErfKernelRoutine, SiluKernelRoutine, TanhKernelRoutine, ComputeExpF32Kernel) under a RISCV64-only block.
- platform.cpp: in the RISCV64 init block, default-assign the four new fields to the upstream scalar kernels and override them with the RVV variants inside the existing `if (has_rvv)` gate.
- erf.cpp / logistic.cpp / tanh.cpp / compute.cpp / gelu.cpp / silu.cpp: extend the dispatch-site `#if defined(MLAS_TARGET_AMD64) || ...` guard to include `MLAS_TARGET_RISCV64`.
- cmake/onnxruntime_mlas.cmake: append activation_kernel_rvv.cpp to the `if(HAS_RISCV64_RVV)` source list (additive, no new block).

Kernel strategy: LMUL=m4 throughout (32 floats per vector at VLEN=256, scales with VLEN via dynamic vsetvli). exp uses Cody-Waite range reduction + 6th-order minimax polynomial; erf/gelu use the Abramowitz & Stegun 5-term approximation (max ~2.5e-5 abs error). Silu fuses `x * sigmoid(x)` in a single pass to halve memory traffic.

Signed-off-by: qiurui144 <happyqiurui@163.com>
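For readers unfamiliar with the approximations mentioned in the kernel strategy, here is a scalar sketch. It assumes the five-term formula is Abramowitz & Stegun 7.1.26 with its textbook coefficients; whether the PR uses exactly these constants is not stated in the commit message. `ErfApproxAS` and `SiluFused` are hypothetical names, and `std::exp` stands in for the PR's own polynomial exp.

```cpp
#include <cmath>

// Abramowitz & Stegun 7.1.26: erf(x) ~= 1 - (a1*t + ... + a5*t^5) * exp(-x^2),
// with t = 1 / (1 + p*|x|). Coefficients are the published textbook values.
inline float ErfApproxAS(float x) {
    const float sign = (x < 0.0f) ? -1.0f : 1.0f;
    const float ax = std::fabs(x);
    const float t = 1.0f / (1.0f + 0.3275911f * ax);
    // Horner evaluation of the five-term polynomial in t.
    const float poly =
        t * (0.254829592f +
        t * (-0.284496736f +
        t * (1.421413741f +
        t * (-1.453152027f +
        t * 1.061405429f))));
    // erf is odd, so the result for negative x is mirrored via sign.
    return sign * (1.0f - poly * std::exp(-ax * ax));
}

// SiLU computed in one expression: x * sigmoid(x) == x / (1 + exp(-x)).
// A vector kernel doing this per lane reads and writes each element once.
inline float SiluFused(float x) {
    return x / (1.0f + std::exp(-x));
}
```

The single-pass form is what the commit message means by fusing: a two-pass version would write sigmoid(x) to memory and read it back for the multiply.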
Specialize MlasGemmQuantTryGemvKernel for MLAS_GEMM_QUANT_KERNEL_RVV
with a hand-written RVV kernel covering the U8 x S8 case, mirroring
the existing AVX2 specialization in qgemm_kernel_avx2.cpp. The default
template (qgemm.h) returns false; this overrides it so M=1 INT8 GEMV
calls bypass the PackA/PackB pipeline.
Strategy: K-outer, N-tiled with LMUL=m4 e32 i32 accumulator. For each
K iteration broadcast A[k] as i16 (zero-extended from u8), sign-extend
B[k*ldb + n..n+vl) from i8 to i16, then widen-mul-accumulate into the
i32 acc with vwmacc.vx. Acts only on the U8 x S8 combination; all other
signedness combinations return false and fall through to the packed
kernel as today.
Microbench on K3 (SpacemiT X100, VLEN=256, 1 thread):
K=N=384: scalar 1.45 autovec 2.34 RVV m4 16.68 GOPS (7.1x autovec)
K=N=768: scalar 1.46 autovec 2.34 RVV m4 6.17 GOPS (2.6x autovec)
K=N=4096: scalar 1.46 autovec 2.33 RVV m4 1.92 GOPS (memory-bound,
no benefit)
Numerical correctness: zero diff against scalar reference across
random U8 x S8 inputs at K=N in {384, 768, 4096}.
The kernel is invoked from MlasGemmQuantOperation only when
RangeCountM == 1 and zero-points are zero (existing qgemm.h gate),
so encoder workloads such as BERT seq=128 do not exercise it. The
target use cases match AVX2's MlasGemvU8S8KernelAvx2: dynamic-quant
LLM decode, tiny-batch inference, and any M=1 dispatch site that
would otherwise pay the PackB cost on a one-row matmul.
Signed-off-by: qiurui144 <happyqiurui@163.com>
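The loop structure described in this commit message can be sketched in scalar C++. This is a model under assumptions, not the PR's code: `GemvU8S8Model` is a hypothetical name, `kTile` stands in for the `vl` that `vsetvli` would return for an e32/m4 accumulator at VLEN=256, and zero-points are taken to be zero as the gate in the message requires.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of an M=1 U8 x S8 GEMV: C[n] = sum_k A[k] * B[k][n].
inline void GemvU8S8Model(const uint8_t* A, const int8_t* B, int32_t* C,
                          size_t K, size_t N, size_t ldb) {
    assert(ldb >= N);
    constexpr size_t kTile = 32;  // assumed lane count: e32, LMUL=m4, VLEN=256
    for (size_t n0 = 0; n0 < N; n0 += kTile) {
        const size_t vl = std::min(kTile, N - n0);  // tail tile may be shorter
        int32_t acc[kTile] = {};                    // i32 accumulator tile
        for (size_t k = 0; k < K; ++k) {
            // Broadcast A[k], zero-extended from u8 to i16.
            const int16_t a = static_cast<int16_t>(A[k]);
            const int8_t* b_row = B + k * ldb + n0;
            for (size_t j = 0; j < vl; ++j) {
                // Sign-extend B from i8 to i16, then widen-multiply-accumulate
                // into i32 (the role played by vwmacc.vx in the message).
                const int16_t b = static_cast<int16_t>(b_row[j]);
                acc[j] += static_cast<int32_t>(a) * b;
            }
        }
        std::copy(acc, acc + vl, C + n0);
    }
}
```

Because B is read row by row in its original layout, no PackB step is needed, which is the cost the fast path is meant to avoid for a one-row matmul.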
Author
@hariharans29, based on PR #28261, I added some patches for RVV improvements. Thanks for your review.
Contributor
Pull request overview
Adds additional RISC-V RVV-optimized MLAS kernels (INT8 quantized GEMM/GEMV and several FP32 activations) and wires them into the existing riscv64 RVV runtime dispatch path to improve performance on vector-capable hardware.
Changes:
- Add RVV INT8 QGEMM kernel (including an M=1 U8×S8 GEMV fast-path specialization).
- Add RVV unary activation kernels (Erf/Tanh/Logistic/Exp/SiLU/GELU-erf) and route existing activation entrypoints through MLAS platform dispatch on riscv64.
- Extend SGEMM M=1 routing to use the existing GEMV float kernel on riscv64 and update build wiring to compile new RVV sources.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| cmake/onnxruntime_mlas.cmake | Adds new riscv64 RVV sources to the MLAS build under HAS_RISCV64_RVV. |
| onnxruntime/core/mlas/lib/platform.cpp | Initializes riscv64 quant dispatch + activation routine pointers; switches to RVV implementations when available at runtime. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares new RVV kernel symbols and adds riscv64 platform fields for quant dispatch and activations. |
| onnxruntime/core/mlas/lib/qgemm.h | Adds riscv64 4-signedness dispatch selection in MlasGemmQuantGetDispatch. |
| onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp | New RVV INT8 QGEMM kernel + U8×S8 GEMV M=1 fast path and RVV packing specializations. |
| onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp | New RVV FP32 activation kernels and a shared exp approximation. |
| onnxruntime/core/mlas/lib/sgemm.cpp | Enables the M=1 GEMV routing path for riscv64 when TransB == NoTrans. |
| onnxruntime/core/mlas/lib/{erf.cpp,logistic.cpp,tanh.cpp,gelu.cpp,silu.cpp,compute.cpp} | Routes riscv64 activation entrypoints through platform dispatch to enable RVV overrides. |
qiurui144 added a commit to qiurui144/onnxruntime that referenced this pull request May 4, 2026
Squashed 4 fixes per AI audit (>13M-point precision verify on K3 X100).

Files touched:
- mlas/lib/riscv64/qgemm_kernel_rvv.cpp: drop dead MlasGemmU8S8DispatchRvvPtr; instantiate four signedness-specific dispatch structs via macro
- mlas/lib/mlasi.h: declare U8U8/S8S8/S8U8 dispatch externs
- mlas/lib/platform.cpp: wire each MLAS_PLATFORM signedness slot to its self-describing symbol (mirrors AVX2Vnni convention)
- mlas/lib/riscv64/activation_kernel_rvv.cpp: tighten EXP_CLAMP_MAX 88.722839f -> 87.0f to keep n*LOG2E rounding < 128 (was producing +Inf at boundary); add internal defensive clamp in exp_f32m4 so callers cannot trigger Inf via large unbounded inputs (silu/-200).

Verified L1-L11 audit (>13M points) on K3 X100, 0 failures.

Signed-off-by: qiurui144 <happyqiurui@163.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Mirror AVX2Vnni: wire each MLAS_PLATFORM signedness slot
(U8S8/U8U8/S8S8/S8U8) to its self-describing symbol instead
of aliasing all four to MlasGemmU8S8DispatchRvv. The RVV
INT8 kernel itself already handles all signedness via the
existing MlasGemmQuantFixupZeroPoint{A,B} XOR-0x80 fixup;
only the symbol naming changes.
Signed-off-by: qiurui144 <happyqiurui@163.com>
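The XOR-0x80 fixup mentioned above rests on a simple identity: flipping the top bit of a signed 8-bit value gives the same result as adding 128 and reinterpreting it as unsigned, so shifting both a value and its zero-point leaves their difference unchanged. The sketch below only demonstrates that identity; `FlipSignBit` and `CenteredAfterFixup` are hypothetical helpers, not the MLAS `MlasGemmQuantFixupZeroPoint{A,B}` functions.

```cpp
#include <cstdint>

// Map a signed 8-bit value into the unsigned domain by flipping bit 7.
// For any int8 v this equals v + 128 viewed as a uint8.
inline uint8_t FlipSignBit(int8_t v) {
    return static_cast<uint8_t>(static_cast<uint8_t>(v) ^ 0x80u);
}

// (v - zero_point) computed after both operands were shifted into the
// unsigned domain; the +128 offsets cancel, so the result is unchanged.
inline int32_t CenteredAfterFixup(int8_t v, int8_t zero_point) {
    return static_cast<int32_t>(FlipSignBit(v)) -
           static_cast<int32_t>(FlipSignBit(zero_point));
}
```

This is why a single unsigned kernel can serve all four signedness combinations once the data and zero-points receive the same bit flip.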
At x = EXP_CLAMP_MAX (88.722839f), x*LOG2E rounds exactly to 128 inside `n = vfcvt_x_f(...)`, so (n+127)<<23 reinterpreted as float produces +Inf instead of a finite value. Tighten the clamp to +/-87.0f so n stays <= 126 and the result remains finite for the entire input range.

Signed-off-by: qiurui144 <happyqiurui@163.com>

The previous design required each caller to clamp x into [-87, 87] before invoking exp_f32m4. Logistic/SiLU only clamped the lower bound, so e.g. silu(-200) computed exp(+200) internally and returned ~ -input instead of ~ 0. Move the upper/lower clamp inside exp_f32m4 itself so all callers are protected regardless of their own clamping.

Signed-off-by: qiurui144 <happyqiurui@163.com>
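The overflow described in these two fixes can be reproduced with plain scalar code. The sketch below assumes the usual IEEE-754 binary32 layout and default round-to-nearest mode; `ExponentFor`, `Pow2FromExponentBits` and `ClampExpInput` are hypothetical stand-ins for the corresponding steps inside `exp_f32m4`, whose actual code is not shown in this conversation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>

// Range-reduction exponent: n = round(x * log2(e)).
inline int32_t ExponentFor(float x) {
    return static_cast<int32_t>(std::lrint(x * 1.44269504f));
}

// Build 2^n by writing the biased exponent field directly: (n + 127) << 23.
// For n = 128 the field becomes 255 with a zero mantissa, which encodes +Inf.
inline float Pow2FromExponentBits(int32_t n) {
    const uint32_t bits = static_cast<uint32_t>(n + 127) << 23;
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Clamp applied inside the exp routine itself, so every caller is protected
// even if it forgot to clamp one side of the range.
inline float ClampExpInput(float x) {
    return std::min(std::max(x, -87.0f), 87.0f);
}
```

With the old bound, `ExponentFor(88.722839f)` reaches 128 and the reconstructed scale is infinite; after clamping to 87 the exponent tops out at 126, which is still a finite power of two.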
Description
Follow-up to #28261 (RVV CPU EP). Four additive commits on top of the existing `if(HAS_RISCV64_RVV)` block in `cmake/onnxruntime_mlas.cmake`:
- `riscv64/qgemm_kernel_rvv.cpp`: `vwmulu.vv` / `vwaddu.wv` widening; wired through `MLAS_PLATFORM` 4-signedness dispatch (LARCH64 idiom).
- Extend the `MlasGemvFloatKernel` `#elif` to RISCV64; kernel comes from existing `SgemvKernelScalar.cpp` (rv-gcc autovec).
- `riscv64/activation_kernel_rvv.cpp`: RVV `Erf`, `Tanh`, `Logistic`, `ComputeExpF32`, `Silu`, `GeluErf`.
- `MlasGemmQuantTryGemvKernel<...RVV>` specialization (mirrors AVX2 `qgemm_kernel_avx2.cpp:131`); U8×S8 only.

Performance
K3 (SpacemiT X100, VLEN=256, 4 threads, p50 ms over 30 reps, real BAAI/bge-* ONNX inputs). Baseline = `microsoft/onnxruntime` post-#28261 (`62f742f1aa`). Cross-built with `riscv64-linux-gnu-g++` 15.2 + `-march=rv64gcv -mabi=lp64d`.

FP32 transformer encoders
INT8 quantized
INT8 GEMV M=1 (kernel-level micro-bench, 1 thread)
The GEMV path triggers only when `RangeCountM == 1` and zero-points are zero (`qgemm.h:331` gate); BERT seq=128 encoders do not exercise it, so its contribution is not visible in the e2e tables above.