
[RISC-V] Add RVV INT8 GEMM/GEMV, M=1 routing, and activation kernels (follow-up #28261)#28308

Open
qiurui144 wants to merge 8 commits into microsoft:main from qiurui144:feat/mlas-rvv-int8-gemv-activation

Conversation

@qiurui144

Description

Follow-up to #28261 (RVV CPU EP). Four additive commits on top of the
existing if(HAS_RISCV64_RVV) block in cmake/onnxruntime_mlas.cmake:

  1. INT8 GEMM (riscv64/qgemm_kernel_rvv.cpp) — vwmulu.vv /
    vwaddu.wv widening; wired through MLAS_PLATFORM 4-signedness
    dispatch (LARCH64 idiom).
  2. M=1 SGEMM routing — extends the ARM64/WASM MlasGemvFloatKernel
    #elif to RISCV64; kernel comes from existing SgemvKernelScalar.cpp
    (rv-gcc autovec).
  3. Activation kernels (riscv64/activation_kernel_rvv.cpp) — RVV
    Erf, Tanh, Logistic, ComputeExpF32, Silu, GeluErf.
  4. INT8 GEMV M=1 fast path — MlasGemmQuantTryGemvKernel<...RVV>
    specialization (mirrors AVX2 qgemm_kernel_avx2.cpp:131); U8×S8 only.
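The arithmetic the widening INT8 kernel computes can be sketched in scalar form. This is a hypothetical reference, not an MLAS symbol: each u8×s8 product widens to i16 (the vwmulu.vv/vwmul analogue) and accumulates into i32 (the vwaddu.wv analogue). The real kernel handles the other signedness combinations via zero-point fixups; this sketch covers the direct U8×S8 dot product only.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar reference for the widening accumulate pattern.
// Each product fits in i16 (255 * -128 = -32640 > INT16_MIN), and the
// running sum is kept in i32, mirroring the vector widening steps.
int32_t DotU8S8Reference(const uint8_t* a, const int8_t* b, size_t k) {
    int32_t acc = 0;
    for (size_t i = 0; i < k; ++i) {
        int16_t prod = static_cast<int16_t>(a[i]) *
                       static_cast<int16_t>(b[i]);  // widen u8*s8 -> i16
        acc += prod;                                // widen i16 -> i32 on add
    }
    return acc;
}
```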

Performance

K3 (SpacemiT X100, VLEN=256, 4 threads, p50 ms over 30 reps, real
BAAI/bge-* ONNX inputs). Baseline = microsoft/onnxruntime post-#28261
(62f742f1aa).
Cross-built with riscv64-linux-gnu-g++ 15.2 + -march=rv64gcv -mabi=lp64d.

FP32 transformer encoders

| Model | Upstream | This PR | Speedup |
|---|---|---|---|
| BAAI/bge-small-zh-v1.5 | 66.3 ms | 63.8 ms | 1.04× |
| BAAI/bge-base-zh-v1.5 | 404.4 ms | 393.5 ms | 1.03× |
| BAAI/bge-reranker-base | 403.7 ms | 391.7 ms | 1.03× |

INT8 quantized

| Model | Upstream | This PR | Speedup |
|---|---|---|---|
| BAAI/bge-small-zh-v1.5 INT8 | 301.3 ms | 131.3 ms | 2.29× |
| BAAI/bge-base-zh-v1.5 INT8 | 1958.8 ms | 669.2 ms | 2.93× |
| BAAI/bge-reranker-base INT8 | 1956.8 ms | 668.6 ms | 2.93× |

INT8 GEMV M=1 (kernel-level micro-bench, 1 thread)

| Shape | Scalar | Autovec | This PR | vs autovec |
|---|---|---|---|---|
| K=N=384 | 1.45 GOPS | 2.34 GOPS | 16.68 GOPS | 7.13× |
| K=N=768 | 1.46 GOPS | 2.34 GOPS | 6.17 GOPS | 2.64× |
| K=N=4096 | 1.46 GOPS | 2.33 GOPS | 1.92 GOPS | 0.82× (memory-bound) |

The GEMV path triggers only when RangeCountM == 1 and zero-points are
zero (qgemm.h:331 gate); BERT seq=128 encoders do not exercise it, so
its contribution is not visible in the e2e tables above.

qiurui144 added 4 commits May 1, 2026 19:51
Add onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp, a
standard-RVV (baseline V extension, VLEN>=128, dynamic vsetvli) INT8
GEMM kernel using the vwmulu.vv + vwaddu.wv widening pattern. Works for
any VLEN without rebuild.

Wired into the existing RISCV64 RVV build block introduced by microsoft#28261:
- cmake/onnxruntime_mlas.cmake: append qgemm_kernel_rvv.cpp to the
  if(HAS_RISCV64_RVV) source list (additive, no new block).
- qgemm.h: add an MLAS_TARGET_RISCV64 dispatch branch that selects
  MlasGemmU8S8DispatchRvv for all four (A,B) signedness combinations,
  matching the inline-extern style used by ARM64EC / WASM_SIMD /
  S390X branches above it.

Measured K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel
throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50
stays at 89ms (unchanged from upstream main; no regression).

Signed-off-by: qiurui144 <happyqiurui@163.com>
The existing M=1 fast path in MlasGemmBatch already routes to
MlasGemvFloatKernel() for ARM64/WASM when TransB == CblasNoTrans;
extend the '#elif' guard in sgemm.cpp to include RISCV64 and the
forward declaration in mlasi.h alongside the ARM64/WASM branch.

Symbol is provided by the existing scalar/SgemvKernelScalar.cpp,
which compiles cleanly under rv-gcc 15.2 with autovectorisation.
Hand-written RVV intrinsics for this kernel were evaluated and
discarded: a 4x K-unrolled LMUL=m4 implementation produced
2.42 GFLOPS at K=N=768 vs 6.85 GFLOPS for the rv-gcc 15.2 autovec
output of the scalar source on the same hardware (SpacemiT X100,
VLEN=256). The dual-issue in-order pipeline cannot hide the
dependency chain between vle32.v + vfmacc.vf reusing the same
m4 register file; autovec's narrower LMUL=m1 with independent
vector regs per iteration sustains higher utilisation. This
matches existing CLAUDE.md/contrib guidance to prefer portable
implementations unless hand-written intrinsics show >=1.3x kernel
speedup.

Signed-off-by: qiurui144 <happyqiurui@163.com>
Add an RVV-vectorised activation/compute family at
  onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp

covering Erf, Tanh, Logistic, ComputeExpF32, Silu and GeluErf. Wired
into the dispatch framework introduced by microsoft#28261:

- mlasi.h: extend the existing
  `MLAS_TARGET_RISCV64 && MLAS_USE_RVV` kernel-decl block with the six
  new symbols (Erf, Logistic, GeluErf, Silu, Tanh, ComputeExpF32),
  and add four MLAS_PLATFORM dispatch fields
  (GeluErfKernelRoutine, SiluKernelRoutine, TanhKernelRoutine,
  ComputeExpF32Kernel) under a RISCV64-only block.
- platform.cpp: in the RISCV64 init block, default-assign the four new
  fields to the upstream scalar kernels and override them with the RVV
  variants inside the existing `if (has_rvv)` gate.
- erf.cpp / logistic.cpp / tanh.cpp / compute.cpp / gelu.cpp / silu.cpp:
  extend the dispatch-site `#if defined(MLAS_TARGET_AMD64) || ...`
  guard to include `MLAS_TARGET_RISCV64`.
- cmake/onnxruntime_mlas.cmake: append activation_kernel_rvv.cpp to the
  `if(HAS_RISCV64_RVV)` source list (additive, no new block).

Kernel strategy: LMUL=m4 throughout (32 floats per vector at
VLEN=256, scales with VLEN via dynamic vsetvli). exp uses Cody-Waite
range reduction + 6th-order minimax polynomial; erf/gelu use the
Abramowitz & Stegun 5-term approximation (max ~2.5e-5 abs error).
Silu fuses `x * sigmoid(x)` in a single pass to halve memory traffic.
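The exp strategy described above can be sketched in scalar C++. The function name, clamp value, and split-ln2 constants are illustrative (the kernel's actual coefficients live in activation_kernel_rvv.cpp), and a short Taylor polynomial stands in for the 6th-order minimax fit; the structure (Cody-Waite reduction, polynomial on the reduced argument, 2^n built from exponent bits) matches the description:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Illustrative scalar version of the vectorized exp: x = n*ln2 + r with
// ln2 split into high/low parts so the subtraction stays exact, exp(r)
// from a short polynomial, and 2^n constructed directly in the exponent
// field. Constants are assumptions for this sketch.
float ExpF32Reference(float x) {
    const float kLog2E = 1.442695041f;
    const float kLn2Hi = 0.693145752f;    // high bits of ln 2
    const float kLn2Lo = 1.42860677e-6f;  // remainder of ln 2
    const float kClamp = 87.0f;           // keeps (n+127)<<23 finite
    if (x >  kClamp) x =  kClamp;
    if (x < -kClamp) x = -kClamp;
    float nf = std::nearbyint(x * kLog2E);
    float r  = (x - nf * kLn2Hi) - nf * kLn2Lo;   // Cody-Waite reduction
    // exp(r) on |r| <= ln2/2; Taylor shown for clarity, the kernel uses
    // a minimax fit of the same degree.
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (1.0f / 6 +
              r * (1.0f / 24 + r / 120))));
    int32_t bits = (static_cast<int32_t>(nf) + 127) << 23;  // 2^n
    float two_n;
    std::memcpy(&two_n, &bits, sizeof(two_n));
    return p * two_n;
}
```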

Signed-off-by: qiurui144 <happyqiurui@163.com>
Specialize MlasGemmQuantTryGemvKernel for MLAS_GEMM_QUANT_KERNEL_RVV
with a hand-written RVV kernel covering the U8 x S8 case, mirroring
the existing AVX2 specialization in qgemm_kernel_avx2.cpp. The default
template (qgemm.h) returns false; this overrides it so M=1 INT8 GEMV
calls bypass the PackA/PackB pipeline.

Strategy: K-outer, N-tiled with LMUL=m4 e32 i32 accumulator. For each
K iteration broadcast A[k] as i16 (zero-extended from u8), sign-extend
B[k*ldb + n..n+vl) from i8 to i16, then widen-mul-accumulate into the
i32 acc with vwmacc.vx. Acts only on the U8 x S8 combination; all other
signedness combinations return false and fall through to the packed
kernel as today.
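The strategy above amounts to the following scalar reference (hypothetical name, not an MLAS symbol): C[n] = Σₖ A[k]·B[k·ldb + n], with A zero-extended from u8, B sign-extended from s8, and an i32 accumulator, which is what the K-outer vwmacc.vx loop vectorizes over n:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar reference for the M=1 U8 x S8 GEMV.
void GemvU8S8Reference(const uint8_t* A, const int8_t* B, int32_t* C,
                       size_t K, size_t N, size_t ldb) {
    for (size_t n = 0; n < N; ++n) C[n] = 0;
    for (size_t k = 0; k < K; ++k) {      // K-outer, as in the kernel
        int32_t a = A[k];                 // zero-extend u8 (broadcast A[k])
        for (size_t n = 0; n < N; ++n) {  // the vectorized dimension
            C[n] += a * static_cast<int32_t>(B[k * ldb + n]);  // sign-extend s8
        }
    }
}
```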

Microbench on K3 (SpacemiT X100, VLEN=256, 1 thread):

  K=N=384:   scalar 1.45  autovec 2.34  RVV m4 16.68 GOPS  (7.1x autovec)
  K=N=768:   scalar 1.46  autovec 2.34  RVV m4  6.17 GOPS  (2.6x autovec)
  K=N=4096:  scalar 1.46  autovec 2.33  RVV m4  1.92 GOPS  (memory-bound,
                                                            no benefit)

Numerical correctness: zero diff against scalar reference across
random U8 x S8 inputs at K=N in {384, 768, 4096}.

The kernel is invoked from MlasGemmQuantOperation only when
RangeCountM == 1 and zero-points are zero (existing qgemm.h gate),
so encoder workloads such as BERT seq=128 do not exercise it. The
target use cases match AVX2's MlasGemvU8S8KernelAvx2: dynamic-quant
LLM decode, tiny-batch inference, and any M=1 dispatch site that
would otherwise pay the PackB cost on a one-row matmul.

Signed-off-by: qiurui144 <happyqiurui@163.com>
@qiurui144
Author

@hariharans29, based on PR #28261, I added some patches for RVV improvements. Thanks for your review.

Contributor

Copilot AI left a comment


Pull request overview

Adds additional RISC-V RVV-optimized MLAS kernels (INT8 quantized GEMM/GEMV and several FP32 activations) and wires them into the existing riscv64 RVV runtime dispatch path to improve performance on vector-capable hardware.

Changes:

  • Add RVV INT8 QGEMM kernel (including an M=1 U8×S8 GEMV fast-path specialization).
  • Add RVV unary activation kernels (Erf/Tanh/Logistic/Exp/SiLU/GELU-erf) and route existing activation entrypoints through MLAS platform dispatch on riscv64.
  • Extend SGEMM M=1 routing to use the existing GEMV float kernel on riscv64 and update build wiring to compile new RVV sources.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cmake/onnxruntime_mlas.cmake Adds new riscv64 RVV sources to the MLAS build under HAS_RISCV64_RVV.
onnxruntime/core/mlas/lib/platform.cpp Initializes riscv64 quant dispatch + activation routine pointers; switches to RVV implementations when available at runtime.
onnxruntime/core/mlas/lib/mlasi.h Declares new RVV kernel symbols and adds riscv64 platform fields for quant dispatch and activations.
onnxruntime/core/mlas/lib/qgemm.h Adds riscv64 4-signedness dispatch selection in MlasGemmQuantGetDispatch.
onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp New RVV INT8 QGEMM kernel + U8×S8 GEMV M=1 fast path and RVV packing specializations.
onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp New RVV FP32 activation kernels and a shared exp approximation.
onnxruntime/core/mlas/lib/sgemm.cpp Enables the M=1 GEMV routing path for riscv64 when TransB == NoTrans.
onnxruntime/core/mlas/lib/{erf.cpp,logistic.cpp,tanh.cpp,gelu.cpp,silu.cpp,compute.cpp} Routes riscv64 activation entrypoints through platform dispatch to enable RVV overrides.


Comment thread onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/platform.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp
qiurui144 added a commit to qiurui144/onnxruntime that referenced this pull request May 4, 2026
Squashed 4 fixes per AI audit (>13M-point precision verify on K3 X100).

Files touched:

- mlas/lib/riscv64/qgemm_kernel_rvv.cpp: drop dead MlasGemmU8S8DispatchRvvPtr; instantiate four signedness-specific dispatch structs via macro
- mlas/lib/mlasi.h: declare U8U8/S8S8/S8U8 dispatch externs
- mlas/lib/platform.cpp: wire each MLAS_PLATFORM signedness slot to its self-describing symbol (mirrors AVX2Vnni convention)
- mlas/lib/riscv64/activation_kernel_rvv.cpp: tighten EXP_CLAMP_MAX 88.722839f -> 87.0f to keep n*LOG2E rounding < 128 (was producing +Inf at boundary); add internal defensive clamp in exp_f32m4 so callers cannot trigger Inf via large unbounded inputs (silu/-200).

Verified L1-L11 audit (>13M points) on K3 X100, 0 failures.

Signed-off-by: qiurui144 <happyqiurui@163.com>
qiurui144 and others added 4 commits May 4, 2026 23:08
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Mirror AVX2Vnni: wire each MLAS_PLATFORM signedness slot
(U8S8/U8U8/S8S8/S8U8) to its self-describing symbol instead
of aliasing all four to MlasGemmU8S8DispatchRvv. The RVV
INT8 kernel itself already handles all signedness via the
existing MlasGemmQuantFixupZeroPoint{A,B} XOR-0x80 fixup;
only the symbol naming changes.

Signed-off-by: qiurui144 <happyqiurui@163.com>
At x = EXP_CLAMP_MAX (88.722839f), x*LOG2E rounds exactly to
128 inside `n = vfcvt_x_f(...)`, so (n+127)<<23 reinterpreted
as float produces +Inf instead of a finite value. Tighten the
clamp to +/-87.0f so n stays <= 126 and the result remains
finite for the entire input range.

Signed-off-by: qiurui144 <happyqiurui@163.com>
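The boundary behavior this commit fixes follows directly from the IEEE-754 single-precision encoding, and can be demonstrated with a scalar helper (hypothetical name): `(n + 127) << 23` reinterpreted as float yields 2^n only while the exponent field stays below 255; at n = 128 the field becomes all ones, which encodes +Inf.

```cpp
#include <cstdint>
#include <cstring>

// Builds 2^n by writing n+127 (the IEEE-754 bias) into the exponent
// field, as the kernel's exp does. Valid only for n in [-126, 127];
// n = 128 produces the Inf bit pattern the commit message describes.
float TwoToTheN(int32_t n) {
    int32_t bits = (n + 127) << 23;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```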
The previous design required each caller to clamp x into
[-87, 87] before invoking exp_f32m4. Logistic/SiLU only
clamped the lower bound, so e.g. silu(-200) computed
exp(+200) internally and returned ~ -input instead of ~ 0.
Move the upper/lower clamp inside exp_f32m4 itself so all
callers are protected regardless of their own clamping.

Signed-off-by: qiurui144 <happyqiurui@163.com>
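The protected pattern described above can be sketched in scalar form. The function name and the 87.0f bound are illustrative (mirroring the tightened EXP_CLAMP_MAX); the point is that the clamp sits on exp's own argument, so a caller passing silu(-200) gets ~0 instead of a sign-flipped result from an internally overflowed exp(+200):

```cpp
#include <cmath>

// Hypothetical scalar sketch: SiLU written as x / (1 + exp(-x)), with
// the exp argument clamped on BOTH sides inside the exp path, so large
// negative inputs cannot drive the internal exp to overflow.
float SiluClamped(float x) {
    float t = -x;                  // argument handed to exp
    if (t >  87.0f) t =  87.0f;    // upper clamp: the bound callers used to miss
    if (t < -87.0f) t = -87.0f;    // lower clamp
    return x / (1.0f + std::exp(t));
}
```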