
[RISC-V] Add RVV INT8 GEMM/GEMV, M=1 routing, and activation kernels (follow-up #28261)#28308

Open
qiurui144 wants to merge 8 commits into microsoft:main from qiurui144:feat/mlas-rvv-int8-gemv-activation

Conversation

@qiurui144

Description

Follow-up to #28261 (RVV CPU EP). Four additive commits on top of the
existing if(HAS_RISCV64_RVV) block in cmake/onnxruntime_mlas.cmake:

  1. INT8 GEMM (riscv64/qgemm_kernel_rvv.cpp) — vwmulu.vv /
    vwaddu.wv widening; wired through MLAS_PLATFORM 4-signedness
    dispatch (LARCH64 idiom).
  2. M=1 SGEMM routing — extends the ARM64/WASM MlasGemvFloatKernel
    #elif to RISCV64; kernel comes from existing SgemvKernelScalar.cpp
    (rv-gcc autovec).
  3. Activation kernels (riscv64/activation_kernel_rvv.cpp) — RVV
    Erf, Tanh, Logistic, ComputeExpF32, Silu, GeluErf.
  4. INT8 GEMV M=1 fast path — MlasGemmQuantTryGemvKernel<...RVV>
    specialization (mirrors AVX2 qgemm_kernel_avx2.cpp:131); U8×S8 only.
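The arithmetic the widening INT8 kernel computes can be sketched in scalar form. This is a hypothetical reference, not an MLAS symbol: each u8×s8 product widens to i16 (the vwmulu.vv/vwmul analogue) and accumulates into i32 (the vwaddu.wv analogue). The real kernel handles the other signedness combinations via zero-point fixups; this sketch covers the direct U8×S8 dot product only.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar reference for the widening accumulate pattern.
// Each product fits in i16 (255 * -128 = -32640 > INT16_MIN), and the
// running sum is kept in i32, mirroring the vector widening steps.
int32_t DotU8S8Reference(const uint8_t* a, const int8_t* b, size_t k) {
    int32_t acc = 0;
    for (size_t i = 0; i < k; ++i) {
        int16_t prod = static_cast<int16_t>(a[i]) *
                       static_cast<int16_t>(b[i]);  // widen u8*s8 -> i16
        acc += prod;                                // widen i16 -> i32 on add
    }
    return acc;
}
```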

Performance

K3 (SpacemiT X100, VLEN=256, 4 threads, p50 ms over 30 reps, real
BAAI/bge-* ONNX inputs). Baseline = microsoft/onnxruntime post-#28261
(62f742f1aa).
Cross-built with riscv64-linux-gnu-g++ 15.2 + -march=rv64gcv -mabi=lp64d.

FP32 transformer encoders

| Model | Upstream | This PR | Speedup |
|---|---|---|---|
| BAAI/bge-small-zh-v1.5 | 66.3 ms | 63.8 ms | 1.04× |
| BAAI/bge-base-zh-v1.5 | 404.4 ms | 393.5 ms | 1.03× |
| BAAI/bge-reranker-base | 403.7 ms | 391.7 ms | 1.03× |

INT8 quantized

| Model | Upstream | This PR | Speedup |
|---|---|---|---|
| BAAI/bge-small-zh-v1.5 INT8 | 301.3 ms | 131.3 ms | 2.29× |
| BAAI/bge-base-zh-v1.5 INT8 | 1958.8 ms | 669.2 ms | 2.93× |
| BAAI/bge-reranker-base INT8 | 1956.8 ms | 668.6 ms | 2.93× |

INT8 GEMV M=1 (kernel-level micro-bench, 1 thread)

| Shape | Scalar | Autovec | This PR | vs autovec |
|---|---|---|---|---|
| K=N=384 | 1.45 GOPS | 2.34 GOPS | 16.68 GOPS | 7.13× |
| K=N=768 | 1.46 GOPS | 2.34 GOPS | 6.17 GOPS | 2.64× |
| K=N=4096 | 1.46 GOPS | 2.33 GOPS | 1.92 GOPS | 0.82× (memory-bound) |

The GEMV path triggers only when RangeCountM == 1 and zero-points are
zero (qgemm.h:331 gate); BERT seq=128 encoders do not exercise it, so
its contribution is not visible in the e2e tables above.

qiurui144 added 4 commits May 1, 2026 19:51
Add onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp, a
standard-RVV (baseline V extension, VLEN>=128, dynamic vsetvli) INT8
GEMM kernel using the vwmulu.vv + vwaddu.wv widening pattern. Works for
any VLEN without rebuild.

Wired into the existing RISCV64 RVV build block introduced by microsoft#28261:
- cmake/onnxruntime_mlas.cmake: append qgemm_kernel_rvv.cpp to the
  if(HAS_RISCV64_RVV) source list (additive, no new block).
- qgemm.h: add an MLAS_TARGET_RISCV64 dispatch branch that selects
  MlasGemmU8S8DispatchRvv for all four (A,B) signedness combinations,
  matching the inline-extern style used by ARM64EC / WASM_SIMD /
  S390X branches above it.

Measured K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel
throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50
stays at 89ms (unchanged from upstream main; no regression).

Signed-off-by: qiurui144 <happyqiurui@163.com>
The existing M=1 fast path in MlasGemmBatch already routes to
MlasGemvFloatKernel() for ARM64/WASM when TransB == CblasNoTrans;
extend the '#elif' guard in sgemm.cpp to include RISCV64 and the
forward declaration in mlasi.h alongside the ARM64/WASM branch.

Symbol is provided by the existing scalar/SgemvKernelScalar.cpp,
which compiles cleanly under rv-gcc 15.2 with autovectorisation.
Hand-written RVV intrinsics for this kernel were evaluated and
discarded: a 4x K-unrolled LMUL=m4 implementation produced
2.42 GFLOPS at K=N=768 vs 6.85 GFLOPS for the rv-gcc 15.2 autovec
output of the scalar source on the same hardware (SpacemiT X100,
VLEN=256). The dual-issue in-order pipeline cannot hide the
dependency chain between vle32.v + vfmacc.vf reusing the same
m4 register file; autovec's narrower LMUL=m1 with independent
vector regs per iteration sustains higher utilisation. This
matches existing CLAUDE.md/contrib guidance to prefer portable
implementations unless hand-written intrinsics show >=1.3x kernel
speedup.

Signed-off-by: qiurui144 <happyqiurui@163.com>
Add an RVV-vectorised activation/compute family at
  onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp

covering Erf, Tanh, Logistic, ComputeExpF32, Silu and GeluErf. Wired
into the dispatch framework introduced by microsoft#28261:

- mlasi.h: extend the existing
  `MLAS_TARGET_RISCV64 && MLAS_USE_RVV` kernel-decl block with the six
  new symbols (Erf, Logistic, GeluErf, Silu, Tanh, ComputeExpF32),
  and add four MLAS_PLATFORM dispatch fields
  (GeluErfKernelRoutine, SiluKernelRoutine, TanhKernelRoutine,
  ComputeExpF32Kernel) under a RISCV64-only block.
- platform.cpp: in the RISCV64 init block, default-assign the four new
  fields to the upstream scalar kernels and override them with the RVV
  variants inside the existing `if (has_rvv)` gate.
- erf.cpp / logistic.cpp / tanh.cpp / compute.cpp / gelu.cpp / silu.cpp:
  extend the dispatch-site `#if defined(MLAS_TARGET_AMD64) || ...`
  guard to include `MLAS_TARGET_RISCV64`.
- cmake/onnxruntime_mlas.cmake: append activation_kernel_rvv.cpp to the
  `if(HAS_RISCV64_RVV)` source list (additive, no new block).

Kernel strategy: LMUL=m4 throughout (32 floats per vector at
VLEN=256, scales with VLEN via dynamic vsetvli). exp uses Cody-Waite
range reduction + 6th-order minimax polynomial; erf/gelu use the
Abramowitz & Stegun 5-term approximation (max ~2.5e-5 abs error).
Silu fuses `x * sigmoid(x)` in a single pass to halve memory traffic.
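The exp strategy described above can be sketched in scalar C++. The function name, clamp value, and split-ln2 constants are illustrative (the kernel's actual coefficients live in activation_kernel_rvv.cpp), and a short Taylor polynomial stands in for the 6th-order minimax fit; the structure (Cody-Waite reduction, polynomial on the reduced argument, 2^n built from exponent bits) matches the description:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Illustrative scalar version of the vectorized exp: x = n*ln2 + r with
// ln2 split into high/low parts so the subtraction stays exact, exp(r)
// from a short polynomial, and 2^n constructed directly in the exponent
// field. Constants are assumptions for this sketch.
float ExpF32Reference(float x) {
    const float kLog2E = 1.442695041f;
    const float kLn2Hi = 0.693145752f;    // high bits of ln 2
    const float kLn2Lo = 1.42860677e-6f;  // remainder of ln 2
    const float kClamp = 87.0f;           // keeps (n+127)<<23 finite
    if (x >  kClamp) x =  kClamp;
    if (x < -kClamp) x = -kClamp;
    float nf = std::nearbyint(x * kLog2E);
    float r  = (x - nf * kLn2Hi) - nf * kLn2Lo;   // Cody-Waite reduction
    // exp(r) on |r| <= ln2/2; Taylor shown for clarity, the kernel uses
    // a minimax fit of the same degree.
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (1.0f / 6 +
              r * (1.0f / 24 + r / 120))));
    int32_t bits = (static_cast<int32_t>(nf) + 127) << 23;  // 2^n
    float two_n;
    std::memcpy(&two_n, &bits, sizeof(two_n));
    return p * two_n;
}
```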

Signed-off-by: qiurui144 <happyqiurui@163.com>
Specialize MlasGemmQuantTryGemvKernel for MLAS_GEMM_QUANT_KERNEL_RVV
with a hand-written RVV kernel covering the U8 x S8 case, mirroring
the existing AVX2 specialization in qgemm_kernel_avx2.cpp. The default
template (qgemm.h) returns false; this overrides it so M=1 INT8 GEMV
calls bypass the PackA/PackB pipeline.

Strategy: K-outer, N-tiled with LMUL=m4 e32 i32 accumulator. For each
K iteration broadcast A[k] as i16 (zero-extended from u8), sign-extend
B[k*ldb + n..n+vl) from i8 to i16, then widen-mul-accumulate into the
i32 acc with vwmacc.vx. Acts only on the U8 x S8 combination; all other
signedness combinations return false and fall through to the packed
kernel as today.
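The strategy above amounts to the following scalar reference (hypothetical name, not an MLAS symbol): C[n] = Σₖ A[k]·B[k·ldb + n], with A zero-extended from u8, B sign-extended from s8, and an i32 accumulator, which is what the K-outer vwmacc.vx loop vectorizes over n:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar reference for the M=1 U8 x S8 GEMV.
void GemvU8S8Reference(const uint8_t* A, const int8_t* B, int32_t* C,
                       size_t K, size_t N, size_t ldb) {
    for (size_t n = 0; n < N; ++n) C[n] = 0;
    for (size_t k = 0; k < K; ++k) {      // K-outer, as in the kernel
        int32_t a = A[k];                 // zero-extend u8 (broadcast A[k])
        for (size_t n = 0; n < N; ++n) {  // the vectorized dimension
            C[n] += a * static_cast<int32_t>(B[k * ldb + n]);  // sign-extend s8
        }
    }
}
```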

Microbench on K3 (SpacemiT X100, VLEN=256, 1 thread):

  K=N=384:   scalar 1.45  autovec 2.34  RVV m4 16.68 GOPS  (7.1x autovec)
  K=N=768:   scalar 1.46  autovec 2.34  RVV m4  6.17 GOPS  (2.6x autovec)
  K=N=4096:  scalar 1.46  autovec 2.33  RVV m4  1.92 GOPS  (memory-bound,
                                                            no benefit)

Numerical correctness: zero diff against scalar reference across
random U8 x S8 inputs at K=N in {384, 768, 4096}.

The kernel is invoked from MlasGemmQuantOperation only when
RangeCountM == 1 and zero-points are zero (existing qgemm.h gate),
so encoder workloads such as BERT seq=128 do not exercise it. The
target use cases match AVX2's MlasGemvU8S8KernelAvx2: dynamic-quant
LLM decode, tiny-batch inference, and any M=1 dispatch site that
would otherwise pay the PackB cost on a one-row matmul.

Signed-off-by: qiurui144 <happyqiurui@163.com>
@qiurui144
Author

@hariharans29, based on PR #28261, I added some patches for RVV improvements. Thanks for your review.

Contributor

Copilot AI left a comment


Pull request overview

Adds additional RISC-V RVV-optimized MLAS kernels (INT8 quantized GEMM/GEMV and several FP32 activations) and wires them into the existing riscv64 RVV runtime dispatch path to improve performance on vector-capable hardware.

Changes:

  • Add RVV INT8 QGEMM kernel (including an M=1 U8×S8 GEMV fast-path specialization).
  • Add RVV unary activation kernels (Erf/Tanh/Logistic/Exp/SiLU/GELU-erf) and route existing activation entrypoints through MLAS platform dispatch on riscv64.
  • Extend SGEMM M=1 routing to use the existing GEMV float kernel on riscv64 and update build wiring to compile new RVV sources.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cmake/onnxruntime_mlas.cmake Adds new riscv64 RVV sources to the MLAS build under HAS_RISCV64_RVV.
onnxruntime/core/mlas/lib/platform.cpp Initializes riscv64 quant dispatch + activation routine pointers; switches to RVV implementations when available at runtime.
onnxruntime/core/mlas/lib/mlasi.h Declares new RVV kernel symbols and adds riscv64 platform fields for quant dispatch and activations.
onnxruntime/core/mlas/lib/qgemm.h Adds riscv64 4-signedness dispatch selection in MlasGemmQuantGetDispatch.
onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp New RVV INT8 QGEMM kernel + U8×S8 GEMV M=1 fast path and RVV packing specializations.
onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp New RVV FP32 activation kernels and a shared exp approximation.
onnxruntime/core/mlas/lib/sgemm.cpp Enables the M=1 GEMV routing path for riscv64 when TransB == NoTrans.
onnxruntime/core/mlas/lib/{erf.cpp,logistic.cpp,tanh.cpp,gelu.cpp,silu.cpp,compute.cpp} Routes riscv64 activation entrypoints through platform dispatch to enable RVV overrides.


Comment thread onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/platform.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp
qiurui144 added a commit to qiurui144/onnxruntime that referenced this pull request May 4, 2026
Squashed 4 fixes per AI audit (>13M-point precision verify on K3 X100).

Files touched:

- mlas/lib/riscv64/qgemm_kernel_rvv.cpp: drop dead MlasGemmU8S8DispatchRvvPtr; instantiate four signedness-specific dispatch structs via macro
- mlas/lib/mlasi.h: declare U8U8/S8S8/S8U8 dispatch externs
- mlas/lib/platform.cpp: wire each MLAS_PLATFORM signedness slot to its self-describing symbol (mirrors AVX2Vnni convention)
- mlas/lib/riscv64/activation_kernel_rvv.cpp: tighten EXP_CLAMP_MAX 88.722839f -> 87.0f to keep n*LOG2E rounding < 128 (was producing +Inf at boundary); add internal defensive clamp in exp_f32m4 so callers cannot trigger Inf via large unbounded inputs (silu/-200).

Verified L1-L11 audit (>13M points) on K3 X100, 0 failures.

Signed-off-by: qiurui144 <happyqiurui@163.com>
qiurui144 and others added 4 commits May 4, 2026 23:08
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Mirror AVX2Vnni: wire each MLAS_PLATFORM signedness slot
(U8S8/U8U8/S8S8/S8U8) to its self-describing symbol instead
of aliasing all four to MlasGemmU8S8DispatchRvv. The RVV
INT8 kernel itself already handles all signedness via the
existing MlasGemmQuantFixupZeroPoint{A,B} XOR-0x80 fixup;
only the symbol naming changes.

Signed-off-by: qiurui144 <happyqiurui@163.com>
At x = EXP_CLAMP_MAX (88.722839f), x*LOG2E rounds exactly to
128 inside `n = vfcvt_x_f(...)`, so (n+127)<<23 reinterpreted
as float produces +Inf instead of a finite value. Tighten the
clamp to +/-87.0f so n stays <= 126 and the result remains
finite for the entire input range.

Signed-off-by: qiurui144 <happyqiurui@163.com>
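The boundary behavior this commit fixes follows directly from the IEEE-754 single-precision encoding, and can be demonstrated with a scalar helper (hypothetical name): `(n + 127) << 23` reinterpreted as float yields 2^n only while the exponent field stays below 255; at n = 128 the field becomes all ones, which encodes +Inf.

```cpp
#include <cstdint>
#include <cstring>

// Builds 2^n by writing n+127 (the IEEE-754 bias) into the exponent
// field, as the kernel's exp does. Valid only for n in [-126, 127];
// n = 128 produces the Inf bit pattern the commit message describes.
float TwoToTheN(int32_t n) {
    int32_t bits = (n + 127) << 23;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```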
The previous design required each caller to clamp x into
[-87, 87] before invoking exp_f32m4. Logistic/SiLU only
clamped the lower bound, so e.g. silu(-200) computed
exp(+200) internally and returned ~ -input instead of ~ 0.
Move the upper/lower clamp inside exp_f32m4 itself so all
callers are protected regardless of their own clamping.

Signed-off-by: qiurui144 <happyqiurui@163.com>
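The protected pattern described above can be sketched in scalar form. The function name and the 87.0f bound are illustrative (mirroring the tightened EXP_CLAMP_MAX); the point is that the clamp sits on exp's own argument, so a caller passing silu(-200) gets ~0 instead of a sign-flipped result from an internally overflowed exp(+200):

```cpp
#include <cmath>

// Hypothetical scalar sketch: SiLU written as x / (1 + exp(-x)), with
// the exp argument clamped on BOTH sides inside the exp path, so large
// negative inputs cannot drive the internal exp to overflow.
float SiluClamped(float x) {
    float t = -x;                  // argument handed to exp
    if (t >  87.0f) t =  87.0f;    // upper clamp: the bound callers used to miss
    if (t < -87.0f) t = -87.0f;    // lower clamp
    return x / (1.0f + std::exp(t));
}
```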