
Conversation

@zhangjian29
Contributor

Description

This PR introduces an optimized batch normalization (bnorm) primitive for RV64 architectures using RVV (RISC-V Vector) intrinsics. The rvv_batch_normalization implementation follows the structure of acl_batch_normalization for maintainability.

This initial version provides (a usage sketch follows the list):

  1. Supported memory layouts: plain, no blocks/padding
  2. Supported data types: f32 only
  3. Supported bnorm flags: G required, C/H/R optional (R requires inference)
  4. Supported post ops: relu without alpha/beta only (by integrating with rvv_postops.hpp)
  5. Supported directions: FWD_D and FWD_I
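
For context, here is a minimal sketch of how a caller could request a bnorm primitive matching these capabilities through the public oneDNN C++ API; the shape, epsilon, and engine index are illustrative, not taken from the PR:

```cpp
#include "dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Plain NHWC f32 tensor: a layout this initial version supports.
    memory::desc data_md({1, 64, 56, 56}, memory::data_type::f32,
            memory::format_tag::nhwc);

    // G (global stats) is required; C (scale) and H (shift) are optional.
    auto flags = normalization_flags::use_global_stats
            | normalization_flags::use_scale | normalization_flags::use_shift;

    // ReLU post-op without alpha/beta, the only post-op supported here.
    post_ops po;
    po.append_eltwise(algorithm::eltwise_relu, 0.f, 0.f);
    primitive_attr attr;
    attr.set_post_ops(po);

    batch_normalization_forward::primitive_desc bn_pd(eng,
            prop_kind::forward_inference, data_md, data_md, 1e-5f, flags,
            attr);
    auto bn = batch_normalization_forward(bn_pd);
    // ... create src/dst/mean/variance/scale/shift memories and execute.
    return 0;
}
```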

Implementation Details

  • Vectorization Method: for the abx data tag, rvv_bnorm vectorizes over the W dimension at a fixed channel; for the axb data tag, it vectorizes across the C dimension.
  • Shared Intrinsic Kernel: because data vectorized across channels is reordered through a scratchpad, both vectorization paths can share a single intrinsic kernel, bn_fwd_kernel_f32 (a hedged sketch follows).
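
For illustration, a minimal sketch of what such a shared span kernel could look like; this is not the PR's actual bn_fwd_kernel_f32, and the function name, signature, and LMUL=4 choice are assumptions. It applies the folded affine y = x * a + b over a contiguous span for one channel, with an optional fused relu:

```cpp
#include <riscv_vector.h>  // requires -march=rv64gcv

// Hedged sketch (not the PR's bn_fwd_kernel_f32): apply y = x * a + b over a
// contiguous span for one channel, with an optional fused ReLU, strip-mined
// with vsetvl so the tail needs no scalar epilogue.
static void bn_fwd_span_f32(const float *src, float *dst, size_t len,
        float a /* = scale / sqrt(var + eps) */,
        float b /* = shift - mean * a */, bool with_relu) {
    for (size_t i = 0; i < len;) {
        size_t vl = __riscv_vsetvl_e32m4(len - i); // LMUL=4 is an assumption
        vfloat32m4_t vx = __riscv_vle32_v_f32m4(src + i, vl);
        vfloat32m4_t vb = __riscv_vfmv_v_f_f32m4(b, vl);
        // One fused multiply-add per element: vy = vx * a + b
        vfloat32m4_t vy = __riscv_vfmadd_vf_f32m4(vx, a, vb, vl);
        if (with_relu) vy = __riscv_vfmax_vf_f32m4(vy, 0.f, vl);
        __riscv_vse32_v_f32m4(dst + i, vy, vl);
        i += vl;
    }
}
```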

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

All experiments were performed on an SG2044 platform with the following setup:

  • Test Cases: benchdnn input shapes of densenet_121 / googlenet_v2 / googlenet_v3 / resnet_50
  • Test Dtypes: f32
  • Test Args: --dir=FWD_I --flags=GCHR --attr-post-ops=relu (an example invocation is sketched below)
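
A hedged example of one such benchdnn run; the batch-file path is an assumption based on oneDNN's benchdnn inputs layout:

```sh
# Performance mode over the resnet_50 bnorm shapes (path assumed).
./benchdnn --bnorm --mode=P --dt=f32 --dir=FWD_I --flags=GCHR \
    --attr-post-ops=relu --batch=inputs/bnorm/shapes_resnet_50
```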

We compare three methods: the scalar implementation as the first baseline, compiler auto-vectorization as the second baseline, and our RVV intrinsic implementation.

  1. Scalar: ncsp_batch_normalization implementation compiled by gcc 14.2 with -march=rv64gc -O3
  2. Auto Vectorization: ncsp_batch_normalization implementation compiled by gcc 14.2 with -march=rv64gcv -O3 -ftree-vectorize
  3. RVV Intrinsic: our rvv_batch_normalization implementation compiled by gcc 14.2 with -march=rv64gcv -O3

Results

On average, the RVV intrinsic implementation achieves a 1.24x speedup over the scalar method and a 1.18x speedup over compiler auto-vectorization.

In the best test case, mb1ic1024ih15n"densenet_121:conv4_blk/bn", the RVV intrinsics achieve a 3.98x speedup over the scalar method.

Detailed results are as follows:

Table 1: Runtime Comparisons of Scalar and RVV Intrinsic

| Cases | Scalar Runtime (ms) | Intrinsic Runtime (ms) | Speedup |
|---|---|---|---|
| densenet_121 | 504.05 | 466.69 | 1.08 |
| googlenet_v2 | 583.13 | 471.34 | 1.24 |
| googlenet_v3 | 890.16 | 785.98 | 1.13 |
| resnet_50 | 587.62 | 337.41 | 1.74 |

Table 2: Runtime Comparisons of Auto-vectorization and RVV Intrinsic

| Cases | Auto-vec Runtime (ms) | Intrinsic Runtime (ms) | Speedup |
|---|---|---|---|
| densenet_121 | 474.74 | 466.69 | 1.02 |
| googlenet_v2 | 557.66 | 471.34 | 1.18 |
| googlenet_v3 | 978.28 | 785.98 | 1.24 |
| resnet_50 | 412.80 | 337.41 | 1.22 |

@zhangjian29 force-pushed the add-rvv-batch-norm-integration branch from 79c8421 to 4f779ef on November 19, 2025.
@zhangjian29
Contributor Author

Hi @vpirogov @dzarukin @mgouicem,

Looking forward to your reviews and feedback on this PR. Thanks.


@mgouicem left a comment


Thanks for the contribution. Out of curiosity, do you know if the speedup over compiler vectorization is coming from relu post-op inlining or from fully unrolling over C?

@zhangjian29
Copy link
Contributor Author

zhangjian29 commented Nov 20, 2025

> Thanks for the contribution. Out of curiosity, do you know if the speedup over compiler vectorization is coming from relu post-op inlining or from fully unrolling over C?

@mgouicem Thanks for your question. We additionally ran test 1 with --tag=acdb --flags=G --attr-post-ops= and compared it against test 2 with --tag=acdb --flags=G --attr-post-ops=relu.

The results show that the major performance gain (about 90%) comes from vectorization over the channel dimension rather than from relu post-op inlining (a sketch of such a channel-dim loop follows the tables).

Table 1: Runtime Comparisons of Auto-vectorization and RVV Intrinsic in Test 1 (without relu)

| Cases | Auto-vec Runtime (ms) | Intrinsic Runtime (ms) | Speedup |
|---|---|---|---|
| densenet_121 | 531.39 | 21.37 | 24.87 |
| googlenet_v2 | 8828.95 | 572.58 | 15.42 |
| googlenet_v3 | 19713.8 | 1286.52 | 15.32 |
| resnet_50 | 5713.12 | 341.46 | 16.73 |
| total | 34787.26 | 2221.93 | 15.65 |

Table 2: Runtime Comparisons of Auto-vectorization and RVV Intrinsic in Test 2 (with relu)

| Cases | Auto-vec Runtime (ms) | Intrinsic Runtime (ms) | Speedup |
|---|---|---|---|
| densenet_121 | 578.73 | 21.67 | 26.71 |
| googlenet_v2 | 9670.51 | 564.31 | 17.14 |
| googlenet_v3 | 21422.7 | 1279.28 | 16.75 |
| resnet_50 | 6262.93 | 332.14 | 18.86 |
| total | 37934.87 | 2196.73 | 17.27 |
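
For illustration, a hedged sketch of the kind of channel-dim loop that makes the acdb path fast: channels are innermost and contiguous, so each spatial point gets unit-stride vector loads over C, with the per-channel folded scale/shift loaded as vectors rather than splatted scalars. Names and structure are assumptions, not the PR's code (which routes this case through a scratchpad reorder into the shared kernel):

```cpp
#include <riscv_vector.h>  // requires -march=rv64gcv

// Hedged sketch of channel-dim vectorization for acdb (NHWC) f32 data.
// a[c] = scale[c] / sqrt(var[c] + eps), b[c] = shift[c] - mean[c] * a[c].
static void bn_fwd_nhwc_f32(const float *src, float *dst, const float *a,
        const float *b, size_t sp_points /* = N*H*W */, size_t C) {
    for (size_t sp = 0; sp < sp_points; ++sp) {
        const float *x = src + sp * C;
        float *y = dst + sp * C;
        for (size_t c = 0; c < C;) {
            size_t vl = __riscv_vsetvl_e32m4(C - c);
            vfloat32m4_t vx = __riscv_vle32_v_f32m4(x + c, vl);
            vfloat32m4_t va = __riscv_vle32_v_f32m4(a + c, vl);
            vfloat32m4_t vb = __riscv_vle32_v_f32m4(b + c, vl);
            // vy = vx * va + vb: one FMA per element, all loads unit-stride.
            vfloat32m4_t vy = __riscv_vfmadd_vv_f32m4(vx, va, vb, vl);
            __riscv_vse32_v_f32m4(y + c, vy, vl);
            c += vl;
        }
    }
}
```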

@zhangjian29 force-pushed the add-rvv-batch-norm-integration branch from 32fbc0c to 67ab206 on November 20, 2025.
@zhangjian29 requested a review from vpirogov on November 20, 2025.
@vpirogov merged commit 8feeb55 into uxlfoundation:main on November 20, 2025. 30 checks passed.
@zhangjian29 deleted the add-rvv-batch-norm-integration branch on November 21, 2025.
