
Conversation

@zhangjian29
Contributor

Description

This PR introduces an optimized batch normalization (bnorm) primitive for RV64 architectures using RVV (RISC-V Vector) intrinsics. The rvv_batch_normalization implementation follows the structure of acl_batch_normalization for maintainability.

This initial version provides (a usage sketch follows the list):

  1. Supported memory layouts: plain, no blocks/padding
  2. Supported data types: f32 only
  3. Supported bnorm flags: G required, C/H/R optional (R requires inference)
  4. Supported post ops: relu without alpha/beta only (by integrating with rvv_postops.hpp)
  5. Supported directions: FWD_D and FWD_I
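
For context, here is a minimal sketch of how a caller could request a bnorm primitive matching these capabilities through the public oneDNN C++ API; the shape, epsilon, and engine index are illustrative, not taken from the PR:

```cpp
#include "dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Plain NHWC f32 tensor: a layout this initial version supports.
    memory::desc data_md({1, 64, 56, 56}, memory::data_type::f32,
            memory::format_tag::nhwc);

    // G (global stats) is required; C (scale) and H (shift) are optional.
    auto flags = normalization_flags::use_global_stats
            | normalization_flags::use_scale | normalization_flags::use_shift;

    // ReLU post-op without alpha/beta, the only post-op supported here.
    post_ops po;
    po.append_eltwise(algorithm::eltwise_relu, 0.f, 0.f);
    primitive_attr attr;
    attr.set_post_ops(po);

    batch_normalization_forward::primitive_desc bn_pd(eng,
            prop_kind::forward_inference, data_md, data_md, 1e-5f, flags,
            attr);
    auto bn = batch_normalization_forward(bn_pd);
    // ... create src/dst/mean/variance/scale/shift memories and execute.
    return 0;
}
```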

Implementation Details

  • Vectorization Method: for the abx data tag, rvv_bnorm vectorizes over the W dimension at a fixed channel; for the axb data tag, it vectorizes across the C dimension.
  • Shared Intrinsic Kernel: because data vectorized across channels is reordered through a scratchpad, both vectorization paths can share a single intrinsic kernel, bn_fwd_kernel_f32 (a hedged sketch follows).
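
For illustration, a minimal sketch of what such a shared span kernel could look like; this is not the PR's actual bn_fwd_kernel_f32, and the function name, signature, and LMUL=4 choice are assumptions. It applies the folded affine y = x * a + b over a contiguous span for one channel, with an optional fused relu:

```cpp
#include <riscv_vector.h>  // requires -march=rv64gcv

// Hedged sketch (not the PR's bn_fwd_kernel_f32): apply y = x * a + b over a
// contiguous span for one channel, with an optional fused ReLU, strip-mined
// with vsetvl so the tail needs no scalar epilogue.
static void bn_fwd_span_f32(const float *src, float *dst, size_t len,
        float a /* = scale / sqrt(var + eps) */,
        float b /* = shift - mean * a */, bool with_relu) {
    for (size_t i = 0; i < len;) {
        size_t vl = __riscv_vsetvl_e32m4(len - i); // LMUL=4 is an assumption
        vfloat32m4_t vx = __riscv_vle32_v_f32m4(src + i, vl);
        vfloat32m4_t vb = __riscv_vfmv_v_f_f32m4(b, vl);
        // One fused multiply-add per element: vy = vx * a + b
        vfloat32m4_t vy = __riscv_vfmadd_vf_f32m4(vx, a, vb, vl);
        if (with_relu) vy = __riscv_vfmax_vf_f32m4(vy, 0.f, vl);
        __riscv_vse32_v_f32m4(dst + i, vy, vl);
        i += vl;
    }
}
```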

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

All experiments were performed on an SG2044 platform with the following setup:

  • Test Cases: benchdnn input shapes of densenet_121 / googlenet_v2 / googlenet_v3 / resnet_50
  • Test Dtypes: f32
  • Test Args: --dir=FWD_I --flags=GCHR --attr-post-ops=relu (an example invocation is sketched below)
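
A hedged example of one such benchdnn run; the batch-file path is an assumption based on oneDNN's benchdnn inputs layout:

```sh
# Performance mode over the resnet_50 bnorm shapes (path assumed).
./benchdnn --bnorm --mode=P --dt=f32 --dir=FWD_I --flags=GCHR \
    --attr-post-ops=relu --batch=inputs/bnorm/shapes_resnet_50
```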

We compare three methods: the scalar implementation as the first baseline, compiler auto-vectorization as the second baseline, and our RVV intrinsic implementation.

  1. Scalar: ncsp_batch_normalization implementation compiled by gcc 14.2 with -march=rv64gc -O3
  2. Auto Vectorization: ncsp_batch_normalization implementation compiled by gcc 14.2 with -march=rv64gcv -O3 -ftree-vectorize
  3. RVV Intrinsic: our rvv_batch_normalization implementation compiled by gcc 14.2 with -march=rv64gcv -O3

Results

On average, the RVV intrinsic implementation achieves a 1.24x speedup over the scalar method and a 1.18x speedup over compiler auto-vectorization.

In the best test case, mb1ic1024ih15n"densenet_121:conv4_blk/bn", the RVV intrinsics achieve a 3.98x speedup over the scalar method.

Detailed results are as follows:

Table 1: Runtime Comparisons of Scalar and RVV Intrinsic

| Cases | Scalar Runtime (ms) | Intrinsic Runtime (ms) | Speedup |
|---|---|---|---|
| densenet_121 | 504.05 | 466.69 | 1.08 |
| googlenet_v2 | 583.13 | 471.34 | 1.24 |
| googlenet_v3 | 890.16 | 785.98 | 1.13 |
| resnet_50 | 587.62 | 337.41 | 1.74 |

Table 2: Runtime Comparisons of Auto-vectorization and RVV Intrinsic

| Cases | Auto-vec Runtime (ms) | Intrinsic Runtime (ms) | Speedup |
|---|---|---|---|
| densenet_121 | 474.74 | 466.69 | 1.02 |
| googlenet_v2 | 557.66 | 471.34 | 1.18 |
| googlenet_v3 | 978.28 | 785.98 | 1.24 |
| resnet_50 | 412.80 | 337.41 | 1.22 |

@zhangjian29 force-pushed the add-rvv-batch-norm-integration branch from 79c8421 to 4f779ef on November 19, 2025.
@zhangjian29
Contributor Author

Hi @vpirogov @dzarukin @mgouicem,

Looking forward to your reviews and feedback on this PR. Thanks.


@mgouicem left a comment


Thanks for the contribution. Out of curiosity, do you know if the speedup over compiler vectorization is coming from relu post-op inlining or from fully unrolling over C?

@zhangjian29
Copy link
Contributor Author

zhangjian29 commented Nov 20, 2025

> Thanks for the contribution. Out of curiosity, do you know if the speedup over compiler vectorization is coming from relu post-op inlining or from fully unrolling over C?

@mgouicem Thanks for your question. We additionally ran test 1 with --tag=acdb --flags=G --attr-post-ops= and compared it against test 2 with --tag=acdb --flags=G --attr-post-ops=relu.

The results show that the major performance gain (about 90%) comes from vectorization over the channel dimension rather than from relu post-op inlining (a sketch of such a channel-dim loop follows the tables).

Table 1: Runtime Comparisons of Auto-vectorization and RVV Intrinsic in Test 1 (without relu)

| Cases | Auto-vec Runtime (ms) | Intrinsic Runtime (ms) | Speedup |
|---|---|---|---|
| densenet_121 | 531.39 | 21.37 | 24.87 |
| googlenet_v2 | 8828.95 | 572.58 | 15.42 |
| googlenet_v3 | 19713.8 | 1286.52 | 15.32 |
| resnet_50 | 5713.12 | 341.46 | 16.73 |
| total | 34787.26 | 2221.93 | 15.65 |

Table 2: Runtime Comparisons of Auto-vectorization and RVV Intrinsic in Test 2 (with relu)

| Cases | Auto-vec Runtime (ms) | Intrinsic Runtime (ms) | Speedup |
|---|---|---|---|
| densenet_121 | 578.73 | 21.67 | 26.71 |
| googlenet_v2 | 9670.51 | 564.31 | 17.14 |
| googlenet_v3 | 21422.7 | 1279.28 | 16.75 |
| resnet_50 | 6262.93 | 332.14 | 18.86 |
| total | 37934.87 | 2196.73 | 17.27 |
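
For illustration, a hedged sketch of the kind of channel-dim loop that makes the acdb path fast: channels are innermost and contiguous, so each spatial point gets unit-stride vector loads over C, with the per-channel folded scale/shift loaded as vectors rather than splatted scalars. Names and structure are assumptions, not the PR's code (which routes this case through a scratchpad reorder into the shared kernel):

```cpp
#include <riscv_vector.h>  // requires -march=rv64gcv

// Hedged sketch of channel-dim vectorization for acdb (NHWC) f32 data.
// a[c] = scale[c] / sqrt(var[c] + eps), b[c] = shift[c] - mean[c] * a[c].
static void bn_fwd_nhwc_f32(const float *src, float *dst, const float *a,
        const float *b, size_t sp_points /* = N*H*W */, size_t C) {
    for (size_t sp = 0; sp < sp_points; ++sp) {
        const float *x = src + sp * C;
        float *y = dst + sp * C;
        for (size_t c = 0; c < C;) {
            size_t vl = __riscv_vsetvl_e32m4(C - c);
            vfloat32m4_t vx = __riscv_vle32_v_f32m4(x + c, vl);
            vfloat32m4_t va = __riscv_vle32_v_f32m4(a + c, vl);
            vfloat32m4_t vb = __riscv_vle32_v_f32m4(b + c, vl);
            // vy = vx * va + vb: one FMA per element, all loads unit-stride.
            vfloat32m4_t vy = __riscv_vfmadd_vv_f32m4(vx, va, vb, vl);
            __riscv_vse32_v_f32m4(y + c, vy, vl);
            c += vl;
        }
    }
}
```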

@zhangjian29 force-pushed the add-rvv-batch-norm-integration branch from 32fbc0c to 67ab206 on November 20, 2025.
@zhangjian29 requested a review from vpirogov on November 20, 2025.
@vpirogov merged commit 8feeb55 into uxlfoundation:main on November 20, 2025. 30 checks passed.
@zhangjian29 deleted the add-rvv-batch-norm-integration branch on November 21, 2025.
