cpu: rv64: add rvv batch normalization implementation using rvv intrinsics #4158
Conversation
mgouicem left a comment:
Thanks for the contribution. Out of curiosity, do you know if the speedup over compiler vectorization is coming from relu post-op inlining or from fully unrolling over C?
@mgouicem Thanks for your question. We additionally ran Test 1 (without relu) and Test 2 (with relu). Results show that the major performance gain (~90%) comes from the vectorization over the channel dim rather than from the relu post-op inlining.

Table 1: Runtime comparisons of auto-vectorization and RVV intrinsic in Test 1 (without relu)

Table 2: Runtime comparisons of auto-vectorization and RVV intrinsic in Test 2 (with relu)
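For illustration, here is a minimal hypothetical sketch (our own function name and signature, not the kernel in this PR) of the two effects under discussion: vectorization across the `C` dim for channels-last data, with the relu post-op inlined into the same pass. Removing the `vfmax` line gives the Test 1 (without relu) variant.

```c
#include <riscv_vector.h>
#include <stddef.h>

// Hypothetical sketch: normalize one spatial point of a channels-last (axb)
// tensor, vectorizing across C. a[c] and b[c] are the folded per-channel
// terms (a = scale / sqrt(var + eps), b = shift - mean * a), so the loop
// body is a single fused multiply-add. The relu post-op is inlined as a
// vfmax in the same pass, so dst is written exactly once.
static void bn_axb_point_f32(const float *x, float *y, const float *a,
        const float *b, size_t C, int with_relu) {
    for (size_t c = 0; c < C;) {
        size_t vl = __riscv_vsetvl_e32m8(C - c);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x + c, vl);
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a + c, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b + c, vl);
        vfloat32m8_t vy = __riscv_vfmadd_vv_f32m8(vx, va, vb, vl); // y = a*x + b
        if (with_relu) vy = __riscv_vfmax_vf_f32m8(vy, 0.f, vl);   // inlined relu
        __riscv_vse32_v_f32m8(y + c, vy, vl);
        c += vl;
    }
}
```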
Description
This PR introduces an optimized batch normalization (bnorm) primitive for RV64 architectures using RVV (RISC-V Vector) intrinsics. The `rvv_batch_normalization` implementation is aligned with `acl_batch_normalization` for maintainability.

This initial version provides:

- `f32` only
- `G` required; `C`/`H`/`R` optional (`R` requires inference)
- `relu` without `alpha`/`beta` only (by integrating with `rvv_postops.hpp`)
- `FWD_D` and `FWD_I`

Implementation Details
- For the data tag of `abx`, `rvv_bnorm` vectorizes over the `W` dim for a fixed channel; for the data tag of `axb`, it vectorizes across the `C` dim.
- With the `scratchpad` method, the two vectorization schemes for the different data tags can utilize a shared intrinsic kernel, `bn_fwd_kernel_f32` (see the sketch after this list).
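As a rough illustration of the `abx` path, here is a hypothetical strip-mined RVV sketch (our own function name and signature, not the PR's actual `bn_fwd_kernel_f32`): the per-channel parameters are folded into a scalar multiplier and bias once, and the contiguous spatial run is then swept with vector loads and stores.

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <math.h>

// Hypothetical sketch: abx (channels-first) layout, one fixed channel.
// The affine terms are folded once per channel, then the contiguous
// spatial run is vectorized with vsetvl-based strip mining.
static void bn_abx_row_f32(const float *x, float *y, size_t spatial,
        float mean, float var, float scale, float shift, float eps) {
    const float a = scale / sqrtf(var + eps); // folded multiplier
    const float b = shift - mean * a;         // folded bias: y = a*x + b
    for (size_t i = 0; i < spatial;) {
        size_t vl = __riscv_vsetvl_e32m8(spatial - i);
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x + i, vl);
        vfloat32m8_t vb = __riscv_vfmv_v_f_f32m8(b, vl);          // broadcast b
        vfloat32m8_t vy = __riscv_vfmadd_vf_f32m8(vx, a, vb, vl); // a*x + b
        __riscv_vse32_v_f32m8(y + i, vy, vl);
        i += vl;
    }
}
```

Precomputing the folded `a`/`b` terms per channel (e.g., into the scratchpad) is presumably what lets the `abx` and `axb` paths share one inner kernel: the only difference is whether the parameters arrive as broadcast scalars or as vector loads.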
Checklist

General

- Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?

Performance improvements
All experiments are performed on an SG2044 platform with the following settings:
- `benchdnn` input shapes of densenet_121 / googlenet_v2 / googlenet_v3 / resnet_50
- `f32`
- `--dir=FWD_I --flags=GCHR --attr-post-ops=relu`

We draw comparisons among a first baseline of the scalar implementation, a second baseline of compiler auto-vectorization, and our RVV intrinsic implementation:
- Scalar: the `ncsp_batch_normalization` implementation compiled by gcc 14.2 with `-march=rv64gc -O3`
- Auto-vectorization: the `ncsp_batch_normalization` implementation compiled by gcc 14.2 with `-march=rv64gcv -O3 -ftree-vectorize`
- RVV intrinsic: the `rvv_batch_normalization` implementation compiled by gcc 14.2 with `-march=rv64gc -O3`
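For reference, a single problem from this set can be timed with a benchdnn invocation along these lines (assuming the standard benchdnn CLI; the shape shown is the best case quoted in the results below):

```sh
./benchdnn --bnorm --mode=P --dir=FWD_I --flags=GCHR \
    --attr-post-ops=relu mb1ic1024ih15n"densenet_121:conv4_blk/bn"
```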
Results

On average, the RVV intrinsic implementation achieves a 1.24x speedup over the scalar method and a 1.18x speedup over compiler auto-vectorization.
In the best test case, `mb1ic1024ih15n"densenet_121:conv4_blk/bn"`, the RVV intrinsics achieve a 3.98x speedup over the scalar method.

Detailed results are as follows:
Table 1: Runtime comparisons of scalar and RVV intrinsic

Table 2: Runtime comparisons of auto-vectorization and RVV intrinsic