Optimize fvec_inner_product for AArch64 with NEON intrinsics #4689
base: main
Conversation
faiss/utils/distances_simd.cpp
Outdated
```cpp
float32x4_t tmp2 = vaddq_f32(sum[4], sum[5]);
float32x4_t tmp3 = vaddq_f32(sum[6], sum[7]);

float32x4_t total = vaddq_f32(vaddq_f32(tmp0, tmp1), vaddq_f32(tmp2, tmp3));
```
This code is suboptimal for d < 32, which happens inside faiss, because many of the vaddq_f32 operations go unused in that case. Please add more careful handling of when to enable this manual loop unrolling.
Also, could you please confirm that a modern compiler (not something like GCC 9) really generates worse code than this hand-written version? My impression is that a modern compiler can optimize a dot-product computation quite well nowadays.
Thanks
Thanks, I will add a check to enable manual loop unrolling only when d >= 32.
Regarding the compiler: yes, modern compilers do generate SIMD instructions, but they typically don't apply deep unrolling or use multiple accumulators the way the hand-written 8-way NEON version does. This limits instruction-level parallelism and throughput.
For reference, here is the generated assembly from GCC 14 and Clang 19: https://godbolt.org/z/P9zPTPasa
I also ran micro-benchmarks comparing the original code with the manual NEON version: the manual NEON code achieved better performance, and the uplift grows with the dimension.
- Add a check to enable manual loop unrolling only when d >= 32.
- Format code according to clang-format.
Force-pushed from 9405576 to 5c2554d
Updated the code to enable manual loop unrolling only when d >= 32, and formatted it with clang-format.
@subhadeepkaran will this be blocked on the internal SIMD-optimization-related refactors?
Yep, this file has also undergone a bunch of changes as part of the dynamic dispatch work.
Optimized fvec_inner_product for AArch64 with NEON intrinsics and an 8-way unrolled loop. Benchmarks (HNSW_IP index build on GIST1M) show ~37% faster build time on AWS m8g.16xlarge (Graviton4).
Benchmark script (measures both build and search time)