Optimize fvec_inner_product for AArch64 with NEON intrinsics #4689
base: main
Conversation
faiss/utils/distances_simd.cpp
Outdated
```cpp
float32x4_t tmp2 = vaddq_f32(sum[4], sum[5]);
float32x4_t tmp3 = vaddq_f32(sum[6], sum[7]);

float32x4_t total = vaddq_f32(vaddq_f32(tmp0, tmp1), vaddq_f32(tmp2, tmp3));
```
This code is suboptimal for d < 32, which happens inside faiss, because many of the vaddq_f32 operations go unused in that case. Please add more careful handling of when to enable this manual loop unrolling.
Also, could you please confirm that a modern compiler (not something like GCC 9) really generates worse code than this hand-written version? My impression is that a modern compiler can optimize a dot-product computation quite well nowadays.
Thanks
Thanks, I will add a check to enable manual loop unrolling only when d >= 32.
Regarding the compiler: yes, modern compilers do generate SIMD instructions, but they typically don't apply deep unrolling or use multiple accumulators the way the hand-written 8-way NEON version does. This limits instruction-level parallelism and throughput.
For reference, here is the generated assembly from GCC 14 and Clang 19: https://godbolt.org/z/P9zPTPasa
I also ran micro-benchmarks comparing the original code with the manual NEON version: the manual NEON code achieved better performance, and the uplift grows with the dimension.
- Add a check to enable manual loop unrolling only when d >= 32.
- Format code according to clang-format.
Force-pushed from 9405576 to 5c2554d
Updated the code to enable manual loop unrolling only when d >= 32, and formatted it with clang-format.
@subhadeepkaran will this be blocked on the internal SIMD-optimization-related refactors?
Yep, this file has also undergone a bunch of changes as part of the dynamic dispatch work.
Optimized fvec_inner_product for AArch64 with NEON intrinsics and an 8-way unrolled loop. Benchmarks (HNSW_IP index build on GIST1M) show ~37% faster build time on AWS m8g.16xlarge (Graviton4).
Benchmark script (measures both build and search time)