[searchlib] NFC: prefetch tensors in HNSW index search #35057
base: master
Conversation
Hi @boeker! I see you've committed plenty into hnsw_index.cpp recently, would you be able to give this PR a look?
Thanks for the detailed and very interesting writeup! We'll get to reviewing this as soon as time permits.
I was inspired by your inspiration 🙂 and decided to implement an explicitly vectorized binary hamming distance function (via Highway) in #35073. On NEON it beats the auto-vectorized code by ~1.6x on 128-byte vectors and ~2.1x for 8192-byte vectors. On SVE/SVE2 I get a ~2.1x speedup for 128 bytes; for 8192 bytes SVE(2) beats the auto-vectorized code by ~3x. Difference on x64 AVX3-DL (AVX-512 +
Note: these vector kernels are not yet enabled by default, but they will be soon. Benchmarked on an AWS Graviton 4 node using benchmark functionality added as part of #35073:
Would be very interested in hearing what vector length the 1.8x was observed on, and your approach for getting there.
Cool!
I think it was a single-file gbench with copy-pasted ...
I think this difference between x1.8 and x1.6 has to do with jumping through ...
Hi @vekterli! Did you have a chance to give this PR a closer look?
Sorry about the delay, we've been rather busy 😓 Hoping we'll get around to it this week.
My gut feeling is that we may want to template the search layer functions (filtered and unfiltered) on some kind of prefetching policy and then have a runtime decision on whether we want to actually prefetch anything, based on config and/or the size of the graph. In my anecdotal experience, explicitly prefetching often makes the performance go down if enough stuff is already present in the cache hierarchy, which may be the case for small graphs. But at some point the curves will intersect and prefetching should present an increasingly visible gain.
Could also be an interesting experiment to see if prefetching into only caches > L1 would be beneficial. L1D is comparatively tiny, so when prefetching many vectors we may (emphasis: may—needs benchmarking!) risk evicting useful stuff from it that we'll end up needing before we actually get around to using the vectors themselves.
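For illustration, the policy-templating idea might be sketched roughly as below. This is a hypothetical sketch: the policy names, visit_neighbors, and its arguments are invented here and are not the actual hnsw_index.cpp code.

#include <vector>

// Hypothetical prefetch policies selected at compile time.
struct NoPrefetch {
    static void prefetch(const void*) noexcept {}
};
struct PrefetchToAllCaches {
    static void prefetch(const void* addr) noexcept {
        __builtin_prefetch(addr, 0 /* read */, 3 /* keep in all cache levels */);
    }
};

// Illustrative stand-in for the hot search-layer loop, templated on the policy.
template <typename Prefetch>
void visit_neighbors(const std::vector<const void*>& neighbor_vectors) {
    for (const void* vec : neighbor_vectors) {
        Prefetch::prefetch(vec);  // no-op or a real hint, decided at compile time
    }
    // ... distance calculations over the (hopefully cached) vectors ...
}

// Runtime decision (config, query, graph size) picks the specialization once,
// so the inner loop itself carries no extra branch.
void visit_neighbors_dispatch(const std::vector<const void*>& neighbor_vectors,
                              bool enable_prefetch) {
    if (enable_prefetch) {
        visit_neighbors<PrefetchToAllCaches>(neighbor_vectors);
    } else {
        visit_neighbors<NoPrefetch>(neighbor_vectors);
    }
}

The temporal-locality argument of __builtin_prefetch would also be a natural knob for experimenting with the "prefetch only into caches beyond L1" idea mentioned above.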
I mostly agree, and I happen to share the same anecdotal experience.
Sounds very reasonable to me, although I believe that prefetching in ...
The prefetching policy you mentioned: I assume it has to make a run-time decision based on a ..., right?
To avoid falling for the delicious temptation to make a complex policy, I think that an initial implementation should probably be one where the prefetching decision is made by the query and/or the rank profile rather than being deduced by the code. This lets us easily do performance testing with/without prefetching for various scenarios without having to recompile or reconfigure the entire system.
One question regarding the diff: it adds prefetching to ...
vekterli left a comment
Sorry for the delay, lots of stuff popping up... 👀
As I alluded to in the TensorBufferStore code, I have a concern that we risk polluting the caches with the current approach when there are many subspaces. Ideally we should start out with tensors stored in DenseTensorStore to avoid this risk, since dense tensors do not have multiple subspaces.
Re: my earlier comment/question, it is not entirely clear to me why your seemingly dense tensors ended up instantiating a SerializedFastValueAttribute, so that would be good to figure out first.
return df.calc(rhs);
}

void prefetch_docid(const DocVectorAccess& vectors, std::uint32_t docid)
We generally do not use std:: prefixes with fundamental numeric types; prefer uint32_t instead.
SerializedFastValueAttribute::prefetch_docid(uint32_t docid) const noexcept
{
    const auto* storage_start = std::addressof(_refVector.acquire_elem_ref(0));
    __builtin_prefetch(storage_start + docid);
I think this is a bit too leaky with regard to RcuVector's implementation. Consider adding prefetch_elem_ref to RcuVectorBase to hide the details.
Makes sense to me, will do
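For illustration, the suggested helper might look roughly like this. A sketch only: RcuVectorBase's real storage handling in vespalib is more involved than the single atomic pointer used here.

#include <atomic>
#include <cstddef>

// Sketch of a prefetch helper that keeps the storage layout private to the
// vector class, so callers don't need to know how elements are laid out.
template <typename T>
class RcuVectorBaseSketch {
public:
    void prefetch_elem_ref(std::size_t idx) const noexcept {
        const T* data = _data.load(std::memory_order_acquire);
        __builtin_prefetch(data + idx);  // read prefetch of the element's cache line
    }
private:
    std::atomic<const T*> _data{nullptr};
};

The attribute-side code above would then reduce to something like _refVector.prefetch_elem_ref(docid), keeping acquire_elem_ref and the raw pointer arithmetic out of SerializedFastValueAttribute.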
    .neighbor_ref = neighbor_ref,
    .neighbor_docid = neighbor_docid,
    .neighbor_subspace = neighbor_subspace,
});
Consider just using emplace_back instead of designated initializers in a temporary
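For illustration only (the struct and helper below are hypothetical placeholders mirroring the fields in the diff above):

#include <cstdint>
#include <vector>

struct NeighborToVisit {       // hypothetical aggregate
    uint32_t neighbor_ref;
    uint32_t neighbor_docid;
    uint32_t neighbor_subspace;
};

void queue_neighbor(std::vector<NeighborToVisit>& pending,
                    uint32_t ref, uint32_t docid, uint32_t subspace) {
    // Designated initializers build a temporary that is then moved into the vector:
    pending.push_back({.neighbor_ref = ref, .neighbor_docid = docid, .neighbor_subspace = subspace});
    // emplace_back constructs the element in place from the arguments
    // (valid for aggregates since C++20's parenthesized aggregate initialization):
    pending.emplace_back(ref, docid, subspace);
}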
for (std::size_t i = 0; i < buf.size(); i += 64) {
    __builtin_prefetch(buf.data() + i);
}
I have concerns about this inherently prefetching everything across all subspaces, which can pollute the caches with stuff that will not be used (especially L1d)... Conceptually, sparse subspaces are unbounded, so this has the potential of blowing all cache hierarchies out of the water.
The first cache line of buf holds the subspace header, so maybe that should be prefetched first, then prefetch the actual subspace afterwards (get_vector(docid, subspace)).
However, I think any vector prefetching functionality should first be implemented for explicitly dense tensors, as we always want to look at the entire buffer for those.
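A rough sketch of the header-first idea described here, for illustration only: the helper names and the offset/length parameters are invented, and the real tensor buffer layout and TensorBufferOperations API differ.

#include <cstddef>

constexpr std::size_t CACHE_LINE_BYTES = 64;

// Step 1: touch only the first cache line, which holds the subspace header.
void prefetch_subspace_header(const char* buf) {
    __builtin_prefetch(buf);
}

// Step 2: once the subspace of interest is known (cf. get_vector(docid, subspace)),
// prefetch only that subspace's cache lines rather than the entire buffer.
void prefetch_one_subspace(const char* buf,
                           std::size_t subspace_offset,
                           std::size_t subspace_bytes) {
    for (std::size_t i = 0; i < subspace_bytes; i += CACHE_LINE_BYTES) {
        __builtin_prefetch(buf + subspace_offset + i);
    }
}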
Sorry for the late reply, I also got consumed by other things. Regarding the tensor type: it actually is defined as ...
I am also not sure that prefetching the whole tensor data is a good thing to do, due to exactly the same concerns you outlined above. Initially I tried prefetching just the first 4 cache lines, but that felt arbitrary, so I ended up with the current approach; there was no noticeable difference between the two.
Unfortunately, I won't be able to easily measure this change with dense tensors, as there aren't any in my setup.
Force-pushed from caaa801 to dd22a05
I've addressed your inline comments and implemented a simple on/off policy for the prefetching. Please forgive the force-push: I'm upstreaming these changes from a fork that considerably lags behind, and rebasing on top of current master turned out to be non-trivial due to recent changes.
Also, I've got some follow-up work which also does prefetching in ranking (when accessing attribute values, tensors, etc.), and it shows improvements in my specific setup, so probably we could rename the ...
I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.
TLDR: ~x2 speedup of HNSW index by __builtin_prefetch-ing tensors
Hi!

I was benchmarking my Vespa setup the other day, and decided to get a flamegraph of what proton is doing:
As expected, a lot of CPU cycles are being spent in the HNSW index doing HNSW things, one of them being binary_hamming_distance. I decided to give it a closer look and soon realized that in my setup (gcc 14.2, aarch64) gcc fails miserably to produce an unrolled/vectorized version of the code, and that it could probably be improved by either SimSIMD or hand-written intrinsics. Inspired, I implemented a version of binary_hamming_distance that beats the current code by ~x1.8 in micro-benchmarks, deployed it, benchmarked... and saw no difference at all, none, zero.
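For context, binary hamming distance over packed vectors is essentially XOR plus popcount. A simplified scalar sketch of that shape is shown below, for illustration only; this is neither the Vespa implementation nor the faster version mentioned above.

#include <bit>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Count differing bits between two packed byte buffers of size sz.
size_t binary_hamming_distance_sketch(const void* lhs, const void* rhs, size_t sz) {
    const auto* a = static_cast<const uint8_t*>(lhs);
    const auto* b = static_cast<const uint8_t*>(rhs);
    size_t sum = 0;
    size_t i = 0;
    for (; i + 8 <= sz; i += 8) {               // word-at-a-time XOR + popcount
        uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        sum += std::popcount(wa ^ wb);
    }
    for (; i < sz; ++i) {                       // tail bytes
        sum += std::popcount(static_cast<unsigned>(a[i] ^ b[i]));
    }
    return sum;
}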
That confused me, so I decided to give another function a look: get_num_subspaces_and_flag. Well, that's a one-instruction (3 with asserts) function, huh? Could it be the overhead of it being in another TU and actually requiring a call? Well... unlikely. That confused me even more, and I went on to check if my perf setup was working OK.
At some point I realized that I have an HNSW with ~1B of tensor<int8>(d1[128]), which is a lot of memory with a very unpredictable access pattern, and that the one-instruction get_num_subspaces_and_flag function is actually a memory load. So I built a flamegraph for last-level-cache misses (perf record -e LLC-load-misses), and suddenly everything made sense:
[flamegraph: LLC-load-misses, trunk]
As one can see, the flamegraphs for cpu-cycles and LLC-load-misses look basically the same for the HNSW index.
Looking closer at the perf data for LLC misses, I concluded that the time in search::tensor::SerializedFastValueAttribute::get_vector is coming from here: vespa/searchlib/src/vespa/searchlib/tensor/serialized_fast_value_attribute.cpp, line 47 at 7691c8e:
- get_num_subspaces_and_flag is a load from the tensor buffer
- the loads in binary_hamming_distance are also loads from the tensor buffer
The good thing about the HNSW memory access pattern is that although it's very hard for the hardware to predict and prefetch, we know exactly which memory we will have to access: given a vertex in the graph, we check all its neighbors. Thus we can rearrange how we walk the neighbors in such a way that, with some hints to the hardware, there would be way fewer misses:
- prefetch the neighbors' entries in TensorAttribute::_refVector
- prefetch the tensor data via TensorBufferOperations (which requires a load from TensorAttribute::_refVector, but hopefully at this point the memory would already be brought to the caches)
(A rough sketch of this reordered walk is included at the end of this description.)
That's a lot of "hopefully", but that's just how prefetching hints work, and it turns out they work wonders: with the patch applied, the flamegraphs for proton look like this

with the LLC-load-misses flamegraph looking almost the same as it did, as expected (LLC-misses from prefetching are still there, but now they are async and don't stall us that much)

Comparing the trunk cpu-cycles flamegraph with the patched cpu-cycles flamegraph, it looks like there are ~x2 fewer cycles spent in the HNSW index, which amounts to ~10% of total CPU cycles spent, and that matches the CPU usage/timings I'm observing when benchmarking.
All benchmarks were conducted at commit b60aa0d, ~1B of tensor<int8>(d1[128]), AWS Graviton4.
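To make the reordered walk described above concrete, here is a rough sketch of the prefetch-then-compute pattern. VectorStore, its members, and the distance kernel are simplified stand-ins, not the actual interfaces touched by this PR.

#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the vector/attribute access used by the HNSW search
// loop; the real DocVectorAccess, tensor stores, and distance functions differ.
struct VectorStore {
    std::vector<std::vector<int8_t>> vectors;  // one packed vector per docid

    void prefetch_docid(uint32_t docid) const noexcept {
        // Hint the hardware to start pulling this doc's tensor data towards the caches.
        __builtin_prefetch(vectors[docid].data());
    }

    int distance(uint32_t lhs, uint32_t rhs) const {
        const auto& a = vectors[lhs];
        const auto& b = vectors[rhs];
        int sum = 0;
        for (size_t i = 0; i < a.size(); ++i) {
            sum += __builtin_popcount(static_cast<uint8_t>(a[i] ^ b[i]));
        }
        return sum;
    }
};

// Reordered neighbor walk: issue prefetch hints for all neighbors first, then
// run the distance calculations. By the time distance() touches the tensor
// data, the prefetched cache lines have hopefully already arrived.
void score_neighbors(const VectorStore& store, uint32_t query_docid,
                     const std::vector<uint32_t>& neighbors,
                     std::vector<int>& distances_out) {
    for (uint32_t docid : neighbors) {
        store.prefetch_docid(docid);
    }
    distances_out.clear();
    for (uint32_t docid : neighbors) {
        distances_out.push_back(store.distance(query_docid, docid));
    }
}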