@seunghwak seunghwak commented Nov 5, 2025

This PR converts

MAX_ITOPK & MAX_CANDIDATES in single-CTA search and
MAX_ELEMENTS in multi-CTA search to runtime parameters.

This cuts the libcuvs.so size (CUDA 13, build.sh --allgpuarch -n libcuvs) from 459 MB to 350 MB.
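For readers less familiar with the trade-off, the sketch below contrasts the two styles on a toy host-side function. It is only an illustration and not the cuVS code: `count_below_static`, `count_below_dispatch`, `count_below_runtime`, and the supported sizes 64/128/256 are hypothetical names and values picked for the example. With a non-type template parameter, every supported value becomes its own instantiation in the binary; with a runtime parameter, one function body serves all values.

```cpp
#include <cstdint>
#include <stdexcept>

// Template style: MAX_ELEMENTS is baked into the function, so each value used
// by the dispatcher below becomes a separate instantiation in the binary.
template <std::uint32_t MAX_ELEMENTS>
std::uint32_t count_below_static(const float* dist, float threshold)
{
  std::uint32_t count = 0;
  for (std::uint32_t i = 0; i < MAX_ELEMENTS; i++) {  // trip count known at compile time
    if (dist[i] < threshold) { count++; }
  }
  return count;
}

// Dispatcher required by the template style: one case (and one instantiation)
// per supported value. The values 64/128/256 are made up for this sketch.
inline std::uint32_t count_below_dispatch(const float* dist,
                                          float threshold,
                                          std::uint32_t max_elements)
{
  switch (max_elements) {
    case 64: return count_below_static<64>(dist, threshold);
    case 128: return count_below_static<128>(dist, threshold);
    case 256: return count_below_static<256>(dist, threshold);
    default: throw std::invalid_argument("unsupported max_elements");
  }
}

// Runtime style: a single function body handles every size, at the cost of a
// loop bound the compiler can no longer treat as a constant.
inline std::uint32_t count_below_runtime(const float* dist,
                                         float threshold,
                                         std::uint32_t max_elements)
{
  std::uint32_t count = 0;
  for (std::uint32_t i = 0; i < max_elements; i++) {  // trip count known only at run time
    if (dist[i] < threshold) { count++; }
  }
  return count;
}
```

Both styles return the same result for a supported size; the difference is in how many copies of the loop end up in the binary and how much the compiler can specialize each copy.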

For correctness testing, I ran

NEIGHBORS_ANN_CAGRA_FLOAT_UINT32_TEST
NEIGHBORS_ANN_CAGRA_HALF_UINT32_TEST
NEIGHBORS_ANN_CAGRA_UINT8_UINT32_TEST
NEIGHBORS_ANN_CAGRA_INT8_UINT32_TEST
NEIGHBORS_ANN_CAGRA_HELPER_TEST
NEIGHBORS_ANN_CAGRA_TEST_BUGS

To evaluate the performance impact, I ran cuvs_bench on deep-image-96-inner with batch sizes 10, 100, 1000, and 10000 and k = 10 and 100 (default options for the other parameters):

python -m cuvs_bench.run --dataset deep-image-96-inner --algorithms cuvs_cagra --batch_size 10|100|1000|10000 -k 10|100

for both single-CTA and multi-CTA searches.

The performance impact varies with the MAX_ITOPK and MAX_CANDIDATES combination, but the numbers were roughly comparable (slightly slower on average, with a maximum slowdown of around 10%).

Let me know if there are other benchmarks I need to run to test performance.

Some performance logs from the original code and the updated code for anyone interested (batch size = 10K, K=100).

Original (main)

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                             Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries search_width total_queries
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/0/process_time/real_time          3.62 ms         3.62 ms          194   3.61323m   3.62005m   0.484826   0.702289       2.76241M/s        128        100             20        10k            1         1.94M algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/1/process_time/real_time          19.1 ms         19.1 ms           37  0.0190608  0.0190676   0.782904   0.705502       524.452k/s        128        100             20        10k            1          370k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/2/process_time/real_time          6.80 ms         6.80 ms          103   6.79127m   6.79774m   0.678528   0.700167       1.47108M/s        128        100             20        10k            2         1030k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/3/process_time/real_time          19.1 ms         19.1 ms           37  0.0190677  0.0190756    0.78237   0.705797       524.234k/s        128        100             20        10k            2          370k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/4/process_time/real_time          13.5 ms         13.5 ms           52  0.0134807  0.0134903   0.829391   0.701498       741.275k/s        128        100             20        10k            4          520k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/5/process_time/real_time          19.1 ms         19.1 ms           37  0.0190805  0.0190876   0.782356   0.706242       523.903k/s        128        100             20        10k            4          370k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/6/process_time/real_time          28.5 ms         28.5 ms           25  0.0284935  0.0285032   0.921725    0.71258        350.84k/s        128        100             20        10k            8          250k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/7/process_time/real_time          34.9 ms         34.9 ms           20  0.0348447  0.0348594   0.907621   0.697188       286.869k/s        128        100             20        10k            8          200k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/8/process_time/real_time          36.2 ms         36.2 ms           19   0.036145  0.0361537   0.929581   0.686921       276.598k/s        128        100             20        10k           16          190k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/9/process_time/real_time          62.9 ms         63.0 ms           11  0.0629066  0.0629149   0.965499   0.692063       158.946k/s        128        100             20        10k           16          110k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/10/process_time/real_time         44.5 ms         44.6 ms           16  0.0445286  0.0445393   0.938481   0.712628       224.522k/s        128        100             20        10k           32          160k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/11/process_time/real_time          119 ms          119 ms            6   0.119414   0.119426   0.989163   0.716554       83.7345k/s        128        100             20        10k           32           60k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/12/process_time/real_time         65.4 ms         65.5 ms           11  0.0654209  0.0654292   0.952963   0.719721       152.838k/s        128        100             20        10k           64          110k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/13/process_time/real_time          230 ms          230 ms            3   0.229596   0.229609   0.997039   0.688826       43.5526k/s        128        100             20        10k           64           30k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/14/process_time/real_time         4.47 ms         4.48 ms          160   4.46723m   4.47379m   0.493304   0.715806       2.23525M/s        256        100             20        10k            1          1.6M algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/15/process_time/real_time         35.0 ms         35.0 ms           20  0.0349487  0.0349571   0.907623   0.699142       286.066k/s        256        100             20        10k            1          200k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/16/process_time/real_time         7.21 ms         7.21 ms           99   7.20385m   7.21088m   0.683873   0.713877        1.3868M/s        256        100             20        10k            2          990k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/17/process_time/real_time         34.9 ms         35.0 ms           20  0.0349346   0.034943   0.907618    0.69886       286.182k/s        256        100             20        10k            2          200k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/18/process_time/real_time         13.5 ms         13.5 ms           52   0.013497  0.0135039   0.832033   0.702202       740.531k/s        256        100             20        10k            4          520k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/19/process_time/real_time         34.9 ms         34.9 ms           20  0.0348956  0.0349025    0.90743    0.69805       286.514k/s        256        100             20        10k            4          200k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/20/process_time/real_time         27.6 ms         27.6 ms           25   0.027547  0.0275539    0.92417   0.688848       362.926k/s        256        100             20        10k            8          250k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/21/process_time/real_time         34.9 ms         35.0 ms           20  0.0349296  0.0349367   0.907559   0.698733       286.234k/s        256        100             20        10k            8          200k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/22/process_time/real_time         62.5 ms         62.5 ms           12  0.0624942  0.0625044   0.969736   0.750053        159.99k/s        256        100             20        10k           16          120k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/23/process_time/real_time         63.0 ms         63.0 ms           11  0.0629969  0.0630073    0.96562   0.693081       158.712k/s        256        100             20        10k           16          110k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/24/process_time/real_time         67.0 ms         67.0 ms           10  0.0669716  0.0669819   0.972527   0.669819       149.295k/s        256        100             20        10k           32          100k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/25/process_time/real_time          119 ms          120 ms            6   0.119431   0.119446   0.989262   0.716678       83.7202k/s        256        100             20        10k           32           60k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/26/process_time/real_time         83.3 ms         83.3 ms            9  0.0832603  0.0832706   0.976267   0.749435       120.091k/s        256        100             20        10k           64           90k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/27/process_time/real_time          230 ms          230 ms            3   0.229851   0.229864   0.997017   0.689592       43.5042k/s        256        100             20        10k           64           30k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/28/process_time/real_time         6.11 ms         6.12 ms          121   6.10833m   6.11492m    0.50375   0.739905       1.63535M/s        512        100             20        10k            1         1.21M algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/29/process_time/real_time         63.0 ms         63.1 ms           11  0.0630118  0.0630202   0.965489   0.693222        158.68k/s        512        100             20        10k            1          110k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/30/process_time/real_time         8.52 ms         8.53 ms           85   8.51764m   8.52465m   0.689675   0.724595       1.17307M/s        512        100             20        10k            2          850k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/31/process_time/real_time         63.0 ms         63.0 ms           11  0.0629985  0.0630064   0.965549    0.69307       158.715k/s        512        100             20        10k            2          110k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/32/process_time/real_time         14.2 ms         14.2 ms           50  0.0142183  0.0142253   0.834733   0.711266       702.975k/s        512        100             20        10k            4          500k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/33/process_time/real_time         63.0 ms         63.1 ms           11  0.0630325    0.06304   0.965449    0.69344        158.63k/s        512        100             20        10k            4          110k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/34/process_time/real_time         29.3 ms         29.4 ms           24  0.0293341  0.0293412   0.924801   0.704188        340.82k/s        512        100             20        10k            8          240k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/35/process_time/real_time         63.0 ms         63.1 ms           11  0.0630259  0.0630417   0.965583   0.693459       158.626k/s        512        100             20        10k            8          110k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/36/process_time/real_time         63.9 ms         63.9 ms           11  0.0638791  0.0638931   0.971305   0.702824       156.512k/s        512        100             20        10k           16          110k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/37/process_time/real_time         63.0 ms         63.0 ms           11  0.0630017  0.0630159    0.96558   0.693175       158.691k/s        512        100             20        10k           16          110k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/38/process_time/real_time          112 ms          112 ms            6   0.111958   0.111974    0.98969   0.671843       89.3071k/s        512        100             20        10k           32           60k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/39/process_time/real_time          119 ms          120 ms            6   0.119477   0.119487   0.989262   0.716921       83.6917k/s        512        100             20        10k           32           60k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/40/process_time/real_time          128 ms          128 ms            6   0.128098   0.128109   0.990757   0.768653       78.0591k/s        512        100             20        10k           64           60k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/41/process_time/real_time          230 ms          230 ms            3   0.229528   0.229542      0.997   0.688625       43.5653k/s        512        100             20        10k           64           30k algo="multi_cta"
...

Updated

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                             Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second      itopk          k max_iterations  n_queries search_width total_queries
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/0/process_time/real_time          3.85 ms         3.85 ms          182   3.84319m   3.84976m   0.484826   0.700656       2.59758M/s        128        100             20        10k            1         1.82M algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/1/process_time/real_time          20.0 ms         20.0 ms           35  0.0199828  0.0199901   0.782268   0.699655       500.249k/s        128        100             20        10k            1          350k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/2/process_time/real_time          7.11 ms         7.11 ms           99    7.1041m     7.111m   0.678528   0.703989       1.40628M/s        128        100             20        10k            2          990k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/3/process_time/real_time          20.0 ms         20.0 ms           35  0.0199854  0.0199927   0.782475   0.699745       500.185k/s        128        100             20        10k            2          350k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/4/process_time/real_time          14.0 ms         14.0 ms           50  0.0139898  0.0139965   0.829392   0.699823       714.469k/s        128        100             20        10k            4          500k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/5/process_time/real_time          20.0 ms         20.0 ms           35  0.0200279  0.0200356   0.782799   0.701247       499.113k/s        128        100             20        10k            4          350k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/6/process_time/real_time          27.3 ms         27.3 ms           26  0.0272676  0.0272759   0.921724   0.709173       366.626k/s        128        100             20        10k            8          260k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/7/process_time/real_time          36.8 ms         36.8 ms           19  0.0368124  0.0368207   0.907537   0.699594       271.587k/s        128        100             20        10k            8          190k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/8/process_time/real_time          37.1 ms         37.1 ms           19  0.0370707  0.0370793   0.929584   0.704507       269.693k/s        128        100             20        10k           16          190k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/9/process_time/real_time          67.2 ms         67.2 ms           10  0.0671573   0.067169   0.965452    0.67169       148.879k/s        128        100             20        10k           16          100k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/10/process_time/real_time         45.7 ms         45.8 ms           15  0.0457259   0.045735   0.938489   0.686025       218.652k/s        128        100             20        10k           32          150k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/11/process_time/real_time          126 ms          127 ms            6   0.126466   0.126479   0.989206   0.758873        79.065k/s        128        100             20        10k           32           60k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/12/process_time/real_time         66.5 ms         66.5 ms           11  0.0664834  0.0664932    0.95296   0.731425       150.392k/s        128        100             20        10k           64          110k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/13/process_time/real_time          247 ms          247 ms            3   0.246731   0.246752   0.997073   0.740257       40.5267k/s        128        100             20        10k           64           30k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/14/process_time/real_time         4.64 ms         4.65 ms          153   4.63667m   4.64393m   0.493304   0.710522       2.15336M/s        256        100             20        10k            1         1.53M algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/15/process_time/real_time         36.9 ms         36.9 ms           19  0.0369105  0.0369215   0.907588   0.701508       270.847k/s        256        100             20        10k            1          190k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/16/process_time/real_time         7.36 ms         7.37 ms           96   7.35263m   7.36353m   0.683873   0.706898       1.35805M/s        256        100             20        10k            2          960k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/17/process_time/real_time         36.9 ms         37.0 ms           19  0.0369292  0.0369429   0.907739   0.701915        270.69k/s        256        100             20        10k            2          190k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/18/process_time/real_time         13.9 ms         13.9 ms           51  0.0138446   0.013853   0.832033   0.706504       721.868k/s        256        100             20        10k            4          510k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/19/process_time/real_time         36.9 ms         36.9 ms           19  0.0368749  0.0368857   0.907508   0.700828        271.11k/s        256        100             20        10k            4          190k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/20/process_time/real_time         26.9 ms         26.9 ms           26  0.0268844  0.0269012   0.924168    0.69943       371.733k/s        256        100             20        10k            8          260k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/21/process_time/real_time         36.9 ms         36.9 ms           19  0.0368803   0.036889   0.907639   0.700892       271.085k/s        256        100             20        10k            8          190k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/22/process_time/real_time         60.2 ms         60.2 ms           12  0.0601779  0.0601869   0.969735   0.722243        166.15k/s        256        100             20        10k           16          120k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/23/process_time/real_time         67.2 ms         67.2 ms           10  0.0671682  0.0671774   0.965478   0.671774       148.861k/s        256        100             20        10k           16          100k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/24/process_time/real_time         66.5 ms         66.5 ms           11  0.0664508  0.0664627   0.972523    0.73109       150.461k/s        256        100             20        10k           32          110k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/25/process_time/real_time          126 ms          126 ms            6   0.126317    0.12633   0.989187   0.757982       79.1579k/s        256        100             20        10k           32           60k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/26/process_time/real_time         83.1 ms         83.2 ms            8  0.0831302  0.0831442   0.976267   0.665153       120.274k/s        256        100             20        10k           64           80k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/27/process_time/real_time          246 ms          246 ms            3   0.246174   0.246191   0.997044   0.738573       40.6191k/s        256        100             20        10k           64           30k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/28/process_time/real_time         5.97 ms         5.97 ms          126   5.95886m   5.96637m    0.50375   0.751763       1.67607M/s        512        100             20        10k            1         1.26M algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/29/process_time/real_time         67.1 ms         67.2 ms           10  0.0671313  0.0671415   0.965613   0.671415        148.94k/s        512        100             20        10k            1          100k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/30/process_time/real_time         8.57 ms         8.58 ms           83   8.56254m   8.57102m   0.689676   0.711395       1.16673M/s        512        100             20        10k            2          830k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/31/process_time/real_time         67.1 ms         67.2 ms           10  0.0671225  0.0671318   0.965545   0.671318       148.961k/s        512        100             20        10k            2          100k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/32/process_time/real_time         14.7 ms         14.7 ms           48  0.0146687  0.0146754   0.834734   0.704418       681.416k/s        512        100             20        10k            4          480k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/33/process_time/real_time         67.1 ms         67.2 ms           10  0.0671119  0.0671212   0.965511   0.671212       148.985k/s        512        100             20        10k            4          100k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/34/process_time/real_time         27.5 ms         27.5 ms           26  0.0274668   0.027474   0.924802   0.714325       363.982k/s        512        100             20        10k            8          260k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/35/process_time/real_time         67.1 ms         67.2 ms           10  0.0671106  0.0671196   0.965551   0.671196       148.989k/s        512        100             20        10k            8          100k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/36/process_time/real_time         62.0 ms         62.0 ms           11  0.0619862  0.0619942   0.971312   0.681936       161.306k/s        512        100             20        10k           16          110k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/37/process_time/real_time         67.1 ms         67.2 ms           10  0.0671026  0.0671108   0.965524   0.671108       149.008k/s        512        100             20        10k           16          100k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/38/process_time/real_time          111 ms          111 ms            6   0.110588   0.110598   0.989693   0.663585       90.4184k/s        512        100             20        10k           32           60k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/39/process_time/real_time          126 ms          126 ms            6    0.12628   0.126289   0.989177   0.757733       79.1838k/s        512        100             20        10k           32           60k algo="multi_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/40/process_time/real_time          128 ms          128 ms            5   0.127816   0.127825   0.990756   0.639124       78.2324k/s        512        100             20        10k           64           50k algo="single_cta"
cuvs_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/41/process_time/real_time          246 ms          246 ms            3   0.246088   0.246109   0.997026   0.738326       40.6328k/s        512        100             20        10k           64           30k algo="multi_cta"
...
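As a quick way to read the two tables side by side, the sketch below computes the relative change in the GPU column for three rows copied from the logs above (/0, /13, and /34). It assumes that a trailing "m" in that column means milliseconds and that bare values are seconds; `Row`, `kRows`, `relative_change`, and `print_changes` are names made up for this example.

```cpp
#include <cstdio>

// Relative change of `after` with respect to `before`.
// For times, a positive value means the updated code is slower.
inline double relative_change(double before, double after)
{
  return (after - before) / before;
}

struct Row {
  const char* label;
  double gpu_before_s;  // "GPU" column, original (main), in seconds
  double gpu_after_s;   // "GPU" column, updated, in seconds
};

// Values copied from the logs; "3.61323m" is read as 3.61323e-3 s (assumption).
static const Row kRows[] = {
  {"/0  single_cta itopk=128 search_width=1", 3.61323e-3, 3.84319e-3},
  {"/13 multi_cta  itopk=128 search_width=64", 0.229596, 0.246731},
  {"/34 single_cta itopk=512 search_width=8", 0.0293341, 0.0274668},
};

// Print the percentage change for every row in kRows.
inline void print_changes()
{
  for (const Row& r : kRows) {
    std::printf("%-45s %+.2f%%\n", r.label, 100.0 * relative_change(r.gpu_before_s, r.gpu_after_s));
  }
}
```

Calling `print_changes()` from `main` shows the first two rows a little over 6% and 7% slower and the third a little over 6% faster, which is in line with the "roughly comparable, around 10% at most" summary in the description.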

copy-pr-bot bot commented Nov 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Comment on lines 44 to 71
template <class K, class V, unsigned warp_size = 32>
struct warp_merge_core_n {
RAFT_DEVICE_INLINE_FUNCTION void operator()(
K* ks, V* vs, unsigned n, const std::uint32_t range, const bool asc)
{
const auto lane_id = threadIdx.x % warp_size;

if (range == 1) {
for (std::uint32_t b = 2; b <= N; b <<= 1) {
for (std::uint32_t b = 2; b <= n; b <<= 1) {
for (std::uint32_t c = b / 2; c >= 1; c >>= 1) {
#pragma unroll
for (std::uint32_t i = 0; i < N; i++) {
for (std::uint32_t i = 0; i < n; i++) {
std::uint32_t j = i ^ c;
if (i >= j) continue;
const auto line_id = i + (N * lane_id);
const auto line_id = i + (n * lane_id);
const auto p = static_cast<bool>(line_id & b) == static_cast<bool>(line_id & c);
swap_if_needed(k[i], v[i], k[j], v[j], p);
swap_if_needed(ks[i], vs[i], ks[j], vs[j], p);
}
}
}
return;
}

const std::uint32_t b = range;
for (std::uint32_t c = b / 2; c >= 1; c >>= 1) {
const auto p = static_cast<bool>(lane_id & b) == static_cast<bool>(lane_id & c);
#pragma unroll
for (std::uint32_t i = 0; i < N; i++) {
swap_if_needed(k[i], v[i], c, p);
}
}
const auto p = ((lane_id & b) == 0);
for (std::uint32_t c = N / 2; c >= 1; c >>= 1) {
} else {
const std::uint32_t b = range;
for (std::uint32_t c = b / 2; c >= 1; c >>= 1) {
const auto p = static_cast<bool>(lane_id & b) == static_cast<bool>(lane_id & c);
#pragma unroll
for (std::uint32_t i = 0; i < N; i++) {
std::uint32_t j = i ^ c;
if (i >= j) continue;
swap_if_needed(k[i], v[i], k[j], v[j], p);
}
}
}
};

template <class K, class V, unsigned warp_size>
struct warp_merge_core<K, V, 6, warp_size> {
RAFT_DEVICE_INLINE_FUNCTION void operator()(K k[6],
V v[6],
const std::uint32_t range,
const bool asc)
{
constexpr unsigned N = 6;
const auto lane_id = threadIdx.x % warp_size;

if (range == 1) {
for (std::uint32_t i = 0; i < N; i += 3) {
const auto p = (i == 0);
swap_if_needed(k[0 + i], v[0 + i], k[1 + i], v[1 + i], p);
swap_if_needed(k[1 + i], v[1 + i], k[2 + i], v[2 + i], p);
swap_if_needed(k[0 + i], v[0 + i], k[1 + i], v[1 + i], p);
}
const auto p = ((lane_id & 1) == 0);
for (std::uint32_t i = 0; i < 3; i++) {
std::uint32_t j = i + 3;
swap_if_needed(k[i], v[i], k[j], v[j], p);
}
for (std::uint32_t i = 0; i < N; i += 3) {
swap_if_needed(k[0 + i], v[0 + i], k[1 + i], v[1 + i], p);
swap_if_needed(k[1 + i], v[1 + i], k[2 + i], v[2 + i], p);
swap_if_needed(k[0 + i], v[0 + i], k[1 + i], v[1 + i], p);
}
return;
}

const std::uint32_t b = range;
for (std::uint32_t c = b / 2; c >= 1; c >>= 1) {
const auto p = static_cast<bool>(lane_id & b) == static_cast<bool>(lane_id & c);
#pragma unroll
for (std::uint32_t i = 0; i < N; i++) {
swap_if_needed(k[i], v[i], c, p);
for (std::uint32_t i = 0; i < n; i++) {
swap_if_needed(ks[i], vs[i], c, p);
}
Contributor

@achirkin achirkin Nov 7, 2025

This change effectively removes all of the loop unrolling, because n is not known at compile time (you can safely remove #pragma unroll, as it does nothing now, btw). In particular, this means the input arrays k and v can no longer be passed and accessed via registers. This will likely have a huge impact on performance.
Please run a few benchmarks using ANN_BENCH first to see the impact on throughput. From there, we can decide whether (a) the performance is acceptable, (b) we need to profile the kernel using NCU and try to improve performance, or (c) the perf state is hopeless and cannot be recovered without manual loop unrolling / restoring the template parameter.
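To make the unrolling point concrete, here is a host-side sketch in plain C++ (not the CUDA code under review) of the same in-thread bitonic network written both ways. The names `bitonic_sort_static` and `bitonic_sort_runtime` are hypothetical, values are omitted to keep it short, and the length is assumed to be a power of two. In the first version every loop bound is a constant, so a compiler is free to unroll the loops fully and keep the array in registers; in the second it only sees a pointer and a runtime bound. Whether that actually costs throughput on the GPU depends on the compiler and has to be measured.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Compare-exchange: afterwards a <= b if asc is true, a >= b otherwise.
inline void swap_if_needed(int& a, int& b, bool asc)
{
  if ((a != b) && ((a < b) != asc)) { std::swap(a, b); }
}

// Compile-time length: all trip counts and indices are constants after unrolling.
template <std::size_t N>
void bitonic_sort_static(std::array<int, N>& k)
{
  static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");
  for (std::uint32_t b = 2; b <= N; b <<= 1) {
    for (std::uint32_t c = b / 2; c >= 1; c >>= 1) {
      for (std::uint32_t i = 0; i < N; i++) {
        const std::uint32_t j = i ^ c;
        if (i >= j) continue;
        swap_if_needed(k[i], k[j], (i & b) == 0);  // direction alternates per block of size b
      }
    }
  }
}

// Runtime length: same network, but n (a power of two) is only known at run time.
inline void bitonic_sort_runtime(int* ks, std::uint32_t n)
{
  for (std::uint32_t b = 2; b <= n; b <<= 1) {
    for (std::uint32_t c = b / 2; c >= 1; c >>= 1) {
      for (std::uint32_t i = 0; i < n; i++) {
        const std::uint32_t j = i ^ c;
        if (i >= j) continue;
        swap_if_needed(ks[i], ks[j], (i & b) == 0);
      }
    }
  }
}
```

Both functions sort their input in ascending order and produce identical results; the sketch only illustrates what information the compiler has in each case, not how the kernels in this PR behave.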

For the benchmarks, I'd suggest the following parameter sweep:

./build.sh -n libcuvs bench-ann --limit-bench-ann=CUVS_CAGRA_ANN_BENCH
./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH \
  --search \
  --benchmark_min_time=10s \
  --benchmark_min_warmup_time=0.001 \
  --benchmark_counters_tabular=true \
  --benchmark_out=cagra-search-`git rev-parse --abbrev-ref HEAD`.csv \
  --benchmark_out_format=csv \
  --data_prefix=<data folder> \
  --index_prefix=<index folder> \
  --override_kv=algo:\"single_cta\" \
  --override_kv=k:10:100 \
  --override_kv=itopk:32:64:128:256:512 \
  --override_kv=max_iterations:20 \
  --override_kv=n_queries:10000 \
  <config file>

@seunghwak seunghwak changed the title from "[WIP] Convert non-type template parameters to runtime parameters in CAGRA search to cut binary size" to "Convert non-type template parameters to runtime parameters in CAGRA search to cut binary size" Nov 18, 2025
@seunghwak seunghwak marked this pull request as ready for review November 18, 2025 17:44
@seunghwak seunghwak requested a review from a team as a code owner November 18, 2025 17:44