-
Notifications
You must be signed in to change notification settings - Fork 145
Convert non-type template parameters to runtime parameters in CAGRA search to cut binary size #1498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
| template <class K, class V, unsigned warp_size = 32> | ||
| struct warp_merge_core_n { | ||
| RAFT_DEVICE_INLINE_FUNCTION void operator()( | ||
| K* ks, V* vs, unsigned n, const std::uint32_t range, const bool asc) | ||
| { | ||
| const auto lane_id = threadIdx.x % warp_size; | ||
|
|
||
| if (range == 1) { | ||
| for (std::uint32_t b = 2; b <= N; b <<= 1) { | ||
| for (std::uint32_t b = 2; b <= n; b <<= 1) { | ||
| for (std::uint32_t c = b / 2; c >= 1; c >>= 1) { | ||
| #pragma unroll | ||
| for (std::uint32_t i = 0; i < N; i++) { | ||
| for (std::uint32_t i = 0; i < n; i++) { | ||
| std::uint32_t j = i ^ c; | ||
| if (i >= j) continue; | ||
| const auto line_id = i + (N * lane_id); | ||
| const auto line_id = i + (n * lane_id); | ||
| const auto p = static_cast<bool>(line_id & b) == static_cast<bool>(line_id & c); | ||
| swap_if_needed(k[i], v[i], k[j], v[j], p); | ||
| swap_if_needed(ks[i], vs[i], ks[j], vs[j], p); | ||
| } | ||
| } | ||
| } | ||
| return; | ||
| } | ||
|
|
||
| const std::uint32_t b = range; | ||
| for (std::uint32_t c = b / 2; c >= 1; c >>= 1) { | ||
| const auto p = static_cast<bool>(lane_id & b) == static_cast<bool>(lane_id & c); | ||
| #pragma unroll | ||
| for (std::uint32_t i = 0; i < N; i++) { | ||
| swap_if_needed(k[i], v[i], c, p); | ||
| } | ||
| } | ||
| const auto p = ((lane_id & b) == 0); | ||
| for (std::uint32_t c = N / 2; c >= 1; c >>= 1) { | ||
| } else { | ||
| const std::uint32_t b = range; | ||
| for (std::uint32_t c = b / 2; c >= 1; c >>= 1) { | ||
| const auto p = static_cast<bool>(lane_id & b) == static_cast<bool>(lane_id & c); | ||
| #pragma unroll | ||
| for (std::uint32_t i = 0; i < N; i++) { | ||
| std::uint32_t j = i ^ c; | ||
| if (i >= j) continue; | ||
| swap_if_needed(k[i], v[i], k[j], v[j], p); | ||
| } | ||
| } | ||
| } | ||
| }; | ||
|
|
||
| template <class K, class V, unsigned warp_size> | ||
| struct warp_merge_core<K, V, 6, warp_size> { | ||
| RAFT_DEVICE_INLINE_FUNCTION void operator()(K k[6], | ||
| V v[6], | ||
| const std::uint32_t range, | ||
| const bool asc) | ||
| { | ||
| constexpr unsigned N = 6; | ||
| const auto lane_id = threadIdx.x % warp_size; | ||
|
|
||
| if (range == 1) { | ||
| for (std::uint32_t i = 0; i < N; i += 3) { | ||
| const auto p = (i == 0); | ||
| swap_if_needed(k[0 + i], v[0 + i], k[1 + i], v[1 + i], p); | ||
| swap_if_needed(k[1 + i], v[1 + i], k[2 + i], v[2 + i], p); | ||
| swap_if_needed(k[0 + i], v[0 + i], k[1 + i], v[1 + i], p); | ||
| } | ||
| const auto p = ((lane_id & 1) == 0); | ||
| for (std::uint32_t i = 0; i < 3; i++) { | ||
| std::uint32_t j = i + 3; | ||
| swap_if_needed(k[i], v[i], k[j], v[j], p); | ||
| } | ||
| for (std::uint32_t i = 0; i < N; i += 3) { | ||
| swap_if_needed(k[0 + i], v[0 + i], k[1 + i], v[1 + i], p); | ||
| swap_if_needed(k[1 + i], v[1 + i], k[2 + i], v[2 + i], p); | ||
| swap_if_needed(k[0 + i], v[0 + i], k[1 + i], v[1 + i], p); | ||
| } | ||
| return; | ||
| } | ||
|
|
||
| const std::uint32_t b = range; | ||
| for (std::uint32_t c = b / 2; c >= 1; c >>= 1) { | ||
| const auto p = static_cast<bool>(lane_id & b) == static_cast<bool>(lane_id & c); | ||
| #pragma unroll | ||
| for (std::uint32_t i = 0; i < N; i++) { | ||
| swap_if_needed(k[i], v[i], c, p); | ||
| for (std::uint32_t i = 0; i < n; i++) { | ||
| swap_if_needed(ks[i], vs[i], c, p); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change effectively removes all of the loop unrolling, because n is not known at compile time (you can safely remove #pragma unroll as it does nothing now btw). In particular, this means the input arrays k and v cannot be passed and accessed via registers. This will likely have a huge impact on performance.
Please run a few benchmarks using ANN_BENCH first to see the impact on the throughput. From there, we can decide whether (a) performance is acceptable, (b) we need to profile the kernel using NCU and try to improve performance, or (c) the perf state is hopeless and cannot be recovered without manual loop unrolling / restoring the template parameter.
For the benchmarks, I'd suggest the following parameter sweep:
./build.sh -n libcuvs bench-ann --limit-bench-ann=CUVS_CAGRA_ANN_BENCH
./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH \
--search \
--benchmark_min_time=10s \
--benchmark_min_warmup_time=0.001 \
--benchmark_counters_tabular=true \
--benchmark_out=cagra-search-`git rev-parse --abbrev-ref HEAD`.csv \
--benchmark_out_format=csv \
--data_prefix=<data folder> \
--index_prefix=<index folder> \
--override_kv=algo:\"single_cta\" \
--override_kv=k:10:100 \
--override_kv=itopk:32:64:128:256:512 \
--override_kv=max_iterations:20 \
--override_kv=n_queries:10000 \
<config file>
… on successive warp_merge calls when N is large, we now handle run-time branch in a higher level
…c sort key, value pairs
…ch the original code
…re of large N path negatively impacting the performance of small N cases
…own (16-17%) max_itopk=512 with batch size 10000 cases
This PR converts
MAX_ITOPK & MAX_CANDIDATES in single-CTA search and
MAX_ELEMENTS in multi-CTA search to runtime parameters.
Cut libcuvs.so size (CUDA 13,
build.sh --allgpuarch -n libcuvs) from 459 MB to 350 MB. For correctness testing,
ran
To evaluate performance impact,
ran cuvs_bench with batch sizes (10, 100, 1000, 10000) and k (10 and 100) (default options for other parameters) and
deep-image-96-inner for both single-CTA and multi-CTA searches.
Performance impact varies based on MAX_ITOPK and MAX_CANDIDATES combinations, but performance numbers were roughly comparable (slightly slower on average, with the maximum slowdown around 10%).
Let me know if there are other benchmarks I need to run to test performance.
Some performance logs from the original code and the updated code for anyone interested (batch size = 10K, K=100).
Original (main)
Updated