
Faster _fmpz_vec_scalar_divexact_ui #2371


Open
wants to merge 7 commits into
base: main

fredrik-johansson
Collaborator

Speed up _fmpz_vec_scalar_divexact_si, _fmpz_vec_scalar_divexact_ui, and _fmpz_vec_scalar_divexact_fmpz with a small divisor by precomputing an inverse mod 2^FLINT_BITS. Power-of-two divisors are also special-cased.

This probably will not make a measurable difference on any macrobenchmarks, because the standard use case is to divide out a GCD which was more expensive to compute in the first place. However, I thought I'd put this code in as a template for doing divexact with precomputed inverses; other functions like fmpz_mat_fflu and _fmpz_poly_interpolate_exact_newton could use the same trick in the future.

Multi-limb preinverted divexact should also be done some day but it's a lot more work.
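
For readers unfamiliar with the single-limb trick, here is a minimal sketch of exact division by a precomputed inverse mod 2^64. It is not the code in this PR: it works on plain uint64_t arrays rather than fmpz vectors, assumes a 64-bit limb, and the names inv_mod_2_64 and vec_divexact_u64 are made up for illustration. The divisor is split into a power of two (handled by a shift) and an odd part (handled by one multiplication per entry).

```c
#include <stddef.h>
#include <stdint.h>

/* Inverse of an odd d modulo 2^64 via Newton (Hensel) iteration.
   Hypothetical helper, not a FLINT function. */
static uint64_t inv_mod_2_64(uint64_t d)
{
    uint64_t x = d;          /* d*d == 1 (mod 8), so x is correct to 3 bits */
    for (int i = 0; i < 5; i++)
        x *= 2 - d * x;      /* each step doubles the number of correct bits */
    return x;                /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 >= 64 bits */
}

/* res[i] = a[i] / d, assuming d != 0 and d divides every a[i] exactly.
   Hypothetical helper operating on single words only. */
static void vec_divexact_u64(uint64_t *res, const uint64_t *a,
                             size_t len, uint64_t d)
{
    unsigned shift = 0;

    /* Strip the power-of-two part of d; it is divided out by shifting. */
    while ((d & 1) == 0)
    {
        d >>= 1;
        shift++;
    }

    /* The inverse is computed once and reused for the whole vector. */
    uint64_t dinv = inv_mod_2_64(d);

    for (size_t i = 0; i < len; i++)
        res[i] = (a[i] >> shift) * dinv;   /* exact quotient mod 2^64 */
}
```

Because the quotient is known to fit in one word, the low 64 bits of the product are the exact quotient, so no hardware division is needed inside the loop. The result is only meaningful when the division really is exact.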

Profile code output:

$ build/fmpz_vec/profile/p-divexact_ui 
   len    bits(A*c)  bits(c)      old       new    speedup
     1         2        2    2.77e-09   3.67e-09   0.755
     1         0       46    2.81e-09   3.46e-09   0.812
     1        66       51    2.61e-08   2.66e-08   0.981
     1         0       33    2.83e-09   3.47e-09   0.816
     1        79       44    2.34e-08   2.41e-08   0.971
     1       112       36       1e-08   1.06e-08   0.943
     1         0       44    2.83e-09   3.48e-09   0.813
     1      -518       34    1.88e-08   1.88e-08   1.000
     1       375       43    1.36e-08   1.36e-08   1.000
     1       929       17    2.64e-08   2.66e-08   0.992
     1      -114       54    2.37e-08   2.43e-08   0.975
     2         0       45    5.71e-09   7.27e-09   0.785
     2       -64       55    2.51e-08   7.06e-09   3.555
     2       -46       30     5.7e-09   7.24e-09   0.787
     2       -59       22    5.85e-09   6.64e-09   0.881
     2        45       17    5.99e-09   7.32e-09   0.818
     2      -170       63    2.06e-08   1.67e-08   1.234
     2         0       31    5.78e-09    6.6e-09   0.876
     2       564       18    2.08e-08    2.1e-08   0.990
     2       817       22    4.74e-08   4.54e-08   1.044
     2     -2270       43    1.21e-07   1.24e-07   0.976
     2     -4587       17    2.38e-07   2.42e-07   0.983
     3        58       54    8.15e-09   8.68e-09   0.939
     3       -58       56    8.18e-09   7.46e-09   1.097
     3       -28       13    8.15e-09   8.66e-09   0.941
     3       -39        7    8.12e-09   7.44e-09   1.091
     3      -131       55    5.78e-08   1.64e-08   3.524
     3      -133       21    2.25e-08   1.47e-08   1.531
     3      -262       23    3.42e-08   2.85e-08   1.200
     3      -497       34    4.78e-08   3.71e-08   1.288
     3     -1095       41    7.62e-08   6.96e-08   1.095
     3      2392       52    1.63e-07   1.62e-07   1.006
     3     -2164       30    6.37e-08   6.78e-08   0.940
     4       -31       27    1.08e-08   9.72e-09   1.111
     4       -27       20    1.07e-08   8.64e-09   1.238
     4       -64       64    6.98e-08   1.03e-08   6.777
     4       -91       53    9.38e-08    1.4e-08   6.700
     4       -81        4    2.51e-08   2.36e-08   1.064
     4      -165       64    3.24e-08    2.3e-08   1.409
     4      -365       56    5.74e-08   3.15e-08   1.822
     4      -488       47    4.56e-08   3.53e-08   1.292
     4     -1248       60    1.04e-07   7.02e-08   1.481
     4     -2282       30    2.04e-07   1.98e-07   1.030
     4      -234        7    1.86e-08   1.41e-08   1.319
     5        54       54    1.35e-08   9.52e-09   1.418
     5        32       24    1.36e-08   1.14e-08   1.193
     5        53       53    1.35e-08   1.14e-08   1.184
     5         0       29    1.35e-08   1.14e-08   1.184
     5      -135       64    8.81e-08   2.24e-08   3.933
     5         0       60    1.39e-08   9.54e-09   1.457
     5      -167       13    2.06e-08   1.38e-08   1.493
     5      -685       51    8.26e-08   6.32e-08   1.307
     5     -1235       27    9.26e-08   7.55e-08   1.226
     5     -2277        5    1.95e-07   1.89e-07   1.032
     5     -4833       22     3.6e-07   3.58e-07   1.006
     6        -6        2    1.61e-08   7.21e-09   2.233
     6       -32       22    1.61e-08   1.27e-08   1.268
     6       -82       62    9.71e-08   1.52e-08   6.388
     6        46       30    1.61e-08   1.07e-08   1.505
     6        75       14    3.64e-08   1.17e-08   3.111
     6      -189       37    4.66e-08   2.98e-08   1.564
     6       342       37    2.59e-08   2.12e-08   1.222
     6         0       41    1.61e-08   1.07e-08   1.505
     6     -1172        8    1.57e-07   1.36e-07   1.154
     6      2384       12    1.36e-07   1.37e-07   0.993
     6     -5083       53    6.66e-07   6.66e-07   1.000
     7       -33       28    1.87e-08   1.41e-08   1.326
     7         0       20    1.87e-08    1.4e-08   1.336
     7       -80       62    1.22e-07   1.75e-08   6.971
     7        69       56     3.9e-08   1.19e-08   3.277
     7      -105       32    5.39e-08   1.97e-08   2.736
     7         0       37    1.87e-08   1.18e-08   1.585
     7      -287       20     2.7e-08   1.86e-08   1.452
     7      -458       43     7.3e-08   5.17e-08   1.412
     7      -957       46    1.06e-07   8.26e-08   1.283
     7     -2357       58    2.86e-07   2.77e-07   1.032
     7     -4344       40    4.56e-07   4.45e-07   1.025
     8         0        9    2.14e-08   1.28e-08   1.672
     8        -4        4    2.14e-08   1.28e-08   1.672
     8       -78       59    1.24e-07    1.7e-08   7.294
     8       -74       37    8.35e-08   1.78e-08   4.691
     8      -116       36    1.13e-07    2.8e-08   4.036
     8      -191       63    3.68e-08   2.44e-08   1.508
     8      -271       36    3.85e-08   2.92e-08   1.318
     8      -398       23    9.22e-08    4.9e-08   1.882
     8     -1115        8    1.61e-07   1.25e-07   1.288
     8     -1371       46     1.1e-07   9.68e-08   1.136
     8     -4883       46     3.9e-07   3.69e-07   1.057
     9         0        4    2.41e-08   1.59e-08   1.516
     9       -42       33    2.41e-08    1.6e-08   1.506
     9       -45       29     2.4e-08   1.59e-08   1.509
     9        53       39     2.4e-08   1.35e-08   1.778
     9      -129       53    1.13e-07   3.16e-08   3.576
     9      -159        1    7.18e-08   5.42e-08   1.325
     9      -326       60     6.1e-08   4.38e-08   1.393
     9         0       62     2.4e-08   1.35e-08   1.778
     9     -1212       62    2.03e-07   1.62e-07   1.253
     9         0       28     2.4e-08   1.59e-08   1.509
     9     -3064       16    1.88e-07   1.75e-07   1.074
    10       -21       18    2.67e-08   1.44e-08   1.854
    10       -10        2    2.67e-08   1.43e-08   1.867
    10       -22        2    2.66e-08   1.44e-08   1.847
    10       -40       23    2.68e-08   1.43e-08   1.874
    10      -102       22    1.14e-07   2.03e-08   5.616
    10       178       64    3.45e-08   2.33e-08   1.481
    10      -299       36    6.36e-08   2.92e-08   2.178
    10       467       49    3.68e-08   2.59e-08   1.421
    10      -855       43    4.86e-08   3.99e-08   1.218
    10     -2122       63    1.22e-07   1.14e-07   1.070
    10     -4740       15    7.86e-07   1.45e-07   5.421

@fredrik-johansson
Collaborator Author

fredrik-johansson commented Jul 19, 2025

It would be good to see how this compares on Intel and Apple hardware before merging.
