-
Notifications
You must be signed in to change notification settings - Fork 267
Description
In #2371 fmpz_vec
there is code for nx1-limb divexact with precomputed inverse which could be moved to ulong_extras
and mpn_extras
.
For multilimb divexact, one could be lazy and use flint_mpn_divrem_preinvn
and it turns out this beats mpn_divexact
above roughly 1000 limbs. Even better is to do a single flint_mpn_mulhigh_n
or flint_mpn_mullow_n
and exploit the fact that the quotient is known to be exact. Better yet is the bidirectional algorithm: compute the high n/2
limbs using mulhigh and the low n/2
limbs using mullow (one might need n/2+1
limbs to glue things together correctly, not sure).
Quick benchmark, timings relative to mpn_divexact
(lower is better):
n divrem_preinvn mullow bidirectional
2 1.562 0.102 0.182
4 1.469 0.151 0.151
8 1.281 0.307 0.191
16 1.747 0.618 0.285
32 1.943 0.824 0.429
64 1.867 0.871 0.483
128 1.483 0.716 0.446
256 1.350 0.663 0.421
512 1.256 0.612 0.404
1024 0.761 0.376 0.389
2048 0.634 0.309 0.283
4096 0.528 0.252 0.243
8192 0.450 0.217 0.210
16384 0.410 0.202 0.193
32768 0.384 0.190 0.185
65536 0.339 0.167 0.161
131072 0.350 0.171 0.161
262144 0.298 0.145 0.134
524288 0.268 0.132 0.126
1048576 0.276 0.130 0.118
2097152 0.282 0.139 0.118
4194304 0.267 0.130 0.119
8388608 0.273 0.135 0.124
16777216 0.287 0.144 0.127
Of course, that is just the basic idea; there will be some more overhead for adjustments, and it gets a bit more tricky to do all unbalanced cases optimally.
Once some divexact_preinv
algorithms are implemented, one could also implement a divexact
which computes the inverse on the fly (with different tuning since this will also have to account for the cost of computing different inverses).
Benchmark code:
#include <gmp.h>
#include "mpn_extras.h"
#include "profiler.h"
void __gmpn_divexact(mp_ptr, mp_srcptr, mp_size_t, mp_srcptr, mp_size_t);
int main()
{
nn_ptr C, D, A, B, AB, Binv;
slong n;
double __, tgmp, tinv, tpreinvn, tmullow, tbidirectional;
flint_rand_t state;
flint_rand_init(state);
flint_printf(" n divrem_preinvn mullow bidirectional\n");
for (n = 2; ; n *= 2)
{
A = flint_malloc(n * sizeof(mp_limb_t));
B = flint_malloc(n * sizeof(mp_limb_t));
AB = flint_malloc(2 * n * sizeof(mp_limb_t));
C = flint_malloc(2 * n * sizeof(mp_limb_t));
D = flint_malloc(2 * n * sizeof(mp_limb_t));
Binv = flint_malloc(n * sizeof(mp_limb_t));
flint_mpn_urandomb(A, state, n * FLINT_BITS);
flint_mpn_urandomb(B, state, n * FLINT_BITS);
A[n - 1] |= UWORD(3) << (FLINT_BITS - 2);
B[n - 1] |= UWORD(3) << (FLINT_BITS - 2);
A[0] |= 1;
B[0] |= 1;
flint_mpn_mul_n(AB, A, B, n);
TIMEIT_START
__gmpn_divexact(C, AB, 2 * n, B, n);
TIMEIT_STOP_VALUES(__, tgmp)
flint_mpn_preinvn(Binv, B, n);
TIMEIT_START
flint_mpn_divrem_preinvn(C, D, AB, 2 * n, B, n, Binv);
TIMEIT_STOP_VALUES(__, tpreinvn)
TIMEIT_START
flint_mpn_mullow_n(C, AB, B, n);
TIMEIT_STOP_VALUES(__, tmullow)
TIMEIT_START
flint_mpn_mullow_n(C, AB, B, n / 2);
flint_mpn_mulhigh_n(C, AB, B + n / 2, n / 2);
TIMEIT_STOP_VALUES(__, tbidirectional)
flint_printf("%10wd %.3f %.3f %.3f\n", n, tpreinvn / tgmp, tmullow / tgmp, tbidirectional / tgmp);
flint_free(A);
flint_free(B);
flint_free(AB);
flint_free(C);
flint_free(D);
flint_free(Binv);
}
}