AVX IFMA Accelerated `split_mul` #885

I prototyped a portable `split_mul` implementation that uses the AVX IFMA instructions for acceleration. I believe it is decently optimized, the code is well-documented, the benchmark is easy to run, and the README explains the challenges moving forward. Unfortunately, despite observing a 20-40% improvement _when compiled exclusively for the native CPU_, I saw no real-world performance benefit when I patched my crypto-bigint dependency to a fork that used this `split_mul` implementation where reasonable.

https://github.com/kayabaNerve/avx-ifma-mul

I'd love to see further discussion and potential upstreaming (again, as explained in the README). I would have written this up entirely within the issue if I did not want to provide the implementation and benchmark as an artifact (hence the repository).
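
For readers unfamiliar with IFMA, here is a minimal scalar sketch of what a single 64-bit lane of the 52-bit multiply-add instructions (`vpmadd52luq` / `vpmadd52huq`) computes. It is illustrative only and is not the code from the repository; the names `MASK52`, `madd52lo`, `madd52hi`, and `mul52_wide` are hypothetical, and a real implementation would use the vector intrinsics across several lanes at once.

```rust
/// Mask selecting the low 52 bits of a 64-bit lane.
const MASK52: u64 = (1u64 << 52) - 1;

/// Scalar model of one lane of `vpmadd52luq`:
/// acc + (low 52 bits of the 104-bit product b[51:0] * c[51:0]).
fn madd52lo(acc: u64, b: u64, c: u64) -> u64 {
    let prod = u128::from(b & MASK52) * u128::from(c & MASK52);
    acc.wrapping_add((prod as u64) & MASK52)
}

/// Scalar model of one lane of `vpmadd52huq`:
/// acc + (bits 52..104 of the 104-bit product b[51:0] * c[51:0]).
fn madd52hi(acc: u64, b: u64, c: u64) -> u64 {
    let prod = u128::from(b & MASK52) * u128::from(c & MASK52);
    acc.wrapping_add((prod >> 52) as u64)
}

/// Widening multiply of two 52-bit limbs, returned as (lo, hi) 52-bit halves.
/// Hypothetical helper for illustration; inputs above 52 bits are truncated.
fn mul52_wide(a: u64, b: u64) -> (u64, u64) {
    (madd52lo(0, a, b), madd52hi(0, a, b))
}
```

Because each lane only multiplies 52-bit operands, a big integer stored as 64-bit limbs has to be repacked into 52-bit limbs before these instructions apply and converted back afterwards, which is part of the overhead any such implementation has to amortize.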
