sha256 and sha512 asm code does not use modern x64 CPU instructions #83
Description
While looking into #82, I realized that the assembly code for fast sha512 calculations on x64 does not make use of modern instruction sets such as AVX, AVX2 or BMI2. Instead, it "only" uses some variant of SSE.
As far as I've understood, the reasons for this are mostly historic, combined with the shift of developer focus towards more native Rust implementations. @nayuki initially developed the code over a decade ago when SSE was the fastest commonly available instruction set. Additionally, having multiple assembler targets for a given architecture adds maintenance overhead.
On relatively modern CPUs like the AMD Zen3 series, the missed performance gains are fairly substantial, though. When analysing this, I focused on SHA512 due to my particular workload.
The libgcrypt library has separate assembler implementations for the different common instruction set levels, see the cipher folder:
sha512-ssse3-amd64.S
sha512-avx-amd64.S
sha512-avx2-bmi2-amd64.S
sha512-avx512-amd64.S
Some quick benchmark numbers for SHA-512 throughput via libgcrypt's `tests/bench-slope --repetitions 10000` on an AMD Ryzen 5950X:
- SSSE3 implementation: 676.5 MiB/s (forced via `--disable-avx-support --disable-avx2-support` on `./configure`)
- AVX implementation: 711.6 MiB/s (forced via `--disable-avx2-support` on `./configure`)
- AVX2 + BMI2 implementation: 1084 MiB/s
These numbers are very approximate and not for asm-hashes directly, but outline the 50%+ performance improvement that asm-hashes currently leaves on the table by "only" using SSE. I did not test AVX512 since it's unavailable on Zen3 and apparently slower than AVX2+BMI2 on Zen4, see here.
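For reference, the measurements above can be reproduced roughly as follows in a libgcrypt source tree. The `configure` flags and `--repetitions` option are from my runs; the exact `bench-slope` argument syntax may differ between libgcrypt versions:

```shell
# Pin libgcrypt to the SSSE3 code path (drop the flags to allow AVX/AVX2):
./configure --disable-avx-support --disable-avx2-support
make -j"$(nproc)"
# Measure SHA-512 throughput:
tests/bench-slope --repetitions 10000 hash sha512
```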
A curious side effect of the current situation is that modern compilers like GCC produce faster machine code from the portable sha512.c than the hand-optimized SSE assembly, once the compiler is allowed to use modern instructions. In my brief testing, this is mainly due to BMI2 (enabled via the `-mbmi2` compiler flag) and its RORX instruction. The auto-generated code is closer to the performance level of the handwritten AVX implementation than to the handwritten AVX2 + BMI2 one. Still, the fact that asm-hashes would be faster if it compiled the native C variant directly with a `.flag("-mbmi2")` in build.rs is notable and may be unexpected to users of the crate.
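To illustrate the build.rs point: a minimal sketch of what compiling the C variant with BMI2 enabled could look like, assuming the crate's existing `cc` build dependency. The file path and library name here are assumptions, not the crate's actual layout:

```rust
// Hypothetical build.rs fragment: compile the portable C implementation
// with BMI2 enabled, instead of assembling the SSE .S file.
fn main() {
    cc::Build::new()
        .file("src/sha512.c")   // portable C implementation (assumed path)
        .flag("-O2")
        .flag("-mbmi2")         // lets GCC/Clang emit RORX for the rotates
        .compile("sha512");     // library name is illustrative
}
```

Note that `-mbmi2` makes the resulting binary require a BMI2-capable CPU, so a real change would need a runtime feature check or a build-time opt-in.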
I want to emphasize that this is not a fault of the original author(s), who simply didn't have the faster instructions available at the time 🙂
Some thoughts on potential next steps:
- In my opinion, `asm-hashes` would benefit from more explicit documentation of its currently outdated optimization level, to help developers make an informed decision about sticking with `sha2 0.10.x` and enabling the `asm` feature there, which relies on `asm-hashes`.
- If possible given the "maintenance mode", it would be great to see the introduction of an AVX2 + BMI2 optimized asm implementation to capture the major speedup on processor architectures from the last 6-8 years.
- Some of my investigation work may bring new impulses to the `hashes` repository's `sha2/src/sha512/x86_avx2.rs` implementation as well. In particular, the lack of BMI2 feature usage (and, perhaps, of the relevant instruction availability checks) could be a meaningful performance topic. I plan to open an issue in the `hashes` repository to outline this further.
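To make the BMI2 angle concrete: the SHA-512 round function is dominated by 64-bit rotates, which is exactly what RORX accelerates (a non-destructive rotate that leaves the flags untouched). A small sketch of one of the rotate-heavy round primitives, using the Big Sigma 1 function as defined in FIPS 180-4; with `RUSTFLAGS="-C target-feature=+bmi2"` the compiler can lower `rotate_right` to RORX:

```rust
// Big Sigma 1 from the SHA-512 compression function (FIPS 180-4):
// three 64-bit rotates and two XORs per call, executed in every round.
fn big_sigma1(e: u64) -> u64 {
    e.rotate_right(14) ^ e.rotate_right(18) ^ e.rotate_right(41)
}

fn main() {
    // 0x510e527fade682d1 is the SHA-512 initial hash value e (h4).
    let e = 0x510e527fade682d1u64;
    println!("{:016x}", big_sigma1(e));
}
```

Functions like this appear (inlined) dozens of times per block, so shaving a `mov` or flag dependency off each rotate adds up.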
There are several other related side topics that were out of scope for me, such as the absence of the dedicated SHA-1/SHA-256 CPU instructions, which bring speed improvements of 3x and more, and the new SHA-512-specific CPU instructions introduced with very recent Intel processors. If there are no plans to introduce them in this crate, some documentation to that effect could also be useful.