sha256 and sha512 asm code does not use modern x64 CPU instructions #83
Description
While looking into #82, I realized that the assembly code for fast sha512 calculations on x64 does not make use of modern instruction sets such as AVX, AVX2 or BMI2. Instead, it "only" uses some variant of SSE.
As far as I've understood, the reasons for this are mostly historic, combined with the shift of developer focus towards more native Rust implementations. @nayuki initially developed the code over a decade ago when SSE was the fastest commonly available instruction set. Additionally, having multiple assembler targets for a given architecture adds maintenance overhead.
On relatively modern CPUs like the AMD Zen3 series, the missed performance gains are fairly substantial, though. When analysing this, I focused on SHA512 due to my particular workload.
The libgcrypt library has separate assembler implementations for the different common instruction set levels, see the cipher folder:
sha512-ssse3-amd64.S
sha512-avx-amd64.S
sha512-avx2-bmi2-amd64.S
sha512-avx512-amd64.S
Some quick benchmark numbers for SHA-512 throughput via libgcrypt's `tests/bench-slope --repetitions 10000` on an AMD Ryzen 5950X:
- SSSE3 implementation: 676.5 MiB/s (forced via `--disable-avx-support --disable-avx2-support` on `./configure`)
- AVX implementation: 711.6 MiB/s (forced via `--disable-avx2-support` on `./configure`)
- AVX2 + BMI2 implementation: 1084 MiB/s
These numbers are very approximate and not for asm-hashes directly, but outline the 50%+ performance improvement that asm-hashes currently leaves on the table by "only" using SSE. I did not test AVX512 since it's unavailable on Zen3 and apparently slower than AVX2+BMI2 on Zen4, see here.
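For reference, the measurements above can be reproduced roughly as follows in a libgcrypt source tree. The `configure` flags and `--repetitions` option are from my runs; the exact `bench-slope` argument syntax may differ between libgcrypt versions:

```shell
# Pin libgcrypt to the SSSE3 code path (drop the flags to allow AVX/AVX2):
./configure --disable-avx-support --disable-avx2-support
make -j"$(nproc)"
# Measure SHA-512 throughput:
tests/bench-slope --repetitions 10000 hash sha512
```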
A curious side effect of the current situation is that modern compilers like GCC produce faster machine code from the portable sha512.c than the hand-optimized SSE assembly, once the compiler is allowed to use modern instructions. In my brief testing, this is mainly due to BMI2 (enabled via the `-mbmi2` compiler flag) and its RORX instruction. The auto-generated code is closer to the performance level of the handwritten AVX implementation than to the handwritten AVX2 + BMI2 one. Still, the fact that asm-hashes would be faster if it compiled the native C variant directly with a `.flag("-mbmi2")` in build.rs is notable and may be unexpected to users of the crate.
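To illustrate the build.rs point: a minimal sketch of what compiling the C variant with BMI2 enabled could look like, assuming the crate's existing `cc` build dependency. The file path and library name here are assumptions, not the crate's actual layout:

```rust
// Hypothetical build.rs fragment: compile the portable C implementation
// with BMI2 enabled, instead of assembling the SSE .S file.
fn main() {
    cc::Build::new()
        .file("src/sha512.c")   // portable C implementation (assumed path)
        .flag("-O2")
        .flag("-mbmi2")         // lets GCC/Clang emit RORX for the rotates
        .compile("sha512");     // library name is illustrative
}
```

Note that `-mbmi2` makes the resulting binary require a BMI2-capable CPU, so a real change would need a runtime feature check or a build-time opt-in.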
I want to emphasize that this is not a fault of the original author(s), who simply didn't have the faster instructions available at the time 🙂
Some thoughts on potential next steps:
- In my opinion, `asm-hashes` would benefit from more explicit documentation of its currently outdated optimization level, to help developers make an informed decision about sticking with `sha2 0.10.x` and enabling the `asm` feature there, which relies on `asm-hashes`.
- If possible given the "maintenance mode", it would be great to see the introduction of an AVX2 + BMI2 optimized asm implementation to capture the major speedup on processor architectures from the last 6-8 years.
- Some of my investigation work may bring new impulses to the `hashes` repository's `sha2/src/sha512/x86_avx2.rs` implementation as well. In particular, the lack of BMI2 feature usage (and, perhaps, of the relevant instruction availability checks) could be a meaningful performance topic. I plan to open an issue in the `hashes` repository to outline this further.
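To make the BMI2 angle concrete: the SHA-512 round function is dominated by 64-bit rotates, which is exactly what RORX accelerates (a non-destructive rotate that leaves the flags untouched). A small sketch of one of the rotate-heavy round primitives, using the Big Sigma 1 function as defined in FIPS 180-4; with `RUSTFLAGS="-C target-feature=+bmi2"` the compiler can lower `rotate_right` to RORX:

```rust
// Big Sigma 1 from the SHA-512 compression function (FIPS 180-4):
// three 64-bit rotates and two XORs per call, executed in every round.
fn big_sigma1(e: u64) -> u64 {
    e.rotate_right(14) ^ e.rotate_right(18) ^ e.rotate_right(41)
}

fn main() {
    // 0x510e527fade682d1 is the SHA-512 initial hash value e (h4).
    let e = 0x510e527fade682d1u64;
    println!("{:016x}", big_sigma1(e));
}
```

Functions like this appear (inlined) dozens of times per block, so shaving a `mov` or flag dependency off each rotate adds up.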
There are several other related side topics that were out of scope for me, such as the absence of the dedicated SHA-1/SHA-256 CPU instructions, which bring speed improvements of 3x and more, and the new SHA-512-specific CPU instructions introduced with very recent Intel processors. If there are no plans to introduce them in this crate, some documentation to that effect could also be useful.