This repository was archived by the owner on Aug 18, 2025. It is now read-only.

sha256 and sha512 asm code does not use modern x64 CPU instructions #83

@invd


While looking into #82, I realized that the assembly code for fast sha512 calculations on x64 does not make use of modern instruction sets such as AVX, AVX2 or BMI2. Instead, it "only" uses some variant of SSE.

As far as I understand, the reasons for this are mostly historic, combined with the shift of developer focus towards more native Rust implementations. @nayuki initially developed the code over a decade ago when SSE was the fastest commonly available instruction set. Additionally, having multiple assembly targets for a given architecture adds maintenance overhead.

On relatively modern CPUs like the AMD Zen3 series, the missed performance gains are fairly substantial, though. When analysing this, I focused on SHA512 due to my particular workload.

The libgcrypt library has separate assembler implementations for the different common instruction set levels, see the cipher folder:

sha512-ssse3-amd64.S
sha512-avx-amd64.S
sha512-avx2-bmi2-amd64.S
sha512-avx512-amd64.S 

Some quick benchmark numbers for SHA-512 throughput via libgcrypt's tests/bench-slope --repetitions 10000 on an AMD Ryzen 9 5950X:

  • SSSE3 implementation: 676.5 MiB/s
    • Forced via --disable-avx-support --disable-avx2-support on ./configure
  • AVX implementation: 711.6 MiB/s
    • Forced via --disable-avx2-support on ./configure
  • AVX2 + BMI2 implementation: 1084 MiB/s

These numbers are very approximate and were not measured on asm-hashes directly, but they outline the 50%+ performance improvement that asm-hashes currently leaves on the table by "only" using SSE. I did not test AVX512 since it's unavailable on Zen3 and apparently slower than AVX2+BMI2 on Zen4, see here.
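To make the "50%+" figure concrete, here is a small sketch that turns the three throughput numbers above into relative gains. The helper name `speedup_percent` is made up for illustration; only the MiB/s values come from the measurements in this issue.

```rust
/// Relative throughput gain of `candidate` over `baseline`, in percent.
/// (Hypothetical helper, not part of asm-hashes or libgcrypt.)
pub fn speedup_percent(baseline: f64, candidate: f64) -> f64 {
    (candidate / baseline - 1.0) * 100.0
}

/// Throughput figures quoted above, in MiB/s (bench-slope on a 5950X).
pub const SSSE3_MIBS: f64 = 676.5;
pub const AVX_MIBS: f64 = 711.6;
pub const AVX2_BMI2_MIBS: f64 = 1084.0;
```

With these inputs, `speedup_percent(SSSE3_MIBS, AVX_MIBS)` comes out at roughly 5%, `speedup_percent(AVX_MIBS, AVX2_BMI2_MIBS)` at roughly 52%, and `speedup_percent(SSSE3_MIBS, AVX2_BMI2_MIBS)` at roughly 60%. So nearly all of the gain appears at the AVX2 + BMI2 step rather than at the AVX step.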

A curious side effect of the current situation is that a modern compiler such as gcc generates faster code from sha512.c than the hand-optimized SSE assembly, once the compiler is allowed to use modern instructions. In my brief testing, this is mainly due to BMI2 (enabled via the -mbmi2 compiler flag) and its RORX instruction. The auto-generated code is closer in performance to the handwritten AVX implementation than to the handwritten AVX2 + BMI2 one. Still, the fact that asm-hashes would be faster if it compiled the C variant directly with a .flag("-mbmi2") in build.rs is notable and may be unexpected to users of the crate.
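For context on why a rotate instruction can matter this much: the SHA-512 round function is dominated by 64-bit rotations. The sketch below writes the two "big sigma" functions from FIPS 180-4 in plain Rust. It is an illustration only, not code from asm-hashes or sha512.c, and the function names are my own.

```rust
/// SHA-512 Sigma0 as defined in FIPS 180-4: three rotations XORed together.
#[inline]
pub fn big_sigma0(x: u64) -> u64 {
    x.rotate_right(28) ^ x.rotate_right(34) ^ x.rotate_right(39)
}

/// SHA-512 Sigma1 as defined in FIPS 180-4.
#[inline]
pub fn big_sigma1(x: u64) -> u64 {
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}
```

Each call needs three rotated copies of the same input. The classic ROR instruction rotates in place and modifies flags, so the input typically has to be copied first. RORX writes to a separate destination register and leaves the flags alone. When BMI2 is enabled for the build (for example through `-C target-feature=+bmi2`), the compiler is free to lower `rotate_right` to RORX. Whether it actually does so depends on the compiler version and the surrounding code, so check the generated assembly rather than assume it.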

I want to emphasize that this is not a fault of the original author(s), who simply didn't have the faster instructions available at the time 🙂

Some thoughts on potential next steps:

  1. In my opinion, asm-hashes would benefit from more explicit documentation of its current, outdated optimization level. This would help developers make an informed decision about staying on sha2 0.10.x with the asm feature enabled, which relies on asm-hashes.
  2. If possible given the "maintenance mode", it would be great to see an AVX2 + BMI2 optimized asm implementation introduced, to capture the major speedup available on processor architectures from the last 6-8 years.
  3. Some of my investigation work may also provide new input for the hashes sha2/src/sha512/x86_avx2.rs implementation. In particular, the fact that it does not use the BMI2 feature set (and perhaps the availability of the relevant instructions) could be a meaningful performance topic. I plan to open an issue in the hashes repository to outline this further.
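If several implementation tiers existed side by side, as suggested in point 2, something would have to pick one at run time. Below is a minimal sketch of such a selection using the standard library's `is_x86_feature_detected!` macro. The `Sha512Impl` enum and `select_sha512_impl` function are hypothetical names chosen to mirror the libgcrypt file list above. They are not an existing API of asm-hashes or sha2.

```rust
/// Hypothetical implementation tiers, mirroring the libgcrypt file names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sha512Impl {
    Avx2Bmi2,
    Avx,
    Ssse3,
    Portable,
}

/// Pick the fastest tier that the running CPU reports support for.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn select_sha512_impl() -> Sha512Impl {
    if std::is_x86_feature_detected!("avx2") && std::is_x86_feature_detected!("bmi2") {
        Sha512Impl::Avx2Bmi2
    } else if std::is_x86_feature_detected!("avx") {
        Sha512Impl::Avx
    } else if std::is_x86_feature_detected!("ssse3") {
        Sha512Impl::Ssse3
    } else {
        Sha512Impl::Portable
    }
}

/// On other architectures only the portable fallback applies.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn select_sha512_impl() -> Sha512Impl {
    Sha512Impl::Portable
}
```

The order of the checks follows the benchmark ranking from this issue (AVX2 + BMI2 first, SSSE3 last). AVX512 is left out on purpose, since I did not measure it.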

There are several other side topics that are related but were out of scope for me. One is the lack of support for the native SHA1/SHA256-specific CPU instructions, which bring speed improvements of 3x and more. Another is the new SHA512-specific CPU instructions introduced with very recent Intel processors. If there are no plans to introduce them in this crate, some documentation to that effect could also be useful.
