Add bit transpose operations #6928

Merged
robert3005 merged 9 commits into develop from rk/bittranspose on Mar 13, 2026

@robert3005 robert3005 commented Mar 13, 2026

Adds logic to convert bit buffers to and from the transposed layout. This is useful wherever
intermediate arrays are kept in the transposed layout.

In particular, this is necessary to handle DeltaArray validity correctly.
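
As a rough illustration of what transposing a bit buffer means, here is a minimal scalar sketch over one 1024-bit block (128 bytes). The index mapping `transpose_index`, the `FL_ORDER` constant, the LSB-first bit numbering, and all function names are assumptions made for this example; they are modelled on the FastLanes-style ordering and are not taken from this PR's code.

```rust
/// Assumed ordering of the 8 row groups (hypothetical for this sketch).
const FL_ORDER: [usize; 8] = [0, 4, 2, 6, 1, 5, 3, 7];

/// Hypothetical mapping from a logical bit index (0..1024) to its
/// position in the transposed layout. It is a bijection on 0..1024.
fn transpose_index(idx: usize) -> usize {
    let lane = idx % 16;
    let order = (idx / 16) % 8;
    let row = idx / 128;
    lane * 64 + FL_ORDER[order] * 8 + row
}

/// Read bit `i`, assuming LSB-first bit numbering within each byte.
fn get_bit(buf: &[u8; 128], i: usize) -> bool {
    (buf[i / 8] >> (i % 8)) & 1 == 1
}

/// Set bit `i`, assuming LSB-first bit numbering within each byte.
fn set_bit(buf: &mut [u8; 128], i: usize) {
    buf[i / 8] |= 1 << (i % 8);
}

/// Move every set bit from its logical position to its transposed position.
fn transpose_bits_scalar(input: &[u8; 128]) -> [u8; 128] {
    let mut out = [0u8; 128];
    for i in 0..1024 {
        if get_bit(input, i) {
            set_bit(&mut out, transpose_index(i));
        }
    }
    out
}

/// Inverse of `transpose_bits_scalar`: read each bit back from its
/// transposed position and store it at its logical position.
fn untranspose_bits_scalar(input: &[u8; 128]) -> [u8; 128] {
    let mut out = [0u8; 128];
    for i in 0..1024 {
        if get_bit(input, transpose_index(i)) {
            set_bit(&mut out, i);
        }
    }
    out
}
```

Under these assumptions, a validity bitmap transposed this way lines up bit-for-bit with values stored in the same transposed order, and `untranspose_bits_scalar` restores the original buffer.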

On Zen 5 the BMI2 implementation is 20-30% faster than scalar, and the VBMI implementation is ~10-20x faster.
On Zen 3 the BMI2 implementation is still around 20-30% faster.
On M4 the NEON transpose is ~60-100% faster, while untranspose performs about the same as scalar.
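
Since the fastest kernel depends on the CPU, one common way to choose between such implementations is runtime feature detection. The sketch below only reports which kernel would be picked; the function name `select_kernel` and the returned labels are hypothetical, and how this PR actually dispatches is not shown here.

```rust
/// Hypothetical helper: name the fastest kernel family the current CPU
/// supports, falling back to the portable scalar implementation.
fn select_kernel() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // Prefer AVX-512 VBMI when available, then BMI2.
        if std::arch::is_x86_feature_detected!("avx512vbmi") {
            return "vbmi";
        }
        if std::arch::is_x86_feature_detected!("bmi2") {
            return "bmi2";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }
    "scalar"
}
```

On a machine without AVX-512 VBMI, such as the Zen 3 host in the results below, a check like this would fall through to BMI2.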

Full benchmark results for posterity:
on a Zen 5 machine (m8azn)

Compiling fastlanes v0.5.0 (/home/ubuntu/fastlanes)
    Finished `bench` profile [optimized] target(s) in 0.56s
     Running benches/bit_transpose.rs (target/release/deps/bit_transpose-dd4b19170a0386ff)
Timer precision: 10 ns
bit_transpose                      fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ transpose_scalar                59.8 ns       │ 1.059 µs      │ 69.8 ns       │ 75.7 ns       │ 100     │ 100
├─ transpose_scalar_throughput     8.349 µs      │ 15.75 µs      │ 8.369 µs      │ 8.443 µs      │ 100     │ 100
├─ untranspose_scalar              39.8 ns       │ 40.73 ns      │ 40.11 ns      │ 39.99 ns      │ 100     │ 3200
├─ untranspose_scalar_throughput   8.889 µs      │ 14.26 µs      │ 8.909 µs      │ 8.985 µs      │ 100     │ 100
╰─ x86                                           │               │               │               │         │
   ├─ transpose_bmi2               29.64 ns      │ 30.26 ns      │ 29.95 ns      │ 29.94 ns      │ 100     │ 6400
   ├─ transpose_bmi2_throughput    6.309 µs      │ 10.92 µs      │ 6.339 µs      │ 6.386 µs      │ 100     │ 100
   ├─ transpose_vbmi               3.608 ns      │ 3.667 ns      │ 3.628 ns      │ 3.631 ns      │ 100     │ 51200
   ├─ transpose_vbmi_throughput    504.8 ns      │ 524.8 ns      │ 509.8 ns      │ 508 ns        │ 100     │ 200
   ├─ untranspose_bmi2             27.61 ns      │ 70.42 ns      │ 27.76 ns      │ 28.24 ns      │ 100     │ 6400
   ├─ untranspose_bmi2_throughput  6.199 µs      │ 10.57 µs      │ 6.219 µs      │ 6.262 µs      │ 100     │ 100
   ├─ untranspose_vbmi             3.589 ns      │ 3.628 ns      │ 3.608 ns      │ 3.604 ns      │ 100     │ 51200
   ╰─ untranspose_vbmi_throughput  489.8 ns      │ 504.8 ns      │ 494.8 ns      │ 496.7 ns      │ 100     │ 200

on a Zen 3 machine (c6a)

Compiling fastlanes v0.5.0 (/home/ubuntu/fastlanes)
    Finished `bench` profile [optimized] target(s) in 1.02s
     Running benches/bit_transpose.rs (target/release/deps/bit_transpose-dd4b19170a0386ff)
Timer precision: 20 ns
bit_transpose                      fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ transpose_scalar                70.65 ns      │ 962.9 ns      │ 70.97 ns      │ 81.99 ns      │ 100     │ 3200
├─ transpose_scalar_throughput     15.68 µs      │ 37.39 µs      │ 15.72 µs      │ 16.04 µs      │ 100     │ 100
├─ untranspose_scalar              72.84 ns      │ 242.2 ns      │ 75.04 ns      │ 77.92 ns      │ 100     │ 3200
├─ untranspose_scalar_throughput   16.13 µs      │ 32.4 µs       │ 16.26 µs      │ 16.69 µs      │ 100     │ 100
╰─ x86                                           │               │               │               │         │
   ├─ transpose_bmi2               58.75 ns      │ 229.4 ns      │ 59.09 ns      │ 61.18 ns      │ 100     │ 3200
   ├─ transpose_bmi2_throughput    12.33 µs      │ 27.84 µs      │ 12.41 µs      │ 12.68 µs      │ 100     │ 100
   ├─ transpose_vbmi               warning: No benchmark function registered for 'transpose_vbmi'

   ├─ transpose_vbmi_throughput    warning: No benchmark function registered for 'transpose_vbmi_throughput'

   ├─ untranspose_bmi2             57.22 ns      │ 452.2 ns      │ 58.78 ns      │ 63.58 ns      │ 100     │ 3200
   ├─ untranspose_bmi2_throughput  13.23 µs      │ 25.47 µs      │ 13.29 µs      │ 13.61 µs      │ 100     │ 100
   ├─ untranspose_vbmi             warning: No benchmark function registered for 'untranspose_vbmi'

   ╰─ untranspose_vbmi_throughput  warning: No benchmark function registered for 'untranspose_vbmi_throughput'

on an M4 Max

    Finished `bench` profile [optimized] target(s) in 0.02s
     Running benches/bit_transpose.rs (target/release/deps/bit_transpose-9780c0103c6d2dc3)
Timer precision: 41 ns
bit_transpose                      fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ transpose_scalar                24.48 ns      │ 25.78 ns      │ 24.97 ns      │ 25.03 ns      │ 100     │ 25600
├─ transpose_scalar_throughput     5.249 µs      │ 9.166 µs      │ 5.332 µs      │ 5.545 µs      │ 100     │ 100
├─ untranspose_scalar              25.79 ns      │ 37.5 ns       │ 26.44 ns      │ 28.07 ns      │ 100     │ 12800
├─ untranspose_scalar_throughput   5.457 µs      │ 7.999 µs      │ 5.583 µs      │ 5.821 µs      │ 100     │ 100
╰─ aarch64                                       │               │               │               │         │
   ├─ transpose_neon               14.23 ns      │ 20.57 ns      │ 14.55 ns      │ 15.22 ns      │ 100     │ 25600
   ├─ transpose_neon_throughput    3.187 µs      │ 4.52 µs       │ 3.228 µs      │ 3.262 µs      │ 100     │ 200
   ├─ untranspose_neon             23.67 ns      │ 35.06 ns      │ 24.32 ns      │ 25.19 ns      │ 100     │ 25600
   ╰─ untranspose_neon_throughput  4.874 µs      │ 7.082 µs      │ 4.958 µs      │ 5.174 µs      │ 100     │ 100

Signed-off-by: Robert Kruszewski <github@robertk.io>

@robert3005 robert3005 added the changelog/feature A new feature label Mar 13, 2026

@joseph-isaacs joseph-isaacs left a comment

LG otherwise

@robert3005 robert3005 merged commit 0b981a8 into develop Mar 13, 2026
54 checks passed
@robert3005 robert3005 deleted the rk/bittranspose branch March 13, 2026 12:40