POC: Benchmark for AVX512-VBMI based bitpack decoding for a bitwidth of 1 #8102
Which issue does this PR close?
This is a proof of concept for what may be the fastest way of decoding bitpacked data, and also a showcase of why writing short RLE runs can be detrimental to performance when decoding of bitpacked data is fast.
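To make the short-runs argument concrete, here is a hypothetical, heavily simplified sketch of a hybrid RLE/bitpacked decode loop for bitwidth 1 (single-byte headers instead of Parquet's varint headers; all names are illustrative, not taken from this PR). Every run costs a header branch, and an RLE run ends in a `memset`-like fill of `count` bytes, which is comparatively expensive when `count` is small:

```rust
/// Decode a toy hybrid stream: header byte = (count << 1) | is_bitpacked.
/// RLE run: one value byte repeated `count` times.
/// Bitpacked run: `count` bytes, each holding 8 one-bit values (LSB first).
fn decode_hybrid(mut input: &[u8], out: &mut Vec<u8>) {
    while let Some((&header, rest)) = input.split_first() {
        input = rest;
        let count = (header >> 1) as usize;
        if header & 1 == 0 {
            // RLE run: branch + memset-like fill; slow for short runs.
            let (&value, rest) = input.split_first().unwrap();
            input = rest;
            out.resize(out.len() + count, value);
        } else {
            // Bitpacked run: straight-line bit extraction.
            for _ in 0..count {
                let (&byte, rest) = input.split_first().unwrap();
                input = rest;
                for bit in 0..8 {
                    out.push((byte >> bit) & 1);
                }
            }
        }
    }
}
```

When the input alternates between many short runs, the per-run branch and short fill dominate, which is the effect the benchmark numbers below suggest.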
Rationale for this change
At the moment I'm not proposing to integrate this into the arrow codebase: the code would need further changes to support processing arbitrary batch sizes, and it currently supports only a single bitwidth and only `u8` as the target data type.

What changes are included in this PR?
The benchmark includes a custom RLE/bitpacking hybrid encoder that supports only a bitwidth of 1 and writes only bitpacked runs. On mostly random input data, the size of the encoded buffer is comparable to the size produced by the standard `RleEncoder`. Decoding with the standard `RleDecoder` also shows that decoding bitpacked data is slightly faster.

To run the benchmarks, you will need Rust 1.89, which stabilized AVX-512 support, and an AVX-512-capable machine (at least Intel Ice Lake or AMD Zen 4).
The results are more interesting when decoding with a custom, AVX512-VBMI optimized decoder:
Decoding bitpacked data gets a speedup of about 22x, while decoding hybrid RLE data only gets about 6x faster. My guess is that this is caused by branch mispredictions, or by the call to a `memset`-like function that is not optimized for short fills.

Are these changes tested?
Are there any user-facing changes?