POC: Benchmark for AVX512-VBMI based bitpack decoding for a bitwidth of 1 #8102


Draft
jhorstmann wants to merge 1 commit into main

Conversation

jhorstmann
Contributor

Which issue does this PR close?

This is a proof of concept for what may be the fastest way of decoding bitpacked data, and also a showcase of why writing short RLE runs can be detrimental to performance when decoding of bitpacked data is fast.

Rationale for this change

At the moment I'm not proposing to integrate this into the arrow codebase; the code would need further changes to support processing arbitrary batch sizes, and it currently supports only a single bitwidth (1) and only u8 as the target data type.

What changes are included in this PR?

The benchmark includes a custom RLE/bitpacking hybrid encoder that supports only a bitwidth of 1 and only writes bitpacked runs. On mostly random input data, the size of the encoded buffer is comparable to the size produced by the standard RleEncoder. Decoding with the standard RleDecoder also shows that the purely bitpacked encoding is slightly faster to decode.
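
For illustration, here is a minimal sketch of such a bitpacked-runs-only encoder for bitwidth 1, following the Parquet RLE/bit-packing hybrid layout (a ULEB128 header whose low bit marks a bit-packed run, with the run length counted in groups of 8 values). The function name and shape are mine, not the actual code in this PR:

fn encode_bitpacked_only(bits: &[bool], out: &mut Vec<u8>) {
    // Sketch: assumes the input length is a multiple of 8 values.
    assert!(bits.len() % 8 == 0);
    let groups = (bits.len() / 8) as u64;
    // Hybrid run header: (number of 8-value groups << 1) | 1, written as a
    // ULEB128 varint; the low bit distinguishes bit-packed runs from RLE runs.
    let mut header = (groups << 1) | 1;
    loop {
        let byte = (header & 0x7f) as u8;
        header >>= 7;
        if header == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
    // Pack 8 values per byte, least significant bit first.
    for chunk in bits.chunks(8) {
        let mut b = 0u8;
        for (i, &v) in chunk.iter().enumerate() {
            b |= (v as u8) << i;
        }
        out.push(b);
    }
}

Writing the whole input as one bit-packed run avoids the per-run header and the run-type branch that the standard hybrid encoding pays for every short run.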

To run the benchmarks, you will need Rust 1.89, which stabilized AVX-512 support, and an AVX-512-capable machine (at least Intel Ice Lake or AMD Zen 4).

$ RUSTFLAGS="-Ctarget-cpu=native" cargo bench --features experimental --bench rle
rle_decoder/decode_bitpacked
                        time:   [398.06 µs 398.66 µs 399.32 µs]
                        thrpt:  [2.6259 Gelem/s 2.6302 Gelem/s 2.6342 Gelem/s]
rle_decoder/decode_hybrid
                        time:   [540.05 µs 542.22 µs 544.84 µs]
                        thrpt:  [1.9246 Gelem/s 1.9339 Gelem/s 1.9416 Gelem/s]

The results are more interesting when decoding with a custom, AVX512-VBMI optimized decoder:

custom/decode_bitpacked time:   [17.642 µs 17.661 µs 17.683 µs]
                        thrpt:  [59.297 Gelem/s 59.372 Gelem/s 59.435 Gelem/s]
custom/decode_hybrid    time:   [87.593 µs 87.866 µs 88.184 µs]
                        thrpt:  [11.891 Gelem/s 11.934 Gelem/s 11.971 Gelem/s]

Decoding bitpacked data gets a speedup of about 22x, while decoding hybrid RLE data only gets about 6x faster. My guess would be that this is caused by branch mispredictions when switching between run types, or by the call to a memset-like function for RLE runs, which is not optimized for short runs.
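
For reference, the core VBMI trick for a bitwidth of 1 can be sketched as follows: broadcast the 64 packed bits to every 64-bit lane, use vpmultishiftqb (_mm512_multishift_epi64_epi8) to route packed bit i into bit 0 of output byte i, then mask with 1. This is my reconstruction of the technique, not necessarily the exact code in this PR:

use std::arch::x86_64::*;

// Decode 64 one-bit values packed in `src` into 64 output bytes (0 or 1).
// Sketch only: requires the avx512f and avx512vbmi target features.
#[target_feature(enable = "avx512f,avx512vbmi")]
unsafe fn unpack1x64(src: u64, dst: *mut u8) {
    // Shift controls 0..=63: output byte i reads the 8 bits of its 64-bit
    // lane starting at bit i, so packed bit i lands in bit 0 of byte i.
    const SHIFTS: [i8; 64] = {
        let mut s = [0i8; 64];
        let mut i = 0;
        while i < 64 {
            s[i] = i as i8;
            i += 1;
        }
        s
    };
    let ctrl = _mm512_loadu_si512(SHIFTS.as_ptr() as *const _);
    // Broadcast the 64 packed bits to all eight 64-bit lanes.
    let bits = _mm512_set1_epi64(src as i64);
    // vpmultishiftqb: per-byte variable bit-field extraction within each lane.
    let moved = _mm512_multishift_epi64_epi8(ctrl, bits);
    // Keep only the low bit of each byte.
    let ones = _mm512_set1_epi8(1);
    _mm512_storeu_si512(dst as *mut _, _mm512_and_si512(moved, ones));
}

An even simpler bitwidth-1 variant loads the 64 bits into a __mmask64 and expands with _mm512_maskz_set1_epi8 (AVX512-BW only); the multishift approach has the advantage of generalizing to other bitwidths.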

Are these changes tested?

Are there any user-facing changes?

github-actions bot added the parquet label (Changes to the parquet crate) on Aug 9, 2025
alamb marked this pull request as draft on August 13, 2025