POC: Benchmark for AVX512-VBMI based bitpack decoding for a bitwidth of 1 #8102
Which issue does this PR close?
This is a proof of concept for what may be the fastest way of decoding bitpacked data, and also a showcase of why writing short RLE runs can be detrimental to performance when decoding of bitpacked data is fast.
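To make the short-runs argument concrete, here is a hypothetical, heavily simplified sketch of a hybrid RLE/bitpacked decode loop for bitwidth 1 (single-byte headers instead of Parquet's varint headers; all names are illustrative, not taken from this PR). Every run costs a header branch, and an RLE run ends in a `memset`-like fill of `count` bytes, which is comparatively expensive when `count` is small:

```rust
/// Decode a toy hybrid stream: header byte = (count << 1) | is_bitpacked.
/// RLE run: one value byte repeated `count` times.
/// Bitpacked run: `count` bytes, each holding 8 one-bit values (LSB first).
fn decode_hybrid(mut input: &[u8], out: &mut Vec<u8>) {
    while let Some((&header, rest)) = input.split_first() {
        input = rest;
        let count = (header >> 1) as usize;
        if header & 1 == 0 {
            // RLE run: branch + memset-like fill; slow for short runs.
            let (&value, rest) = input.split_first().unwrap();
            input = rest;
            out.resize(out.len() + count, value);
        } else {
            // Bitpacked run: straight-line bit extraction.
            for _ in 0..count {
                let (&byte, rest) = input.split_first().unwrap();
                input = rest;
                for bit in 0..8 {
                    out.push((byte >> bit) & 1);
                }
            }
        }
    }
}
```

When the input alternates between many short runs, the per-run branch and short fill dominate, which is the effect the benchmark numbers below suggest.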
Rationale for this change
At the moment I'm not proposing to integrate this into the arrow codebase: the code would need further changes to support processing arbitrary batch sizes, and it currently supports only a single bitwidth and only `u8` as the target data type.

What changes are included in this PR?
The benchmark includes a custom RLE/bitpacking hybrid encoder that supports only a bitwidth of 1 and writes only bitpacked runs. On mostly random input data, the size of the encoded buffer is comparable to the size produced by the standard `RleEncoder`. Decoding with the standard `RleDecoder` also shows that decoding bitpacked data is slightly faster.

To run the benchmarks, you will need Rust 1.89, which stabilized AVX-512 support, and an AVX-512-capable machine (at least Intel Ice Lake or AMD Zen 4).
The results are more interesting when decoding with a custom, AVX512-VBMI optimized decoder:
Decoding bitpacked data gets a speedup of about 22x, while decoding hybrid RLE data only gets about 6x faster. My guess is that this is caused by branch mispredictions, or by the call to a `memset`-like function that is not optimized for short fills.

Are these changes tested?
Are there any user-facing changes?