[Gh-539][Encoding] Add ALPpd encoding to parquet #48345
base: main
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
1b78a5c to
d563ce0
Compare
I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations
In this case I would recommend https://github.com/apache/parquet-testing
Makes sense. Thanks.
apache/parquet-testing#100
    DELTA_BYTE_ARRAY = 7,
    RLE_DICTIONARY = 8,
    BYTE_STREAM_SPLIT = 9,
    ALP = 10,
🎉
Thanks @prtkgaur -- it is super exciting to see this movement. Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review. I started the CI checks on this PR and had some comments about the testing.
    std::string tarball_path = std::string(__FILE__);
    tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
    tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
    tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";
@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100
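For readers following the path handling in the snippet above, here is a minimal sketch of the same "strip two trailing components, then append" logic using `std::filesystem` (C++17). The helper name `ResolveFromGrandparent` is hypothetical and is not part of this PR.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper (not from the PR): drops the file name and one directory
// level from `source_file`, then appends `relative` to what is left.
// `relative` must not start with '/', otherwise operator/ would discard the base.
std::string ResolveFromGrandparent(const std::string& source_file,
                                   const std::string& relative) {
  namespace fs = std::filesystem;
  // parent_path() once removes the file name, twice removes its directory.
  fs::path base = fs::path(source_file).parent_path().parent_path();
  // generic_string() always uses '/' as the separator.
  return (base / relative).generic_string();
}
```

In a test it would be called as `ResolveFromGrandparent(__FILE__, "some/relative/data.tar.gz")`, where the relative path is only a placeholder.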
    // Unsafe resize without initialization - use only when you will immediately
    // overwrite the memory (e.g., before memcpy). Only safe for POD types.
    void UnsafeResize(size_t n) {
Using this over resize gave us around 2-3% performance improvement
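To illustrate why skipping initialization can help, here is a minimal sketch of a buffer whose resize leaves new elements uninitialized. The class `RawBuffer` and its `Assign` method are hypothetical names for illustration only; the container in this PR may be implemented differently.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical buffer (not the PR's class) restricted to trivial types, so
// leaving elements uninitialized and copying them with memcpy is well defined.
template <typename T>
class RawBuffer {
  static_assert(std::is_trivially_copyable_v<T> &&
                    std::is_trivially_default_constructible_v<T>,
                "RawBuffer only supports trivial (POD-like) types");

 public:
  // Changes the size without zero-filling. New elements hold indeterminate
  // values until the caller overwrites them.
  void UnsafeResize(std::size_t n) {
    if (n > capacity_) {
      std::unique_ptr<T[]> grown(new T[n]);  // default-init: no zero-fill
      if (size_ > 0) std::memcpy(grown.get(), data_.get(), size_ * sizeof(T));
      data_ = std::move(grown);
      capacity_ = n;
    }
    size_ = n;
  }

  // Typical use: resize, then immediately overwrite every element.
  void Assign(const T* src, std::size_t n) {
    UnsafeResize(n);
    if (n > 0) std::memcpy(data_.get(), src, n * sizeof(T));
  }

  T* data() { return data_.get(); }
  std::size_t size() const { return size_; }

 private:
  std::unique_ptr<T[]> data_;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```

A regular `std::vector<T>::resize` value-initializes the new elements, which is wasted work when a `memcpy` overwrites them right away; that is the cost this kind of method avoids.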
Co-authored-by: Dhirhan Kanesalingam <[email protected]>
Also ensure that no line exceeds 90 characters
This reverts commit e85658b42b5373ef7e54295b100d1f083d55dd8d.
0c035b7 to
1cb0852
Compare
Co-authored-by: [email protected]
Rationale for this change
ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding/compression techniques.
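As background, the core idea of ALP can be sketched as follows: a double that is really a short decimal is scaled by a power of ten into an integer, and it is kept in that form only if decoding reproduces the original bits exactly; everything else is stored as an exception. The function names and the use of division below are simplifying assumptions for illustration, not the code in this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <optional>

// Powers of ten that are exactly representable as doubles.
constexpr double kPow10[] = {1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,
                             1e7,  1e8,  1e9,  1e10, 1e11, 1e12, 1e13,
                             1e14, 1e15, 1e16, 1e17, 1e18};

// Hypothetical decoder: digits * 10^f / 10^e.
double AlpDecode(std::int64_t digits, int e, int f) {
  assert(e >= 0 && e <= 18 && f >= 0 && f <= 18);
  return static_cast<double>(digits) * kPow10[f] / kPow10[e];
}

// Hypothetical encoder: returns the integer form of `value` for exponent `e`
// and factor `f`, or nullopt when the value must be stored as an exception.
std::optional<std::int64_t> AlpTryEncode(double value, int e, int f) {
  assert(e >= 0 && e <= 18 && f >= 0 && f <= 18);
  const double scaled = value * kPow10[e] / kPow10[f];
  // Reject NaN/inf and magnitudes too large to round-trip through an integer.
  if (!std::isfinite(scaled) || std::fabs(scaled) > 9.0e15) return std::nullopt;
  const std::int64_t digits = std::llround(scaled);
  // Keep the encoding only if it is bit-exact (this also rejects -0.0).
  const double decoded = AlpDecode(digits, e, f);
  if (std::memcmp(&decoded, &value, sizeof(double)) != 0) return std::nullopt;
  return digits;
}
```

For example, `AlpTryEncode(12.5, 1, 0)` yields the integer 125, while `AlpTryEncode(1.0 / 3.0, 2, 0)` yields no value because 0.33 does not decode back to one third. The resulting integers tend to have a small range within a block, which is what makes them cheap to bit-pack.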
What changes are included in this PR?
This PR
Introduces ALP (pseudo-decimal) encoding into the Arrow C++ code.
We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.
Adding the above required us to add the following classes.
Integration of the above code was done in
Unit tests were added to
And Benchmarks are added to
Are these changes tested?
Are there any user-facing changes?