@prtkgaur prtkgaur commented Dec 5, 2025

Co-authored-by: [email protected]

Rationale for this change

ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding/compression techniques.

What changes are included in this PR?

This PR introduces ALP (pseudo-decimal) encoding into the C++ Arrow code.
We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.

Adding the above required the following classes.

  • Alp h/cc : Houses core logic for encoding and decoding.
  • Sampler h/cc : Houses logic to sample and select parameters for encoding.
  • AlpWrapper h/cc : Binds together Alp and Sampler classes.
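
For readers new to ALP, the sketch below illustrates the general pseudo-decimal idea that the encode/decode and sampling classes are built around: scale each double by a power of ten, round to an integer, and keep any value that does not round-trip bit-exactly as an exception. It is a simplified illustration only. All names (`AlpSketchVector`, `EncodeSketch`, `DecodeSketch`, `PickExponentSketch`) are hypothetical and are not the API added by this PR, and a real implementation would also bit-pack the integers and choose parameters by estimated size rather than by exception count alone.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical container for one encoded vector (not the PR's types).
struct AlpSketchVector {
  int exponent = 0;                              // values are scaled by 10^exponent
  int factor = 0;                                // then divided by 10^factor
  std::vector<int64_t> digits;                   // integer form of each value
  std::vector<std::size_t> exception_positions;  // indices that did not round-trip
  std::vector<double> exception_values;          // original doubles at those indices
};

// Powers of ten are exact in a double for small k, which is all we use here.
inline double Pow10(int k) {
  double p = 1.0;
  for (int i = 0; i < k; ++i) p *= 10.0;
  return p;
}

// Bitwise comparison, so that -0.0 and NaN are not silently altered.
inline bool SameBits(double a, double b) {
  return std::memcmp(&a, &b, sizeof(double)) == 0;
}

inline double DecodeOne(int64_t digits, int exponent, int factor) {
  return static_cast<double>(digits) * Pow10(factor) / Pow10(exponent);
}

inline AlpSketchVector EncodeSketch(const std::vector<double>& values,
                                    int exponent, int factor) {
  AlpSketchVector out;
  out.exponent = exponent;
  out.factor = factor;
  out.digits.reserve(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    const double v = values[i];
    const double scaled = v * Pow10(exponent) / Pow10(factor);
    int64_t d = 0;
    // Guard against NaN, infinity, and values outside the int64 range.
    bool ok = std::isfinite(scaled) && std::fabs(scaled) < 9.0e18;
    if (ok) {
      d = static_cast<int64_t>(std::llround(scaled));
      ok = SameBits(DecodeOne(d, exponent, factor), v);
    }
    if (!ok) {
      d = 0;  // placeholder; the true value is stored as an exception
      out.exception_positions.push_back(i);
      out.exception_values.push_back(v);
    }
    out.digits.push_back(d);
  }
  return out;
}

inline std::vector<double> DecodeSketch(const AlpSketchVector& enc) {
  std::vector<double> out(enc.digits.size());
  for (std::size_t i = 0; i < enc.digits.size(); ++i) {
    out[i] = DecodeOne(enc.digits[i], enc.exponent, enc.factor);
  }
  // Patch the exceptions back in after the bulk decode.
  for (std::size_t j = 0; j < enc.exception_positions.size(); ++j) {
    out[enc.exception_positions[j]] = enc.exception_values[j];
  }
  return out;
}

// Toy stand-in for a sampler: try each exponent (factor fixed at 0) on a
// sample and keep the first one that produces the fewest exceptions.
inline int PickExponentSketch(const std::vector<double>& sample) {
  int best = 0;
  std::size_t best_exceptions = sample.size() + 1;
  for (int e = 0; e <= 17; ++e) {
    const std::size_t n = EncodeSketch(sample, e, 0).exception_positions.size();
    if (n < best_exceptions) {
      best_exceptions = n;
      best = e;
    }
  }
  return best;
}
```

With an exponent of 2 and a factor of 0, a value such as 1.23 becomes the integer 123, while 1.0 / 3.0 cannot be represented this way and is stored as an exception.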

Integration of the above code was done in

  • Encoder/Decoder cc, which exposes a wrapper to encode a buffer of data.

Unit tests were added to

  • alp_test.cc

And benchmarks were added to

  • encoding_benchmark.cc and encoding_alp_benchmark.cc

Are these changes tested?

  • We have added unit tests to test the code.
  • Benchmarks have also been added that cover a wide variety of floating-point values, from low precision to high precision.

Are there any user-facing changes?

  • It's a new encoding, so the only impact is on query performance, which we claim will only get better.

github-actions bot commented Dec 5, 2025

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch 3 times, most recently from 1b78a5c to d563ce0, on December 7, 2025 15:46
Contributor

I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations

In this case I would recommend https://github.com/apache/parquet-testing

Author

Makes sense. Thanks.
apache/parquet-testing#100

DELTA_BYTE_ARRAY = 7,
RLE_DICTIONARY = 8,
BYTE_STREAM_SPLIT = 9,
ALP = 10,
Contributor

🎉

Contributor

alamb commented Dec 8, 2025

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

@prtkgaur prtkgaur changed the title from "[Gh540] Add ALPpd encoding to parquet" to "[Gh539] Add ALPpd encoding to parquet" on Dec 8, 2025
Author

Makes sense. Thanks.
apache/parquet-testing#100

std::string tarball_path = std::string(__FILE__);
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";
Author

@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100
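
The repeated `substr`/`find_last_of` pattern in the snippet above walks up from `__FILE__` by dropping one trailing path component per call. A minimal sketch of the same idea as a small helper follows; `StripTrailingComponents` is a hypothetical name used only for illustration and is not part of this PR.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes up to `count` trailing path components,
// treating both '/' and '\\' as separators (as the snippet above does).
inline std::string StripTrailingComponents(std::string path, int count) {
  for (int i = 0; i < count; ++i) {
    const std::size_t pos = path.find_last_of("/\\");
    if (pos == std::string::npos) break;  // no separator left; keep the rest
    path.resize(pos);                     // drop the separator and what follows
  }
  return path;
}
```

Calling it with `std::string(__FILE__)` and a count of 2 would give the same base directory that the two `substr` calls compute, to which the relative data path can then be appended.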


// Unsafe resize without initialization - use only when you will immediately
// overwrite the memory (e.g., before memcpy). Only safe for POD types.
void UnsafeResize(size_t n) {
Author

Using this instead of resize gave us roughly a 2-3% performance improvement.
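
For context, `std::vector::resize` value-initializes (zero-fills) any new elements, which is wasted work when the memory is overwritten immediately afterwards. Below is a minimal sketch of a resize that skips that initialization. `RawBufferSketch` is a hypothetical class written for illustration and is an assumption about the general technique, not the buffer type used in this PR.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical buffer for trivial (POD-like) element types only.
template <typename T>
class RawBufferSketch {
  static_assert(std::is_trivial<T>::value,
                "uninitialized resize is only safe for trivial types");

 public:
  // Grows or shrinks the logical size without initializing new elements.
  // New elements hold indeterminate values until the caller writes them.
  void UnsafeResize(std::size_t n) {
    if (n > capacity_) {
      // `new T[n]` default-initializes, which is a no-op for trivial T.
      std::unique_ptr<T[]> grown(new T[n]);
      if (size_ > 0) {
        std::memcpy(grown.get(), data_.get(), size_ * sizeof(T));
      }
      data_ = std::move(grown);
      capacity_ = n;
    }
    size_ = n;
  }

  T* data() { return data_.get(); }
  std::size_t size() const { return size_; }

 private:
  std::unique_ptr<T[]> data_;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```

The intended usage is to call `UnsafeResize(n)` and then fill all `n` elements right away, for example with `std::memcpy`, before anything reads them.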

@prtkgaur prtkgaur changed the title from "[Gh539] Add ALPpd encoding to parquet" to "[Gh539][Encoding] Add ALPpd encoding to parquet" on Dec 8, 2025
@prtkgaur prtkgaur changed the title from "[Gh539][Encoding] Add ALPpd encoding to parquet" to "[Gh-539][Encoding] Add ALPpd encoding to parquet" on Dec 8, 2025
@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch from 0c035b7 to 1cb0852 on December 8, 2025 23:48