Expose Encoding attributes via the buffer protocol interface #1789

Open
wants to merge 1 commit into
base: main

mariosasko
Contributor

This PR enables access to the underlying buffers of an Encoding object via the buffer protocol interface, allowing for efficient conversion from Rust to Python for types that support that interface (e.g., NumPy, PyTorch, PyArrow).
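
As a rough illustration of what the buffer protocol offers on the Python side, here is a minimal sketch that uses only the standard library. The `array` object is a stand-in for any buffer-exposing object; it is not the Encoding API from this PR, and the token ids are made-up values.

```python
from array import array

# Stand-in for a buffer-exposing object: a contiguous block of unsigned ints.
# (Assumption: illustrative values only, not output from a real tokenizer.)
token_ids = array("I", [101, 7592, 2088, 102])

# memoryview() goes through the buffer protocol, so no elements are copied.
view = memoryview(token_ids)

# The view shares memory with the source: a write through the view
# is visible in the original array.
view[1] = 42
shared = token_ids[1] == 42

# By contrast, list() builds a separate Python int object per element,
# so changes to the list do not affect the array.
copied = list(token_ids)
copied[2] = 0
independent = token_ids[2] == 2088
```

Consumers such as NumPy can wrap the same memory in a similar way, which is where the savings on long sequences would come from.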

Based on my benchmarks, this can save more than 20% of the time when tokenizing datasets with longer sequences.

@Narsil
Collaborator

Narsil commented Jun 16, 2025

Hey, thanks for the PR. However, as you may have noticed, it removes the abi3-py38 flag, which makes this code non-portable.

Buffers only became part of the stable ABI in Python 3.11 (https://docs.python.org/3.11/c-api/buffer.html#bufferobjects), so we will most likely have to wait before we can get this rolling (see the supported versions: https://devguide.python.org/versions/).

I tried in safetensors to get something sound using feature flags to use the buffers only on those versions, but honestly it's super messy to distribute various ABIs, keep the code clean, and still provide those features.

If anyone has suggestions on how to get the best of all worlds, we're all ears.

@mariosasko
Contributor Author

Hi! This API is slightly advanced, so I guess it can wait 🙂

> I tried in safetensors to get something sound using feature flags to use the buffers only on those versions, but honestly it's super messy to distribute various ABIs, keep the code clean, and still provide those features.

For instance, pyca/cryptography uses feature flags to support the buffer interface, but this does add some complexity, so it's probably not worth it.

@Narsil
Collaborator

Narsil commented Jun 18, 2025

I checked cryptography, and it doesn't seem like they are using the abi3 features: https://github.com/pyca/cryptography/blob/fe5ba4dafaf927be60066e7b6b4763524934faf3/src/rust/src/buf.rs#L31

https://github.com/pyca/cryptography/blob/fe5ba4dafaf927be60066e7b6b4763524934faf3/src/rust/Cargo.toml#L32-L34

That's where I ran into issues when I did something similar in safetensors. There is no nice way to detect the current Python version at build time while keeping the local build simple (pip install -e . for instance) and keeping the distributed builds relatively simple too.
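
To make the version-detection problem concrete, here is a hedged sketch of the kind of helper a build script might use. The function name `select_cargo_features` and the feature name `buffer-protocol` are hypothetical, chosen only for illustration. Nothing here is taken from the tokenizers, safetensors, or cryptography build setups.

```python
import sys

# First CPython version whose limited API includes the buffer protocol.
BUFFER_IN_LIMITED_API = (3, 11)


def select_cargo_features(version=None):
    """Pick illustrative Cargo feature names for a (major, minor) version.

    Assumption: the feature names returned here are placeholders for
    whatever a real project would define in its Cargo.toml.
    """
    if version is None:
        # Default to the interpreter that is running the build.
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    if major_minor >= BUFFER_IN_LIMITED_API:
        # Newer interpreters: a stable-ABI build that can expose buffers.
        return ["abi3-py311", "buffer-protocol"]
    # Older interpreters: portable stable-ABI build without buffers.
    return ["abi3-py38"]
```

The messy part described above is not this check itself but wiring it into both the editable install and the wheels published for several ABIs, so this sketch only covers the easy half of the problem.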
