Support splitting to chunks during encode #1897

@xrl1

Description

Hello,
I use the Rust library directly and call `encode` on a given text. Because I use BERT, which has a limited token window, the encoded sequence is very often longer than the maximum sequence length.
Currently, my code splits the tokens, but it has a few drawbacks:

  1. There is no API to read and split an encoding into chunks, so the user has to slice into and copy every required field from the `Encoding` struct (see the sketch after this list).
  2. The caller has to reserve room for BERT's special tokens in each chunk and append them manually.
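For reference, here is roughly what the current workaround looks like. This is a minimal sketch, assuming a local `tokenizer.json`, BERT's usual [CLS]/[SEP] ids (101/102), and a 512-token window; it only handles the token ids:

```rust
use tokenizers::{Result, Tokenizer};

fn main() -> Result<()> {
    // Path is a placeholder for whatever BERT tokenizer file is being used.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode without special tokens so they can be added per chunk below.
    let encoding = tokenizer.encode("some very long input text ...", false)?;
    let ids = encoding.get_ids();

    // Assumed values: BERT's [CLS]/[SEP] ids and max sequence length.
    let (cls_id, sep_id, max_len) = (101u32, 102u32, 512usize);
    let body_len = max_len - 2; // leave room for [CLS] and [SEP]

    let chunks: Vec<Vec<u32>> = ids
        .chunks(body_len)
        .map(|chunk| {
            let mut chunk_ids = Vec::with_capacity(chunk.len() + 2);
            chunk_ids.push(cls_id);
            chunk_ids.extend_from_slice(chunk);
            chunk_ids.push(sep_id);
            chunk_ids
        })
        .collect();

    // The same slicing has to be repeated by hand for the attention mask,
    // type ids, offsets, etc., which is the boilerplate this issue is about.
    println!("{} chunks", chunks.len());
    Ok(())
}
```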

In addition, I have custom logic that decides where to split the tokens (for a classification task), so letting `encode` accept a callback that picks the split points would be a useful further improvement.

I want to contribute this feature myself, either as a new parameter to `encode` or as a new function, `encode_with_chunks`, but I'd like to get feedback on the approach first :)
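To make the proposal concrete, here is a rough sketch of the kind of signature I have in mind; the trait, names, and callback shape are just a starting point for discussion, nothing here exists in the library today:

```rust
use tokenizers::{Encoding, Result};

/// Hypothetical extension sketch for the proposed API (intended to be
/// implemented on `Tokenizer`); purely illustrative.
pub trait EncodeWithChunks {
    /// Encodes `input` and splits the result into chunks of at most `max_len`
    /// tokens, with special tokens and the other `Encoding` fields handled per
    /// chunk by the library. `find_split` receives the full encoding plus the
    /// window being considered and returns the index to split at, so callers
    /// can plug in task-specific logic (e.g. sentence boundaries).
    fn encode_with_chunks<F>(
        &self,
        input: &str,
        max_len: usize,
        find_split: F,
    ) -> Result<Vec<Encoding>>
    where
        F: Fn(&Encoding, std::ops::Range<usize>) -> usize;
}
```

A new parameter on `encode` would also work, but a separate function keeps the existing signature untouched and makes the chunking behaviour opt-in.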
