Hello,
I use the Rust library directly and call encode on a given text.
Because I use BERT with a limited token window, the encoded output is often longer than the maximum sequence length.
Currently, my code splits the tokens, but it has a few drawbacks:
- There is no API to read and split the encodings into chunks, so the user needs to slice into and copy all the required fields from the `Encoding` struct.
- The caller needs to reserve room for BERT's special tokens and append them to every chunk.
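For reference, the manual workaround looks roughly like this. This is a minimal sketch over raw token ids only (it ignores the other `Encoding` fields such as attention masks and offsets, which a caller would also have to copy); the `[CLS]`/`[SEP]` ids and the `chunk_ids` helper are illustrative assumptions, not existing API:

```rust
// Hypothetical BERT special-token ids used for illustration.
const CLS: u32 = 101;
const SEP: u32 = 102;

/// Split raw token ids into chunks that fit `max_len`, reserving room
/// for the special tokens and re-adding them around every chunk.
fn chunk_ids(ids: &[u32], max_len: usize) -> Vec<Vec<u32>> {
    let body = max_len - 2; // leave space for [CLS] and [SEP]
    ids.chunks(body)
        .map(|c| {
            let mut chunk = Vec::with_capacity(c.len() + 2);
            chunk.push(CLS);
            chunk.extend_from_slice(c);
            chunk.push(SEP);
            chunk
        })
        .collect()
}

fn main() {
    // 7 content tokens with a window of 5 => 3 content tokens per chunk.
    let ids: Vec<u32> = (1..=7).collect();
    for chunk in chunk_ids(&ids, 5) {
        println!("{:?}", chunk);
    }
}
```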
In addition, I have custom logic that decides where to split the tokens (for a classification task), so letting encode accept a callback that finds the split points could be a further useful improvement.
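The callback idea could be sketched like this, over raw token ids. Everything here is hypothetical (the `chunk_with` name and its signature are not existing `tokenizers` API); it only shows the shape of a caller-supplied split policy:

```rust
// Hypothetical helper: the caller supplies a function that, given the
// token ids, returns sorted split indices; the ids are cut at those
// boundaries into chunks.
fn chunk_with<F>(ids: &[u32], find_splits: F) -> Vec<Vec<u32>>
where
    F: Fn(&[u32]) -> Vec<usize>,
{
    let mut chunks = Vec::new();
    let mut start = 0;
    // Append the final boundary so the tail is always emitted.
    for end in find_splits(ids).into_iter().chain(std::iter::once(ids.len())) {
        if end > start {
            chunks.push(ids[start..end].to_vec());
        }
        start = end;
    }
    chunks
}

fn main() {
    let ids = vec![10, 11, 12, 13, 14, 15];
    // Example policy: split at fixed positions 2 and 4.
    let chunks = chunk_with(&ids, |_| vec![2, 4]);
    println!("{:?}", chunks);
}
```

A task-specific policy (e.g. splitting at sentence boundaries for classification) would replace the fixed-position closure.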
I'd like to contribute this feature myself, either as a new parameter to encode or as a new function such as encode_with_chunks, but I'd like feedback on the approach first :)