Support splitting to chunks during encode #1897

@xrl1

Description

Hello,
I use the Rust library directly and call `encode` on a given text. Because I use BERT, which has a limited token window, the encoded sequence is very often longer than the maximum sequence length.
Currently, my code splits the tokens, but it has a few drawbacks:

  1. There is no API to read and split an encoding into chunks, so the user has to slice into and copy every required field from the `Encoding` struct (see the sketch after this list).
  2. The caller has to reserve room for BERT's special tokens in each chunk and append them manually.
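For reference, here is roughly what the current workaround looks like. This is a minimal sketch, assuming a local `tokenizer.json`, BERT's usual [CLS]/[SEP] ids (101/102), and a 512-token window; it only handles the token ids:

```rust
use tokenizers::{Result, Tokenizer};

fn main() -> Result<()> {
    // Path is a placeholder for whatever BERT tokenizer file is being used.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode without special tokens so they can be added per chunk below.
    let encoding = tokenizer.encode("some very long input text ...", false)?;
    let ids = encoding.get_ids();

    // Assumed values: BERT's [CLS]/[SEP] ids and max sequence length.
    let (cls_id, sep_id, max_len) = (101u32, 102u32, 512usize);
    let body_len = max_len - 2; // leave room for [CLS] and [SEP]

    let chunks: Vec<Vec<u32>> = ids
        .chunks(body_len)
        .map(|chunk| {
            let mut chunk_ids = Vec::with_capacity(chunk.len() + 2);
            chunk_ids.push(cls_id);
            chunk_ids.extend_from_slice(chunk);
            chunk_ids.push(sep_id);
            chunk_ids
        })
        .collect();

    // The same slicing has to be repeated by hand for the attention mask,
    // type ids, offsets, etc., which is the boilerplate this issue is about.
    println!("{} chunks", chunks.len());
    Ok(())
}
```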

In addition, I have custom logic that decides where to split the tokens (for a classification task), so letting `encode` accept a callback that picks the split points would be a useful further improvement.

I want to contribute this feature myself, either as a new parameter to `encode` or as a new function, `encode_with_chunks`, but I'd like to get feedback on the approach first :)
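To make the proposal concrete, here is a rough sketch of the kind of signature I have in mind; the trait, names, and callback shape are just a starting point for discussion, nothing here exists in the library today:

```rust
use tokenizers::{Encoding, Result};

/// Hypothetical extension sketch for the proposed API (intended to be
/// implemented on `Tokenizer`); purely illustrative.
pub trait EncodeWithChunks {
    /// Encodes `input` and splits the result into chunks of at most `max_len`
    /// tokens, with special tokens and the other `Encoding` fields handled per
    /// chunk by the library. `find_split` receives the full encoding plus the
    /// window being considered and returns the index to split at, so callers
    /// can plug in task-specific logic (e.g. sentence boundaries).
    fn encode_with_chunks<F>(
        &self,
        input: &str,
        max_len: usize,
        find_split: F,
    ) -> Result<Vec<Encoding>>
    where
        F: Fn(&Encoding, std::ops::Range<usize>) -> usize;
}
```

A new parameter on `encode` would also work, but a separate function keeps the existing signature untouched and makes the chunking behaviour opt-in.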
