Feature request: Characters delimiter argument #1885

@VasLem

Description

I would like to build a BPE tokenizer whose base symbols are k-mers rather than single characters, using your beautiful Rust package, for genomic applications. Unfortunately, the package doesn't seem to support defining a character delimiter. As I see it, the change is fairly straightforward: instead of iterating over a word character by character, first split the word by the delimiter and then iterate over the resulting pieces. The delimiter would also need to be taken into account in the string representation of the computed merges. That would make it possible to split a word into multi-character symbols.

Right now I am using a modified Python version of the BPE tokenizer made by the genius Yikai-Liao, but it would be nice to see this in Rust as well, natively supported by huggingface. Unfortunately, I am still a novice in Rust; otherwise I would open a pull request with the suggested changes. Is this something that could be worked out in the future? Or is there a way to do it with the current implementation? Thank you!
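To make the request concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not the package's API or its Rust implementation: the function names `split_symbols` and `train_bpe`, the `delimiter` parameter, and the `|` separator in the usage example are all hypothetical and chosen only for illustration. Words are split by the delimiter into their initial symbols, and a merged symbol keeps the delimiter in its string form.

```python
from collections import Counter


def split_symbols(word, delimiter=None):
    """Return the initial symbols of a word (hypothetical helper).

    Without a delimiter this is the usual per-character split;
    with one, each delimited piece (e.g. a k-mer) is one symbol.
    """
    if not delimiter:
        return list(word)
    return [piece for piece in word.split(delimiter) if piece]


def train_bpe(words, num_merges, delimiter=None):
    """Learn up to `num_merges` BPE merges over delimiter-split symbols."""
    joiner = delimiter or ""
    corpus = Counter(tuple(split_symbols(w, delimiter)) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    # Keep the delimiter inside the merged symbol's string form.
                    merged.append(symbols[i] + joiner + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


if __name__ == "__main__":
    # Two "words" made of 3-mers separated by the assumed "|" delimiter.
    print(train_bpe(["ACG|TTA|ACG", "ACG|TTA|GGC"], 1, delimiter="|"))
```

Merges are stored here as pairs of symbols rather than as a single joined string, so a delimiter inside a symbol cannot be confused with the boundary between the two halves of a merge.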
