Description
I wish to develop a k-mer-based BPE tokenizer for genomic applications using your beautiful Rust package. Unfortunately, it doesn't seem to support defining a character delimiter. As I see it, this is a pretty straightforward change: instead of iterating over a word character by character, first split it by the delimiter and then iterate over the resulting pieces. Also, when merges are computed, the character delimiter should be taken into account in the string representation. That way, splitting a word into multi-character units would become feasible.

Right now I am using a modified Python version of the BPE tokenizer made by the genius Yikai-Liao, but it would be nice to see this happen in Rust as well and be natively supported by Hugging Face. Unfortunately, I am still a novice with Rust, otherwise I would open a pull request with the suggested changes. Is this something that could be worked out in the future? Or is there a way to do this with the current implementation? Thank you!
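For illustration, here is a minimal sketch of the splitting idea in plain Rust. The function name `split_symbols` and the `delimiter` option are hypothetical and are not part of the tokenizers crate; this only shows how a word could be turned into k-mer symbols before the merge loop, instead of being iterated character by character:

```rust
/// Split a word into its initial BPE symbols.
/// With no delimiter, fall back to per-character symbols (the current behaviour);
/// with a delimiter, each k-mer between delimiters becomes one atomic symbol.
fn split_symbols(word: &str, delimiter: Option<char>) -> Vec<String> {
    match delimiter {
        Some(d) => word
            .split(d)
            .filter(|s| !s.is_empty())
            .map(|s| s.to_string())
            .collect(),
        None => word.chars().map(|c| c.to_string()).collect(),
    }
}

fn main() {
    // Per-character symbols, as BPE training works today.
    assert_eq!(split_symbols("ACGT", None), vec!["A", "C", "G", "T"]);

    // K-mer symbols: 3-mers separated by a delimiter are kept intact,
    // so merges would be computed over whole k-mers rather than single characters.
    assert_eq!(
        split_symbols("ACG TGC AAA", Some(' ')),
        vec!["ACG", "TGC", "AAA"]
    );
}
```

The same delimiter would then also need to be preserved when writing out the string form of the learned merges, so that multi-character symbols can be recovered unambiguously.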