Description
I wish to develop a k-mer-based BPE tokenizer for genomic applications using your beautiful Rust package. Unfortunately, it doesn't seem to support defining a character delimiter. As I see it, this is a pretty straightforward change: instead of iterating over a word character by character, first split it by the delimiter and then iterate over the resulting pieces. Also, when merges are computed, the character delimiter should be taken into account in the string representation. That way, splitting a word into multi-character units would become feasible.

Right now I am using a modified Python version of the BPE tokenizer made by the genius Yikai-Liao, but it would be nice to see this happen in Rust as well and be natively supported by Hugging Face. Unfortunately, I am still a novice with Rust, otherwise I would open a pull request with the suggested changes. Is this something that could be worked out in the future? Or is there a way to do this with the current implementation? Thank you!
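For illustration, here is a minimal sketch of the splitting idea in plain Rust. The function name `split_symbols` and the `delimiter` option are hypothetical and are not part of the tokenizers crate; this only shows how a word could be turned into k-mer symbols before the merge loop, instead of being iterated character by character:

```rust
/// Split a word into its initial BPE symbols.
/// With no delimiter, fall back to per-character symbols (the current behaviour);
/// with a delimiter, each k-mer between delimiters becomes one atomic symbol.
fn split_symbols(word: &str, delimiter: Option<char>) -> Vec<String> {
    match delimiter {
        Some(d) => word
            .split(d)
            .filter(|s| !s.is_empty())
            .map(|s| s.to_string())
            .collect(),
        None => word.chars().map(|c| c.to_string()).collect(),
    }
}

fn main() {
    // Per-character symbols, as BPE training works today.
    assert_eq!(split_symbols("ACGT", None), vec!["A", "C", "G", "T"]);

    // K-mer symbols: 3-mers separated by a delimiter are kept intact,
    // so merges would be computed over whole k-mers rather than single characters.
    assert_eq!(
        split_symbols("ACG TGC AAA", Some(' ')),
        vec!["ACG", "TGC", "AAA"]
    );
}
```

The same delimiter would then also need to be preserved when writing out the string form of the learned merges, so that multi-character symbols can be recovered unambiguously.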