I have a couple of suggestions for the tokenizer API -- things that I have needed to work around here: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Tokenizer.swift
- add `eosToken`/`eosTokenId` to the `Tokenizer` protocol
  - this is needed to know when to stop producing tokens
  - `Tokenizer` already has `unknownToken`
  - I don't know if any of the other special tokens should be exposed, e.g. `bosToken`
- have a way to add to `TokenizerModel`/`knownTokenizers` or otherwise handle unknown tokenizers
  - right now it would probably be sufficient to map to `"PreTrainedTokenizer": BPETokenizer.self`, but in the future this might need to be more flexible
  - `TokenizerModel` is internal, as are the various classes like `BPETokenizer`
  - in my workaround I mapped string -> string, e.g. `"Qwen2Tokenizer": "PreTrainedTokenizer"`, which is perhaps the right level -- not exposing too much of the implementation
  - anyway, some kind of API to allow registration of overrides like this, or perhaps just `"PreTrainedTokenizer"` as a fallback for now
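To make the two suggestions above concrete, here is a rough, self-contained sketch. Every name in it (`EOSAwareTokenizer`, `ToyTokenizer`, `truncateAtEOS`, `TokenizerClassResolver`) is hypothetical and is not the existing library API; it only illustrates the shape of what is being asked for: an EOS token exposed on the protocol, and a string -> string override table with an optional fallback.

```swift
// Hypothetical sketch only: none of these types exist in the library.

/// Suggestion 1: expose the end-of-sequence token on the tokenizer protocol.
protocol EOSAwareTokenizer {
    var eosToken: String? { get }
    var eosTokenId: Int? { get }
}

/// Toy conformer with made-up values, used only to exercise the sketch.
struct ToyTokenizer: EOSAwareTokenizer {
    let eosToken: String? = "</s>"
    let eosTokenId: Int? = 2
}

/// Keeps generated token ids up to, but not including, the first EOS id.
/// If the tokenizer has no EOS id, the input is returned unchanged.
func truncateAtEOS(_ ids: [Int], tokenizer: EOSAwareTokenizer) -> [Int] {
    guard let eos = tokenizer.eosTokenId else { return ids }
    return Array(ids.prefix(while: { $0 != eos }))
}

/// Suggestion 2: map tokenizer class names (strings) to known class names,
/// without exposing the internal tokenizer classes themselves.
struct TokenizerClassResolver {
    private var overrides: [String: String] = [:]
    let known: Set<String>
    let fallback: String?

    init(known: Set<String>, fallback: String? = nil) {
        self.known = known
        self.fallback = fallback
    }

    /// Registers an override such as "Qwen2Tokenizer" -> "PreTrainedTokenizer".
    mutating func register(_ name: String, as knownName: String) {
        overrides[name] = knownName
    }

    /// Resolution order: explicit override, then known name, then fallback.
    func resolve(_ name: String) -> String? {
        if let mapped = overrides[name] { return mapped }
        if known.contains(name) { return name }
        return fallback
    }
}

// Usage: register one override, and separately show the fallback behaviour.
var resolver = TokenizerClassResolver(known: ["PreTrainedTokenizer"])
resolver.register("Qwen2Tokenizer", as: "PreTrainedTokenizer")

let withFallback = TokenizerClassResolver(
    known: ["PreTrainedTokenizer"],
    fallback: "PreTrainedTokenizer"
)
```

Keeping the mapping at the string level means callers never need access to internal types like `BPETokenizer`, and the fallback covers the "just use `"PreTrainedTokenizer"` for now" option.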
If these fit in with the vision for the tokenizer API, please consider them!
Thanks