Seen using commit 03d86ac. See also: preternatural-explore/mlx-swift-chat#8
The tokenizer for stabilityai/stablelm-2-zephyr-1_6b has a configuration like this:

    "pre_tokenizer": {
      "type": "Sequence",
      "pretokenizers": [
        {
          "type": "Split",
          "pattern": {
            "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
          },
          "behavior": "Removed",
          "invert": true
        },
which ends up here:

    class SplitPreTokenizer: PreTokenizer {
        ...
        func preTokenize(text: String) -> [String] {
            guard let pattern = pattern else { return [text] }
            return pattern.split(text, invert: invert)
        }
Given the input string "Why did the chicken cross the road? ", it returns an array containing a single empty string:
(lldb) p pattern.split(text, invert: true)
([String]) 1 value {
[0] = ""
}
I observed that if invert is false instead, it gives something that looks reasonable to my eyes:
(lldb) p pattern.split(text, invert: false)
([String]) 10 values {
[0] = "Why"
[1] = " did"
[2] = " the"
[3] = " chicken"
[4] = " cross"
[5] = " the"
[6] = " road"
[7] = "?"
[8] = " "
[9] = ""
}
I am not sure what the intended behavior is here. I wonder whether the meaning of invert might be ... inverted? I think the configuration itself is correct, because the Python tokenizer handles it correctly.
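
For comparison, here is a minimal sketch of the semantics I would expect. It rests on my understanding of the Hugging Face Split pre-tokenizer: with "behavior": "Removed", invert: true keeps the regex matches as tokens and drops the gaps between them, while invert: false treats the matches as delimiters and drops them. The helper `splitRemoved` is hypothetical and written only for this illustration. It is not the library's `split(_:invert:)`, and it ignores every behavior other than "Removed".

```swift
import Foundation

/// Hypothetical helper for illustration only; models just the "Removed" behavior.
func splitRemoved(_ text: String, pattern: String, invert: Bool) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [text] }
    let fullRange = NSRange(text.startIndex..., in: text)
    let matches = regex.matches(in: text, range: fullRange)

    if invert {
        // Assumed semantics: the matches are the tokens, the gaps are removed.
        return matches.compactMap { match in
            Range(match.range, in: text).map { String(text[$0]) }
        }
    }

    // Assumed semantics: the matches are delimiters and are removed;
    // only the non-empty gaps between them are kept.
    var pieces: [String] = []
    var cursor = text.startIndex
    for match in matches {
        guard let range = Range(match.range, in: text) else { continue }
        if cursor < range.lowerBound {
            pieces.append(String(text[cursor..<range.lowerBound]))
        }
        cursor = range.upperBound
    }
    if cursor < text.endIndex {
        pieces.append(String(text[cursor...]))
    }
    return pieces
}

// Same regex as in the tokenizer config, written as a Swift raw string.
let pattern = #"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"#
let text = "Why did the chicken cross the road? "

let inverted = splitRemoved(text, pattern: pattern, invert: true)
let plain = splitRemoved(text, pattern: pattern, invert: false)
print(inverted)
print(plain)
```

Under these assumptions, invert: true should give the nine pieces "Why", " did", " the", " chicken", " cross", " the", " road", "?", " " (no trailing empty string), and invert: false should give an empty array, because the regex matches cover the whole input. That is the opposite of what the current implementation returns, which is consistent with the flag being flipped.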