SplitPreTokenizer with invert true returning array with empty string #55

Seen using commit 03d86ac. See also: https://github.com/PreternaturalAI/mlx-swift-chat/issues/8

The tokenizer for `stabilityai/stablelm-2-zephyr-1_6b` has a configuration like this:

```
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": { 
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Removed",
        "invert": true
      },
```

which ends up here:

```
class SplitPreTokenizer: PreTokenizer {
...
    func preTokenize(text: String) -> [String] {
        guard let pattern = pattern else { return [text] }
        return pattern.split(text, invert: invert)
    }
```

Given the input string `"Why did the chicken cross the road? "`, it returns an array containing a single empty string:

```
(lldb) p pattern.split(text, invert: true)
([String]) 1 value {
  [0] = ""
}
```

I observed that if `invert` is `false`, it gives something that looks reasonable to my eyes (apart from the trailing empty string):

```
(lldb) p pattern.split(text, invert: false)
([String]) 10 values {
  [0] = "Why"
  [1] = " did"
  [2] = " the"
  [3] = " chicken"
  [4] = " cross"
  [5] = " the"
  [6] = " road"
  [7] = "?"
  [8] = " "
  [9] = ""
}
```

I am not sure what the behavior is supposed to be here -- I wonder if the behavior of `invert` might be ... inverted? I think the configuration is correct because the Python tokenizer behaves correctly with it.
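To make the question concrete, here is a minimal sketch of the semantics I would expect, assuming `invert: true` with `"behavior": "Removed"` means "the gaps between regex matches are the delimiters, so keep the matches themselves". `splitRemoved` is a hypothetical standalone helper written for this issue, not the library's actual `split(_:invert:)`, and it only models the `Removed` behavior:

```swift
import Foundation

/// Hypothetical helper (not part of the library): models a Split pre-tokenizer
/// with behavior "Removed".
/// - invert == false: regex matches are delimiters and are dropped; the gaps are returned.
/// - invert == true:  the gaps are delimiters and are dropped; the matches are returned.
func splitRemoved(_ text: String, pattern: String, invert: Bool) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern)
    let nsText = text as NSString
    let fullRange = NSRange(location: 0, length: nsText.length)

    var pieces: [String] = []
    var cursor = 0  // UTF-16 offset just past the previous match

    for match in regex.matches(in: text, range: fullRange) {
        let range = match.range
        if invert {
            // Keep the matched text itself, skipping empty matches.
            if range.length > 0 {
                pieces.append(nsText.substring(with: range))
            }
        } else if range.location > cursor {
            // Keep the non-empty gap between the previous match and this one.
            let gap = NSRange(location: cursor, length: range.location - cursor)
            pieces.append(nsText.substring(with: gap))
        }
        cursor = range.location + range.length
    }

    // Trailing gap after the last match (only relevant when matches are delimiters).
    if !invert && cursor < nsText.length {
        pieces.append(nsText.substring(from: cursor))
    }
    return pieces
}

// The regex from the tokenizer config, written as a raw string literal.
let pattern = #"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"#
let input = "Why did the chicken cross the road? "

print(try splitRemoved(input, pattern: pattern, invert: true))
print(try splitRemoved(input, pattern: pattern, invert: false))
```

Under that assumption, `invert: true` should yield the word-like pieces (`"Why"`, `" did"`, ...) and `invert: false` should yield nothing, since this regex matches every character of the input. That is the opposite of what the debugger output above shows, which is why I suspect the flag is being interpreted the wrong way around.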
