SplitPreTokenizer with invert true returning array with empty string #55

Open
@davidkoski

Seen using commit 03d86ac. See also: preternatural-explore/mlx-swift-chat#8

The tokenizer for stabilityai/stablelm-2-zephyr-1_6b has a configuration like this:

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": { 
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Removed",
        "invert": true
      },

which ends up here:

class SplitPreTokenizer: PreTokenizer {
...
    func preTokenize(text: String) -> [String] {
        guard let pattern = pattern else { return [text] }
        return pattern.split(text, invert: invert)
    }

Given the input string "Why did the chicken cross the road? ", it returns an array containing a single empty string:

(lldb) p pattern.split(text, invert: true)
([String]) 1 value {
  [0] = ""
}

I observed that with invert set to false it gives something that looks reasonable to my eyes:

(lldb) p pattern.split(text, invert: false)
([String]) 10 values {
  [0] = "Why"
  [1] = " did"
  [2] = " the"
  [3] = " chicken"
  [4] = " cross"
  [5] = " the"
  [6] = " road"
  [7] = "?"
  [8] = " "
  [9] = ""
}

I am not sure what the behavior is supposed to be here. I wonder if the behavior of invert might be ... inverted? I think the configuration is correct because the Python tokenizer behaves correctly with it.
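
For comparison, here is a minimal sketch of what I would expect from "behavior": "Removed" combined with "invert": true. It assumes that invert swaps the roles of matches and gaps, so the regex matches are kept and the text between them is dropped. The helper name `splitRemoved` is hypothetical and this is not the library's implementation; it only uses `NSRegularExpression` from Foundation.

```swift
import Foundation

// Hypothetical helper, not the library's code. Assumption: with "Removed"
// behavior, invert == true keeps the regex matches and drops the gaps,
// while invert == false keeps the gaps and drops the matches.
func splitRemoved(_ text: String, pattern: String, invert: Bool) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern)
    let ns = text as NSString
    let matches = regex.matches(in: text, range: NSRange(location: 0, length: ns.length))

    var pieces: [String] = []
    var cursor = 0  // UTF-16 offset just past the previous match
    for match in matches {
        let r = match.range
        if invert {
            // Keep the matched text itself (skip empty matches).
            if r.length > 0 { pieces.append(ns.substring(with: r)) }
        } else if r.location > cursor {
            // Keep the non-empty gap before this match.
            pieces.append(ns.substring(with: NSRange(location: cursor, length: r.location - cursor)))
        }
        cursor = r.location + r.length
    }
    // Trailing gap after the last match, only relevant when keeping gaps.
    if !invert && cursor < ns.length {
        pieces.append(ns.substring(from: cursor))
    }
    return pieces
}

// The pattern from the tokenizer config, with the JSON escaping removed.
let pattern = #"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"#
let pieces = try splitRemoved("Why did the chicken cross the road? ", pattern: pattern, invert: true)
print(pieces)
// ["Why", " did", " the", " chicken", " cross", " the", " road", "?", " "]
```

Under that assumption, invert: true yields the same nine pieces that the current code returns for invert: false (minus the trailing empty string), which is consistent with the flag being handled the wrong way around.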
