Hi, I found that the tokenizer behavior differs from Python transformers when I use the Phi-3 model.
**swift-transformers**

```swift
func testTokenizer() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed")
    let inputIds = tokenizer(" Hi")
    print(inputIds)
    // output: [1, 6324]
}
```
**Python transformers**

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed")
input_ids = tokenizer.encode(" Hi")
print(input_ids)
# output: [1, 29871, 6324]
```
Python transformers prepends 29871 (`▁`) before 6324. It seems to be done by the normalizer. I debugged this issue and found that the normalizer is ignored when `legacy` is `false`, at:

swift-transformers/Sources/Tokenizers/Tokenizer.swift, lines 341 to 344 in fc65432
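For context, the missing step corresponds to the non-legacy SentencePiece normalization sequence in the model's tokenizer config: a `Prepend("▁")` normalizer followed by a `Replace(" " → "▁")` step, which is why `▁` (id 29871) appears in the Python output. Below is a minimal sketch of that sequence in Swift; the `TextNormalizer` protocol and struct names are illustrative assumptions, not the actual swift-transformers API:

```swift
import Foundation

// Minimal sketch of the non-legacy SentencePiece normalization sequence
// (Prepend "▁", then Replace " " with "▁"). Names are illustrative
// assumptions, not the actual swift-transformers types.
protocol TextNormalizer {
    func normalize(_ text: String) -> String
}

/// Prepends the SentencePiece dummy prefix "▁" (U+2581) once.
struct PrependNormalizer: TextNormalizer {
    let prefix: String
    func normalize(_ text: String) -> String {
        text.hasPrefix(prefix) ? text : prefix + text
    }
}

/// Replaces every space with "▁" (the metaspace step).
struct ReplaceNormalizer: TextNormalizer {
    let pattern: String
    let content: String
    func normalize(_ text: String) -> String {
        text.replacingOccurrences(of: pattern, with: content)
    }
}

let metaspace = "\u{2581}" // "▁"
let normalizers: [TextNormalizer] = [
    PrependNormalizer(prefix: metaspace),
    ReplaceNormalizer(pattern: " ", content: metaspace),
]

// " Hi" -> "▁ Hi" -> "▁▁Hi", which the Phi-3 vocabulary tokenizes
// as "▁" (29871) + "▁Hi" (6324), matching the Python output.
let normalized = normalizers.reduce(" Hi") { $1.normalize($0) }
print(normalized) // ▁▁Hi
```

If that matches what Python transformers does, a fix might be to keep applying the configured normalizer even when `legacy` is `false`, rather than skipping it.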