Description
Hi. I ran into an issue trying to use UnigramTokenizer with the XLMRobertaTokenizer vocabulary.
Reproduce with the following in FactoryTests.swift, after adding the relevant entries to knownTokenizers, e.g. class XLMRobertaTokenizer: UnigramTokenizer {}, etc.:
func testE5() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "intfloat/multilingual-e5-small", hubApi: hubApi)
    let inputIds = tokenizer("query: how much protein should a female eat")
    print(tokenizer.decode(tokens: inputIds))
    XCTAssertEqual(inputIds, [0, 41, 1294, 12, 3642, 5045, 21308, 5608, 10, 117776, 73203, 2])
}
This results in the error:
Swift/NativeDictionary.swift:770: Fatal error: Duplicate values for key: 'َّ'
Patching UnigramTokenizer.swift:66 with the following gets the test passing:
var tmp = [String: Int]()
// A later token that compares equal to an earlier one overwrites its id.
vocab.map { $0.token }.enumerated().forEach { (id, token) in
    tmp[token] = id
}
tokensToIds = tmp
This patch does not address the root cause: whenever two tokens map to the same dictionary key, the later one overwrites the earlier one, so some vocabulary entries are lost. From visual inspection, a number of entries in what looks like Thai script seem to be affected.
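Rather than relying on visual inspection, the colliding entries could be listed programmatically. The following is a minimal sketch, not code from swift-transformers: collidingGroups is a hypothetical helper, and toyVocab is a made-up stand-in for the real vocabulary. It groups vocabulary indices by their String key and prints the Unicode scalars of every token that shares a key with another one.

```swift
// Hypothetical helper: returns groups of vocabulary indices whose tokens
// compare equal as Swift Strings and would therefore share one dictionary key.
func collidingGroups(in tokens: [String]) -> [[Int]] {
    Dictionary(grouping: tokens.indices, by: { tokens[$0] })
        .values
        .filter { $0.count > 1 }
        .sorted { $0[0] < $1[0] }
}

// Toy stand-in for the real vocabulary (an assumption for illustration only).
let toyVocab = ["a", "\u{00E9}", "b", "e\u{0301}"]

for group in collidingGroups(in: toyVocab) {
    for id in group {
        // Show the exact scalars so visually identical tokens can be told apart.
        let scalars = toyVocab[id].unicodeScalars
            .map { "U+" + String($0.value, radix: 16, uppercase: true) }
            .joined(separator: " ")
        print(id, scalars)
    }
}
```

Running the same helper over vocab.map { $0.token } should show exactly which ids the patch above would drop.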
I don't know enough about Swift strings to determine whether this is a bug in swift-transformers or a problem with the vocabulary file.
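One possible explanation, not verified against the actual vocabulary file: Swift compares and hashes String values by Unicode canonical equivalence, so two tokens made of different scalar sequences can still count as the same dictionary key. The sketch below illustrates this with a toy pair of tokens and shows one conceivable way to keep such tokens distinct. ScalarKey is a hypothetical type introduced here for illustration, not something that exists in swift-transformers.

```swift
// Hypothetical key type: equality and hashing use the exact Unicode scalars,
// so canonically equivalent but differently encoded tokens stay distinct.
struct ScalarKey: Hashable {
    let scalars: [Unicode.Scalar]
    init(_ token: String) { scalars = Array(token.unicodeScalars) }
}

let composed = "\u{00E9}"     // "é" as a single precomposed scalar
let decomposed = "e\u{0301}"  // "e" followed by a combining acute accent

// String equality follows canonical equivalence; ScalarKey does not.
let equalAsStrings = (composed == decomposed)
let equalAsKeys = (ScalarKey(composed) == ScalarKey(decomposed))

// Building the lookup table with ScalarKey keeps both entries.
var idsByKey = [ScalarKey: Int]()
for (id, token) in [composed, decomposed].enumerated() {
    idsByKey[ScalarKey(token)] = id
}

print(equalAsStrings, equalAsKeys, idsByKey.count)  // prints "true false 2"
```

If this is what is happening, the vocabulary file itself would be fine, and the lookup would need a key that preserves the exact encoding of each token; whether that is the right fix for the library is a separate question.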
Thanks.