XLMRobertaTokenizer error

Hi. I ran into an issue while trying to use `UnigramTokenizer` with the XLMRobertaTokenizer vocabulary.

To reproduce, add the following test to `FactoryTests.swift` after adding the relevant entries to `knownTokenizers` (e.g. `class XLMRobertaTokenizer: UnigramTokenizer {}`, etc.):

```swift
func testE5() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "intfloat/multilingual-e5-small", hubApi: hubApi)
    let inputIds = tokenizer("query: how much protein should a female eat")
    print(tokenizer.decode(tokens: inputIds))
    XCTAssertEqual(inputIds, [0, 41, 1294, 12, 3642, 5045, 21308, 5608, 10, 117776, 73203, 2])
}
```

This results in the following error:

`Swift/NativeDictionary.swift:770: Fatal error: Duplicate values for key: 'َّ'`

Patching `UnigramTokenizer.swift:66` with the following will get the test passing:

```swift
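// Build the token -> id map by plain assignment, so a repeated key
// overwrites the earlier id instead of trapping.
var tmp = [String: Int]()

vocab.map { $0.token }.enumerated().forEach { (id, token) in
    tmp[token] = id
}

tokensToIds = tmp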
```

This patch does not address the root cause, and it will cause some vocabulary entries to be lost: whenever two entries map to the same key, only the last id is kept. From visual inspection, it seems that a number of entries in what looks like Thai script are affected.
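To list the affected entries instead of relying on visual inspection, something like the following could be used. This is only a sketch: `collidingTokens(in:)` is a hypothetical helper, not part of `swift-transformers`, and `sampleTokens` is stand-in data rather than the real vocabulary.

```swift
// Hypothetical helper: returns groups of tokens that Swift treats as the same String key.
func collidingTokens(in tokens: [String]) -> [[String]] {
    // Dictionary(grouping:by:) never traps on repeated keys; it collects them.
    let groups = Dictionary(grouping: tokens, by: { $0 })
    return groups.values.filter { $0.count > 1 }
}

// Stand-in data, not taken from the real vocabulary file.
let sampleTokens = ["a", "\u{00E9}", "e\u{0301}", "b"]

for group in collidingTokens(in: sampleTokens) {
    // Print the scalar values in hex so visually identical entries can be told apart.
    let scalars = group.map { token in
        token.unicodeScalars.map { String($0.value, radix: 16) }
    }
    print(scalars)
}
```

In the real case the input would be the token strings from the vocabulary, e.g. `vocab.map { $0.token }`.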

I don't know enough about Swift strings to determine if this is a bug in `swift-transformers` or a problem with the vocabulary file.
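One possible explanation, not verified against the vocabulary file: Swift compares `String` values by Unicode canonical equivalence, so two entries made of different scalar sequences can still be equal as dictionary keys. The "Duplicate values for key" message is the one produced by `Dictionary(uniqueKeysWithValues:)` when it receives such a pair. The snippet below only illustrates that behaviour with a Latin example; whether the failing entries in this vocabulary actually differ in this way is an assumption.

```swift
let precomposed = "\u{00E9}"   // "é" as a single scalar
let decomposed = "e\u{0301}"   // "e" followed by a combining acute accent

// The scalar sequences differ...
let sameScalars = Array(precomposed.unicodeScalars) == Array(decomposed.unicodeScalars)
// ...but the two values are equal as Strings, so they form a single dictionary key.
let sameString = precomposed == decomposed
print(sameScalars, sameString)

// Dictionary(uniqueKeysWithValues:) would trap on this pair.
// The uniquing initializer does not; this closure keeps the first id.
let pairs = [(precomposed, 0), (decomposed, 1)]
let merged = Dictionary(pairs, uniquingKeysWith: { first, _ in first })
print(merged.count)
```

If this is the cause, both the patch above and a uniquing initializer still collapse the two entries into one, so keeping every vocabulary entry would need a key that preserves the exact scalars or bytes.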

Thanks.

XLMRobertaTokenizer error #99
