
Edge tokenization issues: Unicode parsing #116

Open
@pcuenca

Description

Regarding the [missing tokens in the parsed vocabulary](https://github.com/huggingface/swift-transformers/pull/113#issuecomment-2267520368), this is my documentation after tracking down one of the issues.

First, we are parsing the JSON file (tokenizer.json) using JSONSerialization.jsonObject. This reads data as Foundation objects, parsing tokens from the vocab dictionary as NSString instances. This is a good thing: Strings cannot be used as keys in the vocab dictionary because String equality only considers the Unicode canonical representation, so parsing the JSON and casting to [String : Int] would silently collapse multiple entries into one.
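To illustrate why NSString keys are needed, here is a minimal sketch (the precomposed/decomposed "é" pair is my own example, not a token from the Gemma vocab): the two forms are canonically equivalent, so they are equal as Strings, but NSString's `isEqual(to:)` performs a literal Unicode comparison and keeps them distinct.

```swift
import Foundation

// Precomposed "é" (U+00E9) vs decomposed "e" + combining acute (U+0301).
let precomposed = "\u{00E9}"
let decomposed = "e\u{0301}"

// String equality uses Unicode canonical equivalence: these compare equal,
// so a [String: Int] vocab would merge them into a single key.
assert(precomposed == decomposed)

// NSString comparison is literal (code-unit based): they stay distinct,
// so an [NSString: Int] vocab preserves both entries.
assert(!(precomposed as NSString).isEqual(to: decomposed))
```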

However, I found that JSONSerialization fails to correctly parse some strings. Consider the following test case:

    func testArrayParsingWithBOMPrefix() {
        // The second one starts with a BOM prefix
        let items = ["a", "\u{feff}a"]

        // Neither Strings nor NSStrings are equal
        XCTAssertNotEqual(items[0], items[1])
        XCTAssertNotEqual(items[0] as NSString, items[1] as NSString)

        // JSONDecoder works
        let jsonData = try! JSONSerialization.data(withJSONObject: items, options: [])
        let decoder = JSONDecoder()
        let decoded = try! decoder.decode([String].self, from: jsonData)
        XCTAssertEqual(decoded, items)

        // JSONSerialization seems to ignore the BOM.
        // The decoded array contains two items, but they are the same NSString.
        let ns_decoded = try! JSONSerialization.jsonObject(with: jsonData, options: []) as! NSArray
        XCTAssertEqual(ns_decoded.count, items.count)                               // passes
        XCTAssertNotEqual(ns_decoded[0] as! NSString, ns_decoded[1] as! NSString)   // fails
        XCTAssertEqual(ns_decoded as! [String], items)                              // fails

        // Compare unicodeScalars
        func scalars(_ string: String) -> [UInt32] {
            string.unicodeScalars.map { $0.value }
        }
        for (decoded, expected) in zip(ns_decoded, items) {
            let decodedScalars = scalars(decoded as! String)
            let expectedScalars = scalars(expected)
            XCTAssertEqual(decodedScalars, expectedScalars)         // first passes, second fails
        }
    }

There are two strings in the test array. The second one starts with a BOM prefix. The prefix is ignored when parsing the two NSStrings, as confirmed by looking at the Unicode scalars in the debugger. Unfortunately, the Gemma vocab contains some duplicate entries with and without a BOM prefix, so reading them into a dictionary skips some entries.
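A minimal sketch of the consequence (the tokens and IDs here are hypothetical, not the actual Gemma entries): once the BOM is stripped, two vocab entries collide on the same key, and building a dictionary silently drops one of them.

```swift
import Foundation

// After JSONSerialization strips the BOM, both tokens parse to "a".
// Hypothetical token IDs for illustration.
let parsedTokens: [(NSString, Int)] = [
    ("a" as NSString, 1),
    ("a" as NSString, 2),  // was "\u{feff}a" before the BOM was lost
]

// Resolving the collision by keeping the first value: the second
// vocab entry is silently skipped.
let vocab = Dictionary(parsedTokens, uniquingKeysWith: { first, _ in first })
assert(vocab.count == 1)   // one entry lost
assert(vocab["a"] == 1)
```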

Interestingly, all the tests pass if the BOM character is in the middle of the string. Replacing the test items with these works fine:

        // If the BOM character is inside the String, all tests pass
        let items = ["ab", "a\u{feff}b"]

I suspect a streaming parser is used internally, and the stream is incorrectly assumed to start with a BOM even though the sequence appears in the middle of the actual JSON data.

Also interestingly, JSONDecoder works and can decode the two distinct String instances in the array. We are not using JSONDecoder in this project because:

  • The structure of the JSON files to be parsed is quite open and flexible; I don't think it would be straightforward to write a Decodable structure that represents it. Instead, we use dynamic member lookup to navigate the contents.
  • We can't use String instances for vocab keys, as mentioned above.
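For context, the dynamic-member-lookup approach can be sketched like this (a minimal illustration; `JSONNode` and its members are hypothetical names, not the project's actual wrapper type):

```swift
import Foundation

// A tiny wrapper that navigates a JSONSerialization object tree
// via dynamic member lookup, without declaring a Decodable schema.
@dynamicMemberLookup
struct JSONNode {
    let value: Any?

    subscript(dynamicMember key: String) -> JSONNode {
        let dict = value as? [NSString: Any]
        return JSONNode(value: dict?[key as NSString])
    }

    var intValue: Int? { value as? Int }
}

let json = """
{"model": {"vocab_size": 32000}}
""".data(using: .utf8)!

let root = JSONNode(value: try? JSONSerialization.jsonObject(with: json))

// Nested keys read like property accesses:
let vocabSize = root.model.vocab_size.intValue
assert(vocabSize == 32000)
```

Missing keys simply propagate a `nil` value down the chain instead of throwing, which suits the open-ended structure of tokenizer configs.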

I'm not sure how to deal with this.

Originally posted by @pcuenca in #113 (comment)
