Experimental: tokenizers with and without templates #168
base: main
Conversation
```swift
        try applyChatTemplate(messages: messages, chatTemplate: .literal(chatTemplate), addGenerationPrompt: true, truncation: false, maxLength: nil, tools: nil)
    }
}
```
See comment in `TokenizersTemplates`.
```swift
]

open class PreTrainedTokenizerWithTemplates: PreTrainedTokenizer {
    // I don't know why these need to be here. They are implemented in the protocol, **and** in the superclass.
```
Yes, if these overrides don't exist, the linker can't find the implementations.
```swift
import Foundation
import Hub

@_exported import TokenizersCore
```
This is a good chunk of the magic. The new `Tokenizers` implementation is just this wrapper file, which exposes the imported `TokenizersCore` types.
```swift
#if canImport(TokenizersTemplates)
import TokenizersTemplates
public typealias PreTrainedTokenizer = PreTrainedTokenizerWithTemplates
#endif
```
So if `TokenizersTemplates` is available (because users have declared it as a dependency), we override the definition so the factory below uses the subclass.
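To make that concrete, this is roughly what a consumer's manifest would declare under this scheme (a sketch: the target name is illustrative, only the two product names come from this PR):

```swift
// Hypothetical consumer target: declaring TokenizersTemplates alongside
// Tokenizers is what makes `canImport(TokenizersTemplates)` succeed in the
// wrapper, switching the PreTrainedTokenizer typealias to the subclass.
.target(
    name: "MyApp",  // illustrative name
    dependencies: [
        .product(name: "Tokenizers", package: "swift-transformers"),
        .product(name: "TokenizersTemplates", package: "swift-transformers"),
    ]
)
```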
```swift
}

// See https://github.com/xenova/transformers.js/blob/1a9964fb09b8f54fcbeac46dc6aae8d76795809d/src/tokenizers.js#L3203 for these exceptions
class LlamaPreTrainedTokenizer: PreTrainedTokenizer {
```
This could be moved back to `TokenizersCore`, but then we'd need a `typealias` here as well (and a subclass).
cc @greenrazer @FL33TW00D @Vaibhavs10 for opinions and feedback.
I like this solution. However, I cannot get the […]. I've tried a few different options:
Since users must explicitly import […]
I found inconsistencies in SPM resolution when using Xcode vs the command line, or when running unit tests, or when using local vs remote dependencies. The auto-magical wrapper was a bit too fragile, or I don't know how to make it work consistently. The gist of the problem is that `canImport` runs at compile time, and different modules may or may not be available depending on compilation order.
@codex review
Codex Review: Here are some suggestions.
```swift
) throws -> [Int] {
    if additionalContext == nil {
        try applyChatTemplate(
            messages: messages, chatTemplate: chatTemplate,
            addGenerationPrompt: addGenerationPrompt, truncation: truncation,
            maxLength: maxLength,
            tools: tools
        )
    } else {
```
[P0] Missing return in default chat-template overload

The fallback implementation of `Tokenizer.applyChatTemplate(..., additionalContext:)` calls the six-parameter overload but never returns its result. Because neither branch returns a value, the compiler reports "Missing return in a function expected to return '[Int]'" and the module will not build. This branch should return the value from the inner call so the default implementation is usable.
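A reduced sketch of the fix (stand-in types and placeholder bodies; the real overloads take the full parameter list shown in the diff): adding `return` to the branch is what makes the default implementation compile.

```swift
enum TemplateError: Error { case unsupportedContext }  // placeholder error type

// Stand-in for the six-parameter overload.
func applyChatTemplate(messages: [[String: String]]) throws -> [Int] {
    []  // placeholder result
}

// Stand-in for the additionalContext overload described in the comment.
func applyChatTemplate(
    messages: [[String: String]],
    additionalContext: [String: Any]?
) throws -> [Int] {
    if additionalContext == nil {
        // The fix: explicitly return the inner call's result. Without
        // `return`, neither branch yields a value and the compiler reports
        // "Missing return in a function expected to return '[Int]'".
        return try applyChatTemplate(messages: messages)
    } else {
        throw TemplateError.unsupportedContext  // placeholder branch
    }
}
```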
```diff
 .executableTarget(
     name: "TransformersCLI",
     dependencies: [
-        "Models", "Generation", "Tokenizers",
-        .product(name: "ArgumentParser", package: "swift-argument-parser"),
+        "Models", .product(name: "ArgumentParser", package: "swift-argument-parser"),
```
[P0] TransformersCLI target omits Generation dependency

The new TransformersCLI.swift file imports `Generation`, but the `TransformersCLI` executable target now only depends on `Models` and `ArgumentParser`. Because `Generation` is not listed, a build will fail with `No such module 'Generation'`, preventing the CLI from compiling. The target still needs a direct dependency on `Generation` (and any other modules it imports).
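The fix would be to keep `Generation` (and anything else the CLI still imports) in the target's dependency list; a sketch based on the diff above:

```swift
.executableTarget(
    name: "TransformersCLI",
    dependencies: [
        // Generation must stay listed because TransformersCLI.swift imports it.
        "Models", "Generation",
        .product(name: "ArgumentParser", package: "swift-argument-parser"),
    ]
)
```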
Pending stuff / edge cases / annoying issues

- Xcode builds `Tokenizers` without the trait. It works in the CLI with `swift build --traits ChatTemplates`. I don't know if there's a workaround for Xcode.
- Tests also build `Tokenizers` without the trait, so the chat template tests don't pass. It works in the command line with `swift test --traits ChatTemplates` (or `swift test --filter TokenizersTests.ChatTemplateTests --traits ChatTemplates` to run just the chat template tests).

This builds on top of #166 by @greenrazer.
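In a consumer manifest, the trait-enabled dependency declaration might look like this (a sketch assuming SPM's Swift 6.1 `traits:` parameter on `.package`; URL and branch are illustrative):

```swift
// swift-tools-version: 6.1  (package traits require Swift 6.1)
dependencies: [
    .package(
        url: "https://github.com/huggingface/swift-transformers",
        branch: "main",  // illustrative; pin a version in real projects
        traits: ["ChatTemplates"]
    )
]
```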
Two new top-level library products are exposed:

- `Hub`
- `Tokenizers`

(`Transformers` still exists, and comprises everything in the package, including tensor ops and Core ML inference.)

The `Tokenizers` library is heavy when using chat templates, because it requires a Jinja template engine and swift-collections. Thanks to @mattt, we can easily opt in to this feature using package traits, which require Swift 6.1. We attempted another solution that was compatible with previous versions of Swift, but we found it to be too unreliable.

How to use
- Declare a dependency on the `Tokenizers` product.
- `[email protected]` applies.
- Enable the `ChatTemplates` trait:

Previous discussion for reference (no longer applies)
The goal is to be able to opt in to the chat template feature, which carries the Jinja dependency, which in turn depends on swift-collections and whatnot. It works in its current state, but it's ugly – I found way more cross-module quirks than I was expecting. I wanted to make it as easy as possible for consumers and allow using either version (with or without templates) from their SPM manifests.
Opinions, corrections, and alternative ideas are encouraged and most welcome!
How to use:

Add the following dependency to your target, as usual (except `Tokenizers` is now a product, no need to use the full `Transformers` lib). This applies to projects such as WhisperKit and mlx-swift-examples.
```diff
 dependencies: [
     .product(name: "Tokenizers", package: "swift-transformers"),
+    .product(name: "TokenizersTemplates", package: "swift-transformers"),
 ],
```
That's it: you don't need to import `TokenizersTemplates` or do anything other than declare it as a dependency.

Known issues: