Experimental: tokenizers with and without templates #168
base: main
Conversation
```swift
        try applyChatTemplate(messages: messages, chatTemplate: .literal(chatTemplate), addGenerationPrompt: true, truncation: false, maxLength: nil, tools: nil)
    }
}
```
See comment in `TokenizersTemplates`.
```swift
]

open class PreTrainedTokenizerWithTemplates: PreTrainedTokenizer {
    // I don't know why these need to be here. They are implemented in the protocol, **and** in the superclass.
```
Yes, if these overrides don't exist, the linker can't find the implementations.
```swift
import Foundation
import Hub

@_exported import TokenizersCore
```
This is a good chunk of the magic. The new `Tokenizers` implementation is just this wrapper file, which exposes the imported `TokenizersCore` types.
```swift
#if canImport(TokenizersTemplates)
import TokenizersTemplates
public typealias PreTrainedTokenizer = PreTrainedTokenizerWithTemplates
#endif
```
So if `TokenizersTemplates` is available (because users have declared it as a dependency), we override the definition so the factory below uses the subclass.
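To make that concrete, this is roughly what a consumer's manifest would declare under this scheme (a sketch: the target name is illustrative, only the two product names come from this PR):

```swift
// Hypothetical consumer target: declaring TokenizersTemplates alongside
// Tokenizers is what makes `canImport(TokenizersTemplates)` succeed in the
// wrapper, switching the PreTrainedTokenizer typealias to the subclass.
.target(
    name: "MyApp",  // illustrative name
    dependencies: [
        .product(name: "Tokenizers", package: "swift-transformers"),
        .product(name: "TokenizersTemplates", package: "swift-transformers"),
    ]
)
```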
```swift
}

// See https://github.com/xenova/transformers.js/blob/1a9964fb09b8f54fcbeac46dc6aae8d76795809d/src/tokenizers.js#L3203 for these exceptions
class LlamaPreTrainedTokenizer: PreTrainedTokenizer {
```
This could be moved back to `TokenizersCore`, but then we'd need a `typealias` here as well (and a subclass).
cc @greenrazer @FL33TW00D @Vaibhavs10 for opinions and feedback.
I like this solution. However, I cannot get the […]. I've tried a few different options:
Since users must explicitly import […]
I found inconsistencies in SPM resolution when using Xcode vs the command line, or when running unit tests, or when using local vs remote dependencies. The auto-magical wrapper was a bit too fragile, or I don't know how to make it work consistently. The gist of the problem is that `canImport` runs at compile time, and different modules may or may not be available depending on compilation order.
@codex review
Codex Review: Here are some suggestions.
```swift
) throws -> [Int] {
    if additionalContext == nil {
        try applyChatTemplate(
            messages: messages, chatTemplate: chatTemplate,
            addGenerationPrompt: addGenerationPrompt, truncation: truncation,
            maxLength: maxLength,
            tools: tools
        )
    } else {
```
[P0] Missing return in default chat-template overload

The fallback implementation of `Tokenizer.applyChatTemplate(..., additionalContext:)` calls the six-parameter overload but never returns its result. Because neither branch returns a value, the compiler reports "Missing return in a function expected to return '[Int]'" and the module will not build. This branch should return the value from the inner call so the default implementation is usable.
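A reduced sketch of the fix (stand-in types and placeholder bodies; the real overloads take the full parameter list shown in the diff): adding `return` to the branch is what makes the default implementation compile.

```swift
enum TemplateError: Error { case unsupportedContext }  // placeholder error type

// Stand-in for the six-parameter overload.
func applyChatTemplate(messages: [[String: String]]) throws -> [Int] {
    []  // placeholder result
}

// Stand-in for the additionalContext overload described in the comment.
func applyChatTemplate(
    messages: [[String: String]],
    additionalContext: [String: Any]?
) throws -> [Int] {
    if additionalContext == nil {
        // The fix: explicitly return the inner call's result. Without
        // `return`, neither branch yields a value and the compiler reports
        // "Missing return in a function expected to return '[Int]'".
        return try applyChatTemplate(messages: messages)
    } else {
        throw TemplateError.unsupportedContext  // placeholder branch
    }
}
```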
```diff
 .executableTarget(
     name: "TransformersCLI",
     dependencies: [
-        "Models", "Generation", "Tokenizers",
-        .product(name: "ArgumentParser", package: "swift-argument-parser"),
+        "Models", .product(name: "ArgumentParser", package: "swift-argument-parser"),
```
[P0] TransformersCLI target omits Generation dependency

The new TransformersCLI.swift file imports `Generation`, but the `TransformersCLI` executable target now only depends on `Models` and `ArgumentParser`. Because `Generation` is not listed, a build will fail with `No such module 'Generation'`, preventing the CLI from compiling. The target still needs a direct dependency on `Generation` (and any other modules it imports).
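The fix would be to keep `Generation` (and anything else the CLI still imports) in the target's dependency list; a sketch based on the diff above:

```swift
.executableTarget(
    name: "TransformersCLI",
    dependencies: [
        // Generation must stay listed because TransformersCLI.swift imports it.
        "Models", "Generation",
        .product(name: "ArgumentParser", package: "swift-argument-parser"),
    ]
)
```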
Pending stuff / edge cases / annoying issues

- Xcode builds `Tokenizers` without the trait. It works in the CLI with `swift build --traits ChatTemplates`. I don't know if there's a workaround for Xcode.
- Tests also build `Tokenizers` without the trait, so the chat template tests don't pass. It works in the command line with `swift test --traits ChatTemplates` (or `swift test --filter TokenizersTests.ChatTemplateTests --traits ChatTemplates` to run just the chat template tests).

This builds on top of #166 by @greenrazer.
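In a consumer manifest, the trait-enabled dependency declaration might look like this (a sketch assuming SPM's Swift 6.1 `traits:` parameter on `.package`; URL and branch are illustrative):

```swift
// swift-tools-version: 6.1  (package traits require Swift 6.1)
dependencies: [
    .package(
        url: "https://github.com/huggingface/swift-transformers",
        branch: "main",  // illustrative; pin a version in real projects
        traits: ["ChatTemplates"]
    )
]
```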
Two new top-level library products are exposed:

- `Hub`
- `Tokenizers`

(`Transformers` still exists, and comprises everything in the package, including tensor ops and Core ML inference.)

The `Tokenizers` library is heavy when using chat templates, because it requires a Jinja template engine and swift-collections. Thanks to @mattt, we can easily opt in to this feature using package traits, which require Swift 6.1. We attempted another solution that was compatible with previous versions of Swift, but we found it to be too unreliable.

How to use
- Declare a dependency on the `Tokenizers` product.
- `[email protected]` applies.
- Enable the `ChatTemplates` trait:

Previous discussion for reference (no longer applies)
The goal is to be able to opt in to the chat template feature, which carries the Jinja dependency, which in turn depends on swift-collections and whatnot. It works in its current state, but it's ugly – I found way more cross-module quirks than I was expecting. I wanted to make it as easy as possible for consumers and allow using either version (with or without templates) from their SPM manifests.
Opinions, corrections, and alternative ideas are encouraged and most welcome!
How to use:

Add the following dependency to your target, as usual (except `Tokenizers` is now a product, no need to use the full `Transformers` lib). This applies to projects such as WhisperKit and mlx-swift-examples.
```diff
 dependencies: [
     .product(name: "Tokenizers", package: "swift-transformers"),
+    .product(name: "TokenizersTemplates", package: "swift-transformers"),
 ],
```
That's it: you don't need to import `TokenizersTemplates` or do anything other than declare it as a dependency.

Known issues: