AutoGuess tests #1650
Conversation
Some of the tokenizer configs are pretty big, so perhaps it would be warranted to pull out only the chat template and save only that.
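Stripping a config down to its chat template could look like the following sketch. It assumes the standard `tokenizer_config.json` layout, where the template lives under a top-level `chat_template` key; the function name and the tiny stand-in config are illustrative only.

```python
import json

def extract_chat_template(config_text: str) -> dict:
    """Keep only the chat template from a full tokenizer_config.json payload."""
    config = json.loads(config_text)
    return {"chat_template": config.get("chat_template")}

# Tiny stand-in config (real ones carry far more keys, hence the size concern):
full = json.dumps({
    "chat_template": "{{ messages }}",
    "model_max_length": 8192,
    "added_tokens_decoder": {},
})
slim = extract_chat_template(full)
```

The slimmed-down dict is all the test needs, so only a few hundred bytes per model would end up in the repo.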
As for RWKV, the search strings require that "rwkv-world" is present, but it is not in the above example. Maybe "rwkv_tokenizer_end_of_text" is better? Not sure how RWKV World differs from other RWKVs, if at all.
I took a stab at a GitHub workflow to make this trigger in PRs when someone modifies AutoGuess.json, but I can't test it. I can drop that commit if it seems broken.
Force-pushed 5f04351 to 01c57d6.
OK, I was able to get this running at kallewoof#1. Got it working; force-pushed the complete solution.
Force-pushed 01c57d6 to 038328e.
I don't think we should bundle those tokenizer configs into the repo. If you're making a workflow for it, we can simply download them on demand in the Python test script (in fact, they don't even need to be stored on disk, just kept temporarily in memory).
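An on-demand, in-memory fetch could be sketched as below. The `resolve/main` URL layout is the usual Hugging Face file path, but treat it as an assumption; the injectable `opener` parameter is a hypothetical addition so the fetch can be stubbed out without network access.

```python
import io
import json
import urllib.request

# Assumed Hugging Face raw-file URL layout.
HF_URL = "https://huggingface.co/{repo}/resolve/main/tokenizer_config.json"

def fetch_tokenizer_config(repo: str, opener=urllib.request.urlopen) -> dict:
    """Download a tokenizer_config.json straight into memory; nothing is written to disk."""
    with opener(HF_URL.format(repo=repo)) as resp:
        return json.loads(resp.read())

# With `opener` stubbed, the same code path runs offline:
fake = lambda url: io.BytesIO(json.dumps({"chat_template": "{{ messages }}"}).encode())
cfg = fetch_tokenizer_config("some/repo", opener=fake)
```

Ungated repos would work with the default opener; gated ones would still need the external clone discussed below.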
Force-pushed 1dbae35 to 15b1034.
I guess someone could make a Hugging Face account and put all the gated tokenizers there? Is that what you were envisioning?
I made a GitHub repository and put the gated tokenizer configs there. The workflow now git-clones and uses that repo instead. The one drawback with this is that people can't just run the test without first cloning the gated-tokenizers repo.
Btw, I know this may seem like a lot of work for this relatively minor feature, but (1) I am hoping to also add a check where the `apply_chat_template` results are compared to the adapter config (edit: see #1654), which will catch mis-configured adapters, and (2) make this into a de facto standard for use elsewhere, e.g. as a basis for the Silly Tavern chat template derivation.
kallewoof#1 updated for reference.
I now have a working follow-up to this in #1654. This is all demonstrated in kallewoof#1: with these two commits, the Transformers tokenizer
Alright, thanks. Give me some time, I'll take a look.
Alright, seems to be working fine. As for the RWKV World template, I am honestly unsure: it was provided to me by someone else and I did not try it myself, so I'm fine with changing it. Also #1627, but this is good enough to merge, so I will merge this first.
This adds a `tests/` folder with a single `test_autoguess.py` script, which returns a zero exit code iff all checks pass, and exit code 1 otherwise. It also adds a GitHub workflow that runs this on any pull request which touches `AutoGuess.json`.

The test currently fails only the RWKV model, which is tested against fla-hub/rwkv7-1.5B-world. I started off with the ambition of finding open, ungated models for every template (which meant some odd choices for certain models whose official releases are gated), but gave up in favor of a best-effort approach where the gated models' tokenizer configs are saved (as-is) in an external GitHub repository that is cloned in the GitHub workflow. You may even argue that this repository should include all the templates; especially if we put this into CI, where the tests would repeatedly download them from HF, it may be better to simply push them to git instead.
Results so far: