Support different tokenizers #1318
Conversation
I'm still trying to comprehend the tokenizer space ...
Can we substitute https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/__init__.py#L82 with the new function in this PR and get identical results?
I was wondering if the change is general enough, as I saw torchtune has this folder https://github.com/pytorch/torchtune/tree/main/torchtune/modules/transforms/tokenizers
Hi @tianyu-l, some additional context.

Spiritually you can think of this as an alternative to `AutoTokenizer` from the `transformers` library. The only difference is that in our implementation we split up the steps: 1) download the tokenizer files, then 2) load them and apply the tokenizer configuration.
It should; in my unit test I validate this against DeepSeek-V3. But I was denied access to llama3, so I'm trying to unblock myself there to test.
I might be missing some things, but the other tokenizers I tested also seemed to work. As long as the models we want to support in torchtitan are well tested, I think this should be okay? When loading the tokenizer files there are multiple strategies, and which tokenizer gets loaded depends on the files that are available (https://github.com/pytorch/torchtitan/pull/1318/files#diff-87d941cc62c11fe923ef1118a1b2e8319c9126a775b6151fe539e1472d0656a0R92-R136). If I understand correctly, that is basically what torchtune is doing too, albeit maybe they cover more edge cases?
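To illustrate the idea, here is a condensed, hypothetical sketch of that kind of file-based dispatch; the function name and strategy order are assumptions, and the real PR code handles more cases:

```python
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE


def load_tokenizer(tokenizer_dir: str) -> Tokenizer:
    """Pick a loading strategy based on which tokenizer files exist."""
    tokenizer_json = os.path.join(tokenizer_dir, "tokenizer.json")
    vocab_json = os.path.join(tokenizer_dir, "vocab.json")
    merges_txt = os.path.join(tokenizer_dir, "merges.txt")

    if os.path.exists(tokenizer_json):
        # Preferred: a complete `tokenizers`-native definition.
        return Tokenizer.from_file(tokenizer_json)
    if os.path.exists(vocab_json) and os.path.exists(merges_txt):
        # Fallback: rebuild a BPE tokenizer from vocab + merges files.
        return Tokenizer(BPE.from_file(vocab_json, merges_txt))
    raise FileNotFoundError(f"no supported tokenizer files in {tokenizer_dir}")
```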
I could not access the llama3 dataset anymore. I unblocked myself by accessing the dataset internally and putting it in the expected folder. This works with the existing tokenizer; I would expect the new one to have the same behavior (no need to download if one already exists).
Force-pushed from 4813c49 to b606478.
```
    our_tokenizer, transformers_tokenizer, test_repo_id
)

def test_backward_compatibility(self):
```
Validating that `TikTokenizer` (old implementation) encodes and decodes the same as `HuggingFaceTokenizer` (new implementation). Will remove this test after we remove `TikTokenizer`.
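Roughly, the check amounts to something like this (a sketch only; the actual test code and the old encode signature may differ):

```python
def assert_tokenizers_equivalent(old_tokenizer, new_tokenizer, sample_texts):
    """Encode/decode the same texts with both tokenizers and compare."""
    for text in sample_texts:
        old_ids = old_tokenizer.encode(text, bos=False, eos=False)  # signature assumed
        new_ids = new_tokenizer.encode(text)
        assert old_ids == new_ids, f"encode mismatch on {text!r}"
        assert old_tokenizer.decode(old_ids) == new_tokenizer.decode(new_ids)
```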
Force-pushed from 4a6aa00 to 7c306ef.
Since this PR is not deepseek-specific, I think we can land it in the main branch, assuming the deepseek tokenizer works fine in the training loop.

We also need to deprecate the old tokenizer approach; that could be in a separate PR. For that, the transition code I added before is at https://github.com/pytorch/torchtitan/blob/main/torchtitan/config_manager.py#L829. We can just update that code when substituting the new tokenizer for the old.
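One possible shape for that transition, sketched under assumptions (the function name, warning text, and exact mapping are illustrative, not the actual config_manager code):

```python
import os
import warnings


def resolve_tokenizer_path(path: str) -> str:
    """Map a deprecated tokenizer.model path to the new directory layout."""
    if path.endswith("tokenizer.model"):
        warnings.warn(
            "Pointing at a tokenizer.model file is deprecated; "
            "point at the downloaded tokenizer directory instead.",
            FutureWarning,
        )
        return os.path.dirname(path)
    return path
```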
```
            added_tokens_to_add.append(added_token)

        # Process added_tokens_decoder (comprehensive special token definitions)
        added_tokens_decoder = self.config.get("added_tokens_decoder", {})
```
A bit surprised by this: since it's already in `tokenizer_config.json`, do we still need to call `self.tokenizer.add_special_tokens(added_tokens_to_add)` to add them? I would assume they could be put directly into `tokenizer.json` or any of the original files...
Most of the time these special tokens are already included in the `tokenizer.json` file. My interpretation of `tokenizer_config.json` is as an overrider: if it specifies that these tokens are special, then we should make sure the tokenizer recognizes that.

According to the API (https://huggingface.co/docs/tokenizers/v0.20.3/en/api/tokenizer#tokenizers.Tokenizer.add_special_tokens), "If these tokens are already part of the vocabulary, it just let the Tokenizer know about them. If they don’t exist, the Tokenizer creates them, giving them a new id." In other words, if they already exist in the vocab, this just makes sure they are marked as "special" and doesn't add anything new; otherwise it grows the vocab and adds them. So I think this is a pretty safe operation.

The "special" marker is needed because it is later used by `skip_special_tokens` in the decode step (https://huggingface.co/docs/tokenizers/v0.20.3/en/api/tokenizer#tokenizers.Tokenizer.decode.skip_special_tokens).
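A minimal sketch of that behavior using the `tokenizers` library (the repo id and token string below are just examples):

```python
from tokenizers import AddedToken, Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")  # example repo id

# If "<|eot|>" is already in the vocab this only marks it as special;
# otherwise it is appended with a fresh id.
tokenizer.add_special_tokens([AddedToken("<|eot|>", special=True)])

ids = tokenizer.encode("hello <|eot|>").ids
print(tokenizer.decode(ids, skip_special_tokens=False))  # keeps "<|eot|>"
print(tokenizer.decode(ids, skip_special_tokens=True))   # drops it
```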
great to know, thanks!
Not sure what's the most reasonable thing to do; I'd suggest we follow the existing file structure for the dataloader:
- keep the base `tokenizer.py` in the `torchtitan/components/` folder
- put `hf_tokenizer.py` into the `torchtitan/datasets` folder
Do you think the `HuggingFaceTokenizer` is general enough to just keep in the `tokenizer.py` file under components? I feel like it is. It only takes a path in its constructor and doesn't really have any model-specific things. Users could extend this tokenizer themselves, or use the `BaseTokenizer` ABC.
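For example, a user-side extension could look like this (a sketch; the import path and the ABC's exact method names are assumptions):

```python
from tokenizers import Tokenizer

from torchtitan.components.tokenizer import BaseTokenizer  # location assumed


class MyTokenizer(BaseTokenizer):
    """Hypothetical custom tokenizer built on the same ABC."""

    def __init__(self, tokenizer_path: str):
        self._tok = Tokenizer.from_file(tokenizer_path)

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._tok.decode(ids)
```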
OK agree, sounds cleaner
Changed the merge target to the main branch.
lgtm -- looks very solid!
please address final comments before merge
```
@@ -8,3 +8,4 @@ tabulate
 wandb
 fsspec
 tyro
+tokenizers >= 0.15.0
```
Can we remove `tiktoken`? Also please update #1364.
Will remove it in the follow-up PR #1333.
```
# Step 3: Load tokenizer using official Tokenizer library (if available)
official_tokenizer = None
try:
    official_tokenizer = Tokenizer.from_pretrained(test_repo_id)
```
Might be a dumb question -- could you remind me of the reasons why we don't just use this method instead of writing our own? I understand that for HF `transformers` it's too big of a dependency, but we need to depend on HF `tokenizers` anyway.
This will download `tokenizer.json` and use it, but it does not apply `tokenizer_config.json`, nor does it handle other types of configurations (e.g. `vocab.json`). Our implementation takes these extra cases into account.
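To make the distinction concrete, a hedged sketch of the wrapper idea (the config keys shown are examples of what a tokenizer_config.json may carry, not an exhaustive list):

```python
import json
import os

from tokenizers import Tokenizer


def load_with_config(tokenizer_dir: str):
    """Load tokenizer.json, then apply settings from tokenizer_config.json."""
    # Tokenizer.from_file / from_pretrained only reads tokenizer.json ...
    tokenizer = Tokenizer.from_file(os.path.join(tokenizer_dir, "tokenizer.json"))

    # ... so the wrapper has to read and apply tokenizer_config.json itself.
    with open(os.path.join(tokenizer_dir, "tokenizer_config.json")) as f:
        config = json.load(f)

    add_bos = bool(config.get("add_bos_token", False))  # e.g. prepend BOS on encode
    bos_token = config.get("bos_token")
    return tokenizer, add_bos, bos_token
```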
Requesting feedback since this will change the tokenizer directory for existing users. We need to add support for multiple tokenizers in torchtitan (we currently only have the one used in llama3).
Overview:

Spiritually you can think of the new `HuggingFaceTokenizer` implementation as an alternative to `AutoTokenizer` from the `transformers` library. OSS models using `AutoTokenizer` download the tokenizer data (tokenizer.json, tokenizer.model, vocab.json, vocab.txt, etc.) and then apply some configuration (for example, in DSv3 the tokenizer always adds a "beginning of sentence" token to the encoding).

Changes:
- `download_tokenizer.py` will download all the tokenizer-related files from the Hugging Face hub
- New `HuggingFaceTokenizer` class, inspired by the torchtune tokenizer: a wrapper around the HF tokenizer that reads a `tokenizer_config.json` file to set attributes of the tokenizer (e.g. special tokens, adding eos_token to the beginning of the encoding)
- `HuggingFaceTokenizer` has the same vocab, encoding, and decoding as those from the pretrained `Tokenizer` and `Transformers` libraries

For context, here is how the tokenizer is used currently:
Current workflow:
1. `scripts/download_tokenizer.py` downloads `assets/tokenizer/original/tokenizer.model`
2. Point to `tokenizer.model` in the `.toml` configs (https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/train_configs/llama3_8b.toml#L21)
3. Load `tokenizer.model` and wrap it with our own `tiktoken` tokenizer (https://github.com/pytorch/torchtitan/blob/main/torchtitan/datasets/tokenizer/tiktoken.py)
tokenizer (https://github.com/pytorch/torchtitan/blob/main/torchtitan/datasets/tokenizer/tiktoken.py)New workflow:
scripts/download_tokenizer.py
assets/tokenizer/<model_name>/
.toml
configsPros:
Cons:
original/tokenizer.model
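Putting the new workflow together, user code might look like this (hypothetical; the import path, constructor, and encode/decode signatures follow the discussion above rather than a published API):

```python
from torchtitan.components.tokenizer import HuggingFaceTokenizer  # location assumed

# Directory produced by scripts/download_tokenizer.py; model name is an example.
tokenizer = HuggingFaceTokenizer("assets/tokenizer/deepseek-v3")

ids = tokenizer.encode("Hello world")
print(tokenizer.decode(ids))
```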