Support different tokenizers #1318

Merged: 1 commit into pytorch:main on Jul 2, 2025

Conversation

@H-Huang (Member) commented on Jun 18, 2025

Requesting feedback since this will change the tokenizer directory for existing users. We need to add support for multiple tokenizers in torchtitan (we currently only have the one used in Llama 3).

Overview:

Conceptually, you can think of the new HuggingFaceTokenizer implementation as an alternative to AutoTokenizer from the transformers library. OSS models using AutoTokenizer download the tokenizer data (tokenizer.json, tokenizer.model, vocab.json, vocab.txt, etc.) and then apply some configuration on top (for example, in DSv3 the tokenizer always adds a "beginning of sentence" token to the encoding).

Changes:

  • download_tokenizer.py will download all the tokenizer-related files from the Hugging Face Hub
  • Create a new general HuggingFaceTokenizer class. It is inspired by the torchtune tokenizer: a wrapper around the HF tokenizer that reads tokenizer_config.json to set attributes of the tokenizer (e.g. special tokens, adding a bos/eos token to the encoding) -- see the sketch after this list
  • Added a test to validate that the HuggingFaceTokenizer has the same vocab, encoding, and decoding as the tokenizers loaded from the pretrained tokenizers and transformers libraries
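For reviewers skimming the PR, here is a minimal sketch of the wrapper idea; the class name mirrors the description above, but the structure, method names, and config handling are illustrative, not the exact code in this PR:

```python
# Minimal sketch only -- illustrative of the wrapper pattern, not the PR's actual code.
import json
import os

from tokenizers import Tokenizer


class HuggingFaceTokenizerSketch:
    def __init__(self, tokenizer_dir: str):
        # Load the raw HF tokenizer from the downloaded assets.
        self.tokenizer = Tokenizer.from_file(os.path.join(tokenizer_dir, "tokenizer.json"))

        # tokenizer_config.json overrides/augments the raw tokenizer
        # (special tokens, whether to prepend a BOS token, etc.).
        self.config = {}
        config_path = os.path.join(tokenizer_dir, "tokenizer_config.json")
        if os.path.exists(config_path):
            with open(config_path) as f:
                self.config = json.load(f)

        bos = self.config.get("bos_token")
        # bos_token may be a plain string or a dict like {"content": "<s>", ...}.
        self.bos_token = bos["content"] if isinstance(bos, dict) else bos
        self.add_bos = bool(self.config.get("add_bos_token", False))

        # Existing tokens are only marked "special"; unknown ones get new ids.
        if self.bos_token is not None:
            self.tokenizer.add_special_tokens([self.bos_token])

    def encode(self, text: str) -> list[int]:
        ids = self.tokenizer.encode(text).ids
        if self.add_bos and self.bos_token is not None:
            ids = [self.tokenizer.token_to_id(self.bos_token)] + ids
        return ids

    def decode(self, ids: list[int]) -> str:
        return self.tokenizer.decode(ids, skip_special_tokens=True)
```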

For context, here is how the tokenizer is used currently:

Current workflow:

New workflow:

  • Users call scripts/download_tokenizer.py (roughly sketched below)
  • This saves the tokenizer configs to a directory named assets/tokenizer/<model_name>/
  • Users use the tokenizer by referencing the directory from the previous step in the .toml configs
  • The Hugging Face tokenizers library is used to load the tokenizer
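As a rough illustration of what the download step amounts to (the real scripts/download_tokenizer.py may have a different CLI and file filter; the repo id below is just an example):

```python
# Rough approximation of the download step; the actual script's interface may differ.
from huggingface_hub import snapshot_download

model_name = "deepseek-ai/DeepSeek-V3"  # example repo id

# Fetch only tokenizer-related files into assets/tokenizer/<model_name>/.
snapshot_download(
    repo_id=model_name,
    allow_patterns=[
        "tokenizer*.json",        # tokenizer.json, tokenizer_config.json
        "tokenizer.model",        # sentencepiece-style models
        "vocab.*", "merges.txt",  # BPE-style vocab files
        "special_tokens_map.json",
    ],
    local_dir=f"assets/tokenizer/{model_name}",
)
```

The .toml config then points at that directory instead of a single tokenizer.model file.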

Pros:

  • Supports additional tokenizers beyond the existing preloaded one
  • Can remove tiktoken dependency

Cons:

  • Breaks current users depending on original/tokenizer.model
  • Adds new dependency on HF tokenizer

@H-Huang requested a review from tianyu-l on June 18, 2025 16:20
@facebook-github-bot added the CLA Signed label on Jun 18, 2025
@H-Huang requested a review from wwwjn on June 18, 2025 16:20
@H-Huang changed the title from [WIP] support different tokenizers to Support different tokenizers on Jun 20, 2025
@H-Huang changed the base branch from dsv3-model to deepseek-v3 on June 20, 2025 22:12
@H-Huang changed the base branch from deepseek-v3 to dsv3-model on June 20, 2025 22:13
@H-Huang marked this pull request as ready for review on June 20, 2025 22:14
@tianyu-l (Contributor) left a comment:

I'm still trying to comprehend the tokenizer space ...

Can we substitute https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/__init__.py#L82 with the new function in this PR and get identical results?

I was wondering if the change is general enough, as I saw torchtune has this folder https://github.com/pytorch/torchtune/tree/main/torchtune/modules/transforms/tokenizers

@H-Huang (Member Author) commented on Jun 23, 2025

Hi @tianyu-l, some additional context

> I'm still trying to comprehend the tokenizer space ...

Conceptually, you can think of this HuggingFaceTokenizer implementation as the alternative to AutoTokenizer from the transformers library. The OSS DeepSeek model uses AutoTokenizer, which downloads the tokenizer data (tokenizer.json, tokenizer.model, vocab.json, vocab.txt, etc.) and then applies some configuration (for example, in DSv3 the tokenizer always adds a "beginning of sentence" token to the encoding).

The only difference is that in our implementation we split the steps into 1) download, then 2) build. Functionally, its encodings/decodings should be no different from AutoTokenizer's (rough sketch of the comparison below).
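A rough sketch of that comparison (the HuggingFaceTokenizer import path, constructor, and encode signature are assumed here; the repo id is just an example):

```python
# Sketch of the parity claim; torchtitan import path and signatures are assumed.
from transformers import AutoTokenizer

from torchtitan.components.tokenizer import HuggingFaceTokenizer  # assumed import path

repo_id = "deepseek-ai/DeepSeek-V3"  # example

# One step: transformers downloads the files and applies tokenizer_config.json itself.
auto_tok = AutoTokenizer.from_pretrained(repo_id)
auto_ids = auto_tok.encode("hello world")

# Two steps here: 1) scripts/download_tokenizer.py saved the files locally,
# 2) build the wrapper from that directory.
our_tok = HuggingFaceTokenizer(f"assets/tokenizer/{repo_id}")
our_ids = our_tok.encode("hello world")

assert auto_ids == our_ids  # encodings should match token-for-token
```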

> Can we substitute https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/__init__.py#L82 with the new function in this PR and get identical results?

It should; in my unit test I validate this against DeepSeek-V3. But I was denied access to Llama 3, so I'm trying to unblock myself there to test.

> I was wondering if the change is general enough, as I saw torchtune has this folder https://github.com/pytorch/torchtune/tree/main/torchtune/modules/transforms/tokenizers

I might be missing some things, but the other tokenizers I tested also seemed to work. As long as the models we want to support in torchtitan are well tested, I think this should be okay? When loading the tokenizer files there are multiple strategies that change which tokenizer is loaded based on the files that are available (https://github.com/pytorch/torchtitan/pull/1318/files#diff-87d941cc62c11fe923ef1118a1b2e8319c9126a775b6151fe539e1472d0656a0R92-R136); a simplified sketch of that selection logic is below. If I understand correctly, that is basically what torchtune is doing, albeit maybe they cover more edge cases?
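Roughly, the selection boils down to a priority over whichever files are present; a simplified approximation (not the linked code itself):

```python
# Simplified approximation of the file-based selection; not the PR's exact logic.
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE


def load_underlying_tokenizer(tokenizer_dir: str) -> Tokenizer:
    def has(name: str) -> bool:
        return os.path.exists(os.path.join(tokenizer_dir, name))

    # 1) Preferred: a full fast-tokenizer definition.
    if has("tokenizer.json"):
        return Tokenizer.from_file(os.path.join(tokenizer_dir, "tokenizer.json"))

    # 2) Fallback: raw BPE vocab + merges (GPT-2-style repos).
    if has("vocab.json") and has("merges.txt"):
        bpe = BPE.from_file(
            os.path.join(tokenizer_dir, "vocab.json"),
            os.path.join(tokenizer_dir, "merges.txt"),
        )
        return Tokenizer(bpe)

    # 3) Other layouts (tokenizer.model, vocab.txt, ...) need their own branches.
    raise FileNotFoundError(f"No supported tokenizer files found in {tokenizer_dir}")
```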

@fegin (Contributor) commented on Jun 23, 2025

> It should; in my unit test I validate this against DeepSeek-V3. But I was denied access to Llama 3, so I'm trying to unblock myself there to test.

I could not access the Llama 3 dataset anymore either. I unblocked myself by accessing the dataset internally and putting it into the expected folder. This works with the existing tokenizer. I would expect the new one to have the same behavior (no need to download if one already exists).

@H-Huang force-pushed the pr-1315 branch 2 times, most recently from 4813c49 to b606478 on June 23, 2025 21:51
our_tokenizer, transformers_tokenizer, test_repo_id
)

def test_backward_comptability(self):
@H-Huang (Member Author):

Validating that TikTokenizer (the old implementation) encodes and decodes the same as HuggingFaceTokenizer (the new implementation); a rough sketch of the check is below. Will remove this test after we remove TikTokenizer.
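The check looks roughly like this (constructor arguments, encode/decode signatures, import paths, and asset paths are all assumed for illustration):

```python
# Rough sketch of the backward-compatibility test; signatures and paths are assumed.
import unittest

from torchtitan.components.tokenizer import HuggingFaceTokenizer, TikTokenizer  # assumed


class TestTokenizerBackwardCompat(unittest.TestCase):
    def test_old_and_new_match(self):
        old_tok = TikTokenizer("assets/tokenizer/original/tokenizer.model")
        new_tok = HuggingFaceTokenizer("assets/tokenizer/llama3")  # illustrative directory

        for text in ["hello world", "A longer sentence, with punctuation!"]:
            old_ids = old_tok.encode(text)
            new_ids = new_tok.encode(text)
            # Same token ids and same round-trip text from both implementations.
            self.assertEqual(old_ids, new_ids)
            self.assertEqual(old_tok.decode(old_ids), new_tok.decode(new_ids))
```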

@H-Huang changed the base branch from dsv3-model to deepseek-v3 on June 23, 2025 21:56
@H-Huang force-pushed the pr-1315 branch 3 times, most recently from 4a6aa00 to 7c306ef on June 23, 2025 22:52
@tianyu-l (Contributor) left a comment:

Since this PR is not deepseek-specific, I think we can land it in the main branch, assuming the deepseek tokenizer works fine in the training loop.

We also need to deprecate the old tokenizer approach, which could be in a separate PR.
For that, the transition code I added before is at https://github.com/pytorch/torchtitan/blob/main/torchtitan/config_manager.py#L829
We can just update that, substituting both the old and the new.

added_tokens_to_add.append(added_token)

# Process added_tokens_decoder (comprehensive special token definitions)
added_tokens_decoder = self.config.get("added_tokens_decoder", {})
Contributor:

A bit surprised by this: since it's already in the tokenizer_config.json, do we still need to call self.tokenizer.add_special_tokens(added_tokens_to_add) to add them?
I would assume they could directly put them into tokenizer.json or the other original files...

@H-Huang (Member Author) commented on Jun 24, 2025:

Most of the time these special tokens are already included in the tokenizer.json file.

My interpretation of tokenizer_config.json is as an overrider. So if it specifies that these tokens are special, then we should make sure the tokenizer recognizes that.

According to the API (https://huggingface.co/docs/tokenizers/v0.20.3/en/api/tokenizer#tokenizers.Tokenizer.add_special_tokens), "If these tokens are already part of the vocabulary, it just let the Tokenizer know about them. If they don’t exist, the Tokenizer creates them, giving them a new id." In other words, if they already exist in the vocab, this just makes sure they are marked as "special" and doesn't add anything new. Otherwise, if they do not exist, it grows the vocab and adds them. So I think this is a pretty safe operation; a small illustration is below.

The "special" marker is needed because it is later used by skip_special_tokens in the decode step (https://huggingface.co/docs/tokenizers/v0.20.3/en/api/tokenizer#tokenizers.Tokenizer.decode.skip_special_tokens).

Contributor:

great to know, thanks!

Contributor:

Not sure what's the most reasonable thing to do; I'd suggest we follow the existing file structure for the dataloader:

  • keep the base tokenizer.py in the torchtitan/components/ folder
  • put hf_tokenizer.py into the torchtitan/datasets folder.

@H-Huang (Member Author):

Do you think the HuggingFaceTokenizer is general enough to just keep in the tokenizer.py file under components? I feel like it is. It only takes a path in its constructor and doesn't really have any model-specific things. Users could extend this tokenizer themselves, or use the BaseTokenizer ABC (rough sketch below).
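For illustration, extending it could look roughly like this (the actual BaseTokenizer interface is assumed; only the shape of the idea matters here):

```python
# Rough sketch: a user-defined tokenizer built on an assumed BaseTokenizer ABC.
import os
from abc import ABC, abstractmethod

from tokenizers import Tokenizer


class BaseTokenizer(ABC):
    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, ids: list[int]) -> str: ...


class MyModelTokenizer(BaseTokenizer):
    """Adds model-specific behavior on top of a plain HF tokenizer."""

    def __init__(self, tokenizer_dir: str):
        self._tok = Tokenizer.from_file(os.path.join(tokenizer_dir, "tokenizer.json"))

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._tok.decode(ids)
```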

Contributor:

OK agree, sounds cleaner

@H-Huang changed the base branch from deepseek-v3 to main on June 24, 2025 16:43
@H-Huang (Member Author) commented on Jun 24, 2025

> Since this PR is not deepseek-specific, I think we can land it in the main branch, assuming the deepseek tokenizer works fine in the training loop.

Changed the merge target to the main branch.

@H-Huang requested reviews from tianyu-l, wwwjn, and fegin on June 24, 2025 19:49
@tianyu-l (Contributor) left a comment:

lgtm -- looks very solid!

please address final comments before merge

@@ -8,3 +8,4 @@ tabulate
wandb
fsspec
tyro
tokenizers >= 0.15.0
Contributor:

can we remove tiktoken?
also please update #1364

@H-Huang (Member Author):

Will remove it in the follow up PR #1333

# Step 3: Load tokenizer using official Tokenizer library (if available)
official_tokenizer = None
try:
official_tokenizer = Tokenizer.from_pretrained(test_repo_id)
Contributor:

Might be a dumb question -- could you remind me of the reasons why we don't just use this method but write our own?

I understand that HF transformers is too big a dependency, but we need to depend on HF tokenizers anyway.

@H-Huang (Member Author):

This will download the tokenizer.json and use it, but it does not apply tokenizer_config.json, nor does it handle other file layouts (e.g. vocab.json). Our implementation takes these extra cases into account.

@H-Huang merged commit a04f6bd into pytorch:main on Jul 2, 2025
8 checks passed
Labels: CLA Signed
5 participants