Support different tokenizers #1318

Merged: 1 commit into pytorch:main on Jul 2, 2025

Conversation

@H-Huang (Member) commented on Jun 18, 2025

Requesting feedback since this will change the tokenizer directory for existing users. We need to add support for multiple tokenizers in torchtitan (we currently only have the one used in Llama 3).

Overview:

Conceptually, you can think of the new HuggingFaceTokenizer implementation as an alternative to AutoTokenizer from the transformers library. OSS models using AutoTokenizer download the tokenizer data (tokenizer.json, tokenizer.model, vocab.json, vocab.txt, etc.) and then apply some configuration on top (for example, in DSv3 the tokenizer always adds a "beginning of sentence" token to the encoding).

Changes:

  • download_tokenizer.py will download all the tokenizer-related files from the Hugging Face Hub
  • Create a new general HuggingFaceTokenizer class. It is inspired by the torchtune tokenizer: a wrapper around the HF tokenizer that reads tokenizer_config.json to set attributes of the tokenizer (e.g. special tokens, adding a bos/eos token to the encoding) -- see the sketch after this list
  • Added a test to validate that the HuggingFaceTokenizer has the same vocab, encoding, and decoding as the tokenizers loaded from the pretrained tokenizers and transformers libraries
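For reviewers skimming the PR, here is a minimal sketch of the wrapper idea; the class name mirrors the description above, but the structure, method names, and config handling are illustrative, not the exact code in this PR:

```python
# Minimal sketch only -- illustrative of the wrapper pattern, not the PR's actual code.
import json
import os

from tokenizers import Tokenizer


class HuggingFaceTokenizerSketch:
    def __init__(self, tokenizer_dir: str):
        # Load the raw HF tokenizer from the downloaded assets.
        self.tokenizer = Tokenizer.from_file(os.path.join(tokenizer_dir, "tokenizer.json"))

        # tokenizer_config.json overrides/augments the raw tokenizer
        # (special tokens, whether to prepend a BOS token, etc.).
        self.config = {}
        config_path = os.path.join(tokenizer_dir, "tokenizer_config.json")
        if os.path.exists(config_path):
            with open(config_path) as f:
                self.config = json.load(f)

        bos = self.config.get("bos_token")
        # bos_token may be a plain string or a dict like {"content": "<s>", ...}.
        self.bos_token = bos["content"] if isinstance(bos, dict) else bos
        self.add_bos = bool(self.config.get("add_bos_token", False))

        # Existing tokens are only marked "special"; unknown ones get new ids.
        if self.bos_token is not None:
            self.tokenizer.add_special_tokens([self.bos_token])

    def encode(self, text: str) -> list[int]:
        ids = self.tokenizer.encode(text).ids
        if self.add_bos and self.bos_token is not None:
            ids = [self.tokenizer.token_to_id(self.bos_token)] + ids
        return ids

    def decode(self, ids: list[int]) -> str:
        return self.tokenizer.decode(ids, skip_special_tokens=True)
```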

For context, here is how the tokenizer is used currently:

Current workflow:

New workflow:

  • Users call scripts/download_tokenizer.py (roughly sketched below)
  • This saves the tokenizer configs to a directory named assets/tokenizer/<model_name>/
  • Users use the tokenizer by referencing the directory from the previous step in the .toml configs
  • The Hugging Face tokenizers library is used to load the tokenizer
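As a rough illustration of what the download step amounts to (the real scripts/download_tokenizer.py may have a different CLI and file filter; the repo id below is just an example):

```python
# Rough approximation of the download step; the actual script's interface may differ.
from huggingface_hub import snapshot_download

model_name = "deepseek-ai/DeepSeek-V3"  # example repo id

# Fetch only tokenizer-related files into assets/tokenizer/<model_name>/.
snapshot_download(
    repo_id=model_name,
    allow_patterns=[
        "tokenizer*.json",        # tokenizer.json, tokenizer_config.json
        "tokenizer.model",        # sentencepiece-style models
        "vocab.*", "merges.txt",  # BPE-style vocab files
        "special_tokens_map.json",
    ],
    local_dir=f"assets/tokenizer/{model_name}",
)
```

The .toml config then points at that directory instead of a single tokenizer.model file.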

Pros:

  • Supports additional tokenizers beyond the existing preloaded one
  • Can remove tiktoken dependency

Cons:

  • Breaks current users depending on original/tokenizer.model
  • Adds new dependency on HF tokenizer

@H-Huang requested a review from tianyu-l on June 18, 2025 16:20
@facebook-github-bot added the CLA Signed label on Jun 18, 2025
@H-Huang requested a review from wwwjn on June 18, 2025 16:20
@H-Huang changed the title from [WIP] support different tokenizers to Support different tokenizers on Jun 20, 2025
@H-Huang changed the base branch from dsv3-model to deepseek-v3 on June 20, 2025 22:12
@H-Huang changed the base branch from deepseek-v3 to dsv3-model on June 20, 2025 22:13
@H-Huang marked this pull request as ready for review on June 20, 2025 22:14
@tianyu-l (Contributor) left a comment:

I'm still trying to comprehend the tokenizer space ...

Can we substitute https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/__init__.py#L82 with the new function in this PR and get identical results?

I was wondering if the change is general enough, as I saw torchtune has this folder https://github.com/pytorch/torchtune/tree/main/torchtune/modules/transforms/tokenizers

@H-Huang (Member Author) commented on Jun 23, 2025

Hi @tianyu-l, some additional context

> I'm still trying to comprehend the tokenizer space ...

Conceptually, you can think of this HuggingFaceTokenizer implementation as the alternative to AutoTokenizer from the transformers library. The OSS DeepSeek model uses AutoTokenizer, which downloads the tokenizer data (tokenizer.json, tokenizer.model, vocab.json, vocab.txt, etc.) and then applies some configuration (for example, in DSv3 the tokenizer always adds a "beginning of sentence" token to the encoding).

The only difference is that in our implementation we split the steps into 1) download, then 2) build. Functionally, its encodings/decodings should be no different from AutoTokenizer's (rough sketch of the comparison below).
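A rough sketch of that comparison (the HuggingFaceTokenizer import path, constructor, and encode signature are assumed here; the repo id is just an example):

```python
# Sketch of the parity claim; torchtitan import path and signatures are assumed.
from transformers import AutoTokenizer

from torchtitan.components.tokenizer import HuggingFaceTokenizer  # assumed import path

repo_id = "deepseek-ai/DeepSeek-V3"  # example

# One step: transformers downloads the files and applies tokenizer_config.json itself.
auto_tok = AutoTokenizer.from_pretrained(repo_id)
auto_ids = auto_tok.encode("hello world")

# Two steps here: 1) scripts/download_tokenizer.py saved the files locally,
# 2) build the wrapper from that directory.
our_tok = HuggingFaceTokenizer(f"assets/tokenizer/{repo_id}")
our_ids = our_tok.encode("hello world")

assert auto_ids == our_ids  # encodings should match token-for-token
```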

> Can we substitute https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/__init__.py#L82 with the new function in this PR and get identical results?

It should; in my unit test I validate this against DeepSeek-V3. But I was denied access to Llama 3, so I'm trying to unblock myself there to test.

> I was wondering if the change is general enough, as I saw torchtune has this folder https://github.com/pytorch/torchtune/tree/main/torchtune/modules/transforms/tokenizers

I might be missing some things, but the other tokenizers I tested also seemed to work. As long as the models we want to support in torchtitan are well tested, I think this should be okay? When loading the tokenizer files there are multiple strategies that change which tokenizer is loaded based on the files that are available (https://github.com/pytorch/torchtitan/pull/1318/files#diff-87d941cc62c11fe923ef1118a1b2e8319c9126a775b6151fe539e1472d0656a0R92-R136); a simplified sketch of that selection logic is below. If I understand correctly, that is basically what torchtune is doing, albeit maybe they cover more edge cases?
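Roughly, the selection boils down to a priority over whichever files are present; a simplified approximation (not the linked code itself):

```python
# Simplified approximation of the file-based selection; not the PR's exact logic.
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE


def load_underlying_tokenizer(tokenizer_dir: str) -> Tokenizer:
    def has(name: str) -> bool:
        return os.path.exists(os.path.join(tokenizer_dir, name))

    # 1) Preferred: a full fast-tokenizer definition.
    if has("tokenizer.json"):
        return Tokenizer.from_file(os.path.join(tokenizer_dir, "tokenizer.json"))

    # 2) Fallback: raw BPE vocab + merges (GPT-2-style repos).
    if has("vocab.json") and has("merges.txt"):
        bpe = BPE.from_file(
            os.path.join(tokenizer_dir, "vocab.json"),
            os.path.join(tokenizer_dir, "merges.txt"),
        )
        return Tokenizer(bpe)

    # 3) Other layouts (tokenizer.model, vocab.txt, ...) need their own branches.
    raise FileNotFoundError(f"No supported tokenizer files found in {tokenizer_dir}")
```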

@fegin (Contributor) commented on Jun 23, 2025

> It should; in my unit test I validate this against DeepSeek-V3. But I was denied access to Llama 3, so I'm trying to unblock myself there to test.

I could not access the Llama 3 dataset anymore either. I unblocked myself by accessing the dataset internally and putting it into the expected folder. This works with the existing tokenizer. I would expect the new one to have the same behavior (no need to download if one already exists).

@H-Huang force-pushed the pr-1315 branch 2 times, most recently from 4813c49 to b606478 on June 23, 2025 21:51
our_tokenizer, transformers_tokenizer, test_repo_id
)

def test_backward_comptability(self):
@H-Huang (Member Author):

Validating that TikTokenizer (the old implementation) encodes and decodes the same as HuggingFaceTokenizer (the new implementation); a rough sketch of the check is below. Will remove this test after we remove TikTokenizer.
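The check looks roughly like this (constructor arguments, encode/decode signatures, import paths, and asset paths are all assumed for illustration):

```python
# Rough sketch of the backward-compatibility test; signatures and paths are assumed.
import unittest

from torchtitan.components.tokenizer import HuggingFaceTokenizer, TikTokenizer  # assumed


class TestTokenizerBackwardCompat(unittest.TestCase):
    def test_old_and_new_match(self):
        old_tok = TikTokenizer("assets/tokenizer/original/tokenizer.model")
        new_tok = HuggingFaceTokenizer("assets/tokenizer/llama3")  # illustrative directory

        for text in ["hello world", "A longer sentence, with punctuation!"]:
            old_ids = old_tok.encode(text)
            new_ids = new_tok.encode(text)
            # Same token ids and same round-trip text from both implementations.
            self.assertEqual(old_ids, new_ids)
            self.assertEqual(old_tok.decode(old_ids), new_tok.decode(new_ids))
```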

@H-Huang changed the base branch from dsv3-model to deepseek-v3 on June 23, 2025 21:56
@H-Huang force-pushed the pr-1315 branch 3 times, most recently from 4a6aa00 to 7c306ef on June 23, 2025 22:52
@tianyu-l (Contributor) left a comment:

Since this PR is not deepseek-specific, I think we can land it in the main branch, assuming the deepseek tokenizer works fine in the training loop.

We also need to deprecate the old tokenizer approach, which could be in a separate PR.
For that, the transition code I added before is at https://github.com/pytorch/torchtitan/blob/main/torchtitan/config_manager.py#L829
We can just update that, substituting both the old and the new.

added_tokens_to_add.append(added_token)

# Process added_tokens_decoder (comprehensive special token definitions)
added_tokens_decoder = self.config.get("added_tokens_decoder", {})
Contributor:

A bit surprised by this: since it's already in the tokenizer_config.json, do we still need to call self.tokenizer.add_special_tokens(added_tokens_to_add) to add them?
I would assume they could directly put them into tokenizer.json or the other original files...

@H-Huang (Member Author) commented on Jun 24, 2025:

Most of the time these special tokens are already included in the tokenizer.json file.

My interpretation of tokenizer_config.json is as an overrider. So if it specifies that these tokens are special, then we should make sure the tokenizer recognizes that.

According to the API (https://huggingface.co/docs/tokenizers/v0.20.3/en/api/tokenizer#tokenizers.Tokenizer.add_special_tokens), "If these tokens are already part of the vocabulary, it just let the Tokenizer know about them. If they don’t exist, the Tokenizer creates them, giving them a new id." In other words, if they already exist in the vocab, this just makes sure they are marked as "special" and doesn't add anything new. Otherwise, if they do not exist, it grows the vocab and adds them. So I think this is a pretty safe operation; a small illustration is below.

The "special" marker is needed because it is later used by skip_special_tokens in the decode step (https://huggingface.co/docs/tokenizers/v0.20.3/en/api/tokenizer#tokenizers.Tokenizer.decode.skip_special_tokens).

Contributor:

great to know, thanks!

Contributor:

Not sure what's the most reasonable thing to do; I'd suggest we follow the existing file structure for the dataloader:

  • keep the base tokenizer.py in the torchtitan/components/ folder
  • put hf_tokenizer.py into the torchtitan/datasets folder.

@H-Huang (Member Author):

Do you think the HuggingFaceTokenizer is general enough to just keep in the tokenizer.py file under components? I feel like it is. It only takes a path in its constructor and doesn't really have any model-specific things. Users could extend this tokenizer themselves, or use the BaseTokenizer ABC (rough sketch below).
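For illustration, extending it could look roughly like this (the actual BaseTokenizer interface is assumed; only the shape of the idea matters here):

```python
# Rough sketch: a user-defined tokenizer built on an assumed BaseTokenizer ABC.
import os
from abc import ABC, abstractmethod

from tokenizers import Tokenizer


class BaseTokenizer(ABC):
    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, ids: list[int]) -> str: ...


class MyModelTokenizer(BaseTokenizer):
    """Adds model-specific behavior on top of a plain HF tokenizer."""

    def __init__(self, tokenizer_dir: str):
        self._tok = Tokenizer.from_file(os.path.join(tokenizer_dir, "tokenizer.json"))

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._tok.decode(ids)
```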

Contributor:

OK agree, sounds cleaner

@H-Huang changed the base branch from deepseek-v3 to main on June 24, 2025 16:43
@H-Huang (Member Author) commented on Jun 24, 2025

> Since this PR is not deepseek-specific, I think we can land it in the main branch, assuming the deepseek tokenizer works fine in the training loop.

Changed the merge target to the main branch.

@H-Huang requested reviews from tianyu-l, wwwjn, and fegin on June 24, 2025 19:49
@tianyu-l (Contributor) left a comment:

lgtm -- looks very solid!

please address final comments before merge

@@ -8,3 +8,4 @@ tabulate
wandb
fsspec
tyro
tokenizers >= 0.15.0
Contributor:

can we remove tiktoken?
also please update #1364

@H-Huang (Member Author):

Will remove it in the follow up PR #1333

# Step 3: Load tokenizer using official Tokenizer library (if available)
official_tokenizer = None
try:
official_tokenizer = Tokenizer.from_pretrained(test_repo_id)
Contributor:

Might be a dumb question -- could you remind me of the reasons why we don't just use this method but write our own?

I understand that HF transformers is too big a dependency, but we need to depend on HF tokenizers anyway.

@H-Huang (Member Author):

This will download the tokenizer.json and use it, but it does not apply tokenizer_config.json, nor does it handle other file layouts (e.g. vocab.json). Our implementation takes these extra cases into account.

@H-Huang merged commit a04f6bd into pytorch:main on Jul 2, 2025
8 checks passed
Labels: CLA Signed
5 participants