Refactor Tokenizer -> BaseTokenizer #1333

Open: wants to merge 1 commit into main from tokenizer_changes

Conversation

@H-Huang (Member) commented on Jun 24, 2025

This introduces breaking changes: users will need to re-download the tokenizer files (`python scripts/download_tokenizer.py ...`).

  • Remove the tiktoken dependency and delete tiktoken.py
  • Refactor the Tokenizer base class into BaseTokenizer (see the sketch after this list)
  • Update config files to point to the tokenizer directory instead of tokenizer.model
  • Raise an exception if tokenizer_path still points to tokenizer.model
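
For illustration, a minimal sketch of what the refactored interface and the new path check might look like. All names, signatures, and error messages below are hypothetical; the actual definitions live in this PR.

```python
import os
from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    """Hypothetical sketch of the refactored base class."""

    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, token_ids: list[int]) -> str: ...


def validate_tokenizer_path(tokenizer_path: str) -> None:
    # Hypothetical helper matching the last bullet: reject the old
    # single-file layout so users re-download the tokenizer directory.
    if tokenizer_path.endswith("tokenizer.model"):
        raise ValueError(
            "tokenizer_path should point to a tokenizer directory, not "
            "tokenizer.model; re-download with scripts/download_tokenizer.py"
        )
    if not os.path.isdir(tokenizer_path):
        raise FileNotFoundError(f"tokenizer directory not found: {tokenizer_path}")
```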

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 24, 2025
@H-Huang force-pushed the tokenizer_changes branch 2 times, most recently from 240595c to 76476bc on July 2, 2025 18:33
@H-Huang force-pushed the tokenizer_changes branch from 76476bc to 8876a97 on July 3, 2025 15:18
@H-Huang changed the title from [WIP] Refactor Tokenizer -> BaseTokenizer to Refactor Tokenizer -> BaseTokenizer on Jul 3, 2025
@H-Huang marked this pull request as ready for review on July 3, 2025 16:20
@tianyu-l (Contributor) left a comment:

Looks great in general. Left some comments.

@@ -2,7 +2,6 @@ torchdata >= 0.8.0
 datasets >= 3.6.0
 tomli >= 1.1.0 ; python_version < "3.11"
 tensorboard
-tiktoken

@@ -3,3 +3,4 @@ pytest==7.3.2
 pytest-cov
 pre-commit
 tomli-w >= 1.1.0
+transformers

Contributor: is this for running the unit tests?

@@ -14,7 +14,7 @@ We actively welcome your pull requests.
 2. If you've added code that should be tested, add tests.
 3. If you've changed APIs, update the documentation.
 4. Ensure the test suite passes.
-5. Make sure your code lints (`pre-commit run --all-files`).
+5. Make sure your code lints (`pre-commit run --from-ref origin/main --to-ref HEAD`).

Contributor: IIUC this restricts linting to the changes between current main and the latest commit. Can I ask why?

Contributor: What's the source of the files under tests/assets/tokenizer? Asking because I'm not sure about the legal side of things.

elif os.path.exists(vocab_json_path) or os.path.exists(vocab_txt_path):
    # Load vocabulary
    if os.path.exists(vocab_json_path):
        print("Loading vocabulary from vocab.json")

Contributor: let's use logger.info instead of print
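
For reference, the suggested change might look like the following; the logger import path is an assumption, not taken from this diff.

```python
# Assumed import path for torchtitan's shared logger; the exact
# module may differ in the repo.
from torchtitan.tools.logging import logger

logger.info("Loading vocabulary from vocab.json")
```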

        ) from e
    # Strategy 2b: Use WordLevel if no merges.txt
    else:
        print(f"Loading WordLevel tokenizer from {vocab_source}")

Contributor: last one
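
As a rough illustration of the "Strategy 2b" fallback named in the snippet above, assuming the Hugging Face tokenizers library is in play; the file name and unk token are placeholders.

```python
import json

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build a WordLevel tokenizer from a bare vocabulary when no
# merges.txt is available for BPE.
with open("vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
print(tokenizer.encode("hello world").tokens)
```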

@@ -17,7 +17,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama4"
 flavor = "17bx128e"
-tokenizer_path = "./assets/tokenizer/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"

Contributor: llama4 tokenizer is different from llama3.
See https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/README.md?plain=1#L14
Please also update the README.md.
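
For context, fetching a tokenizer directory from the Hub amounts to something like the sketch below, using huggingface_hub directly rather than the repo's download script; the repo_id and local_dir shown are the llama3 values from the config above and would differ for llama4.

```python
from huggingface_hub import snapshot_download

# Fetch only the tokenizer files for the given model repo.
# Gated repos (e.g. meta-llama) also require an HF access token.
local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B",
    allow_patterns=["tokenizer*", "*.json"],
    local_dir="./assets/tokenizer/Llama-3.1-8B",
)
```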

@@ -17,7 +17,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama4"
 flavor = "17bx16e"
-tokenizer_path = "./assets/tokenizer/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"

Contributor: same


Contributor: We probably should remove / refactor this file too, but that's out of scope for this PR. Let's add a TODO.

@@ -18,7 +18,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama3"
 flavor = "405B"
-tokenizer_path = "./assets/tokenizer/original/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/meta-llama/Llama-3.1-8B"

Contributor: How come it's different among the three llama3 configs?
