Refactor Tokenizer -> BaseTokenizer #1333

Open: wants to merge 1 commit into main from tokenizer_changes

Conversation

@H-Huang (Member) commented on Jun 24, 2025

This introduces breaking changes: users will need to re-download the tokenizer files (`python scripts/download_tokenizer.py ...`).

  • Remove the tiktoken dependency and delete tiktoken.py
  • Refactor the Tokenizer base class into BaseTokenizer (see the sketch after this list)
  • Update config files to point to the tokenizer directory instead of tokenizer.model
  • Raise an exception if tokenizer_path still points to tokenizer.model
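
For illustration, a minimal sketch of what the refactored interface and the new path check might look like. All names, signatures, and error messages below are hypothetical; the actual definitions live in this PR.

```python
import os
from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    """Hypothetical sketch of the refactored base class."""

    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, token_ids: list[int]) -> str: ...


def validate_tokenizer_path(tokenizer_path: str) -> None:
    # Hypothetical helper matching the last bullet: reject the old
    # single-file layout so users re-download the tokenizer directory.
    if tokenizer_path.endswith("tokenizer.model"):
        raise ValueError(
            "tokenizer_path should point to a tokenizer directory, not "
            "tokenizer.model; re-download with scripts/download_tokenizer.py"
        )
    if not os.path.isdir(tokenizer_path):
        raise FileNotFoundError(f"tokenizer directory not found: {tokenizer_path}")
```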

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 24, 2025
@H-Huang force-pushed the tokenizer_changes branch 2 times, most recently from 240595c to 76476bc on July 2, 2025 18:33
@H-Huang force-pushed the tokenizer_changes branch from 76476bc to 8876a97 on July 3, 2025 15:18
@H-Huang changed the title from [WIP] Refactor Tokenizer -> BaseTokenizer to Refactor Tokenizer -> BaseTokenizer on Jul 3, 2025
@H-Huang marked this pull request as ready for review on July 3, 2025 16:20
@tianyu-l (Contributor) left a comment:

Looks great in general. Left some comments.

@@ -2,7 +2,6 @@ torchdata >= 0.8.0
 datasets >= 3.6.0
 tomli >= 1.1.0 ; python_version < "3.11"
 tensorboard
-tiktoken

@@ -3,3 +3,4 @@ pytest==7.3.2
 pytest-cov
 pre-commit
 tomli-w >= 1.1.0
+transformers

Contributor: is this for running the unit tests?

@@ -14,7 +14,7 @@ We actively welcome your pull requests.
 2. If you've added code that should be tested, add tests.
 3. If you've changed APIs, update the documentation.
 4. Ensure the test suite passes.
-5. Make sure your code lints (`pre-commit run --all-files`).
+5. Make sure your code lints (`pre-commit run --from-ref origin/main --to-ref HEAD`).

Contributor: IIUC this restricts linting to the changes between current main and the latest commit. Can I ask why?

Contributor: What's the source of the files under tests/assets/tokenizer? Asking because I'm not sure about the legal side of things.

elif os.path.exists(vocab_json_path) or os.path.exists(vocab_txt_path):
    # Load vocabulary
    if os.path.exists(vocab_json_path):
        print("Loading vocabulary from vocab.json")

Contributor: let's use logger.info instead of print
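
For reference, the suggested change might look like the following; the logger import path is an assumption, not taken from this diff.

```python
# Assumed import path for torchtitan's shared logger; the exact
# module may differ in the repo.
from torchtitan.tools.logging import logger

logger.info("Loading vocabulary from vocab.json")
```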

        ) from e
    # Strategy 2b: Use WordLevel if no merges.txt
    else:
        print(f"Loading WordLevel tokenizer from {vocab_source}")

Contributor: last one
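
As a rough illustration of the "Strategy 2b" fallback named in the snippet above, assuming the Hugging Face tokenizers library is in play; the file name and unk token are placeholders.

```python
import json

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build a WordLevel tokenizer from a bare vocabulary when no
# merges.txt is available for BPE.
with open("vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
print(tokenizer.encode("hello world").tokens)
```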

@@ -17,7 +17,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama4"
 flavor = "17bx128e"
-tokenizer_path = "./assets/tokenizer/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"

Contributor: llama4 tokenizer is different from llama3.
See https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/README.md?plain=1#L14
Please also update the README.md.
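
For context, fetching a tokenizer directory from the Hub amounts to something like the sketch below, using huggingface_hub directly rather than the repo's download script; the repo_id and local_dir shown are the llama3 values from the config above and would differ for llama4.

```python
from huggingface_hub import snapshot_download

# Fetch only the tokenizer files for the given model repo.
# Gated repos (e.g. meta-llama) also require an HF access token.
local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B",
    allow_patterns=["tokenizer*", "*.json"],
    local_dir="./assets/tokenizer/Llama-3.1-8B",
)
```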

@@ -17,7 +17,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama4"
 flavor = "17bx16e"
-tokenizer_path = "./assets/tokenizer/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"

Contributor: same


Contributor: We probably should remove / refactor this file too, but that's out of scope for this PR. Let's add a TODO.

@@ -18,7 +18,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama3"
 flavor = "405B"
-tokenizer_path = "./assets/tokenizer/original/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/meta-llama/Llama-3.1-8B"

Contributor: How come it's different among the three llama3 configs?
