Refactor Tokenizer -> BaseTokenizer #1333
Conversation
Force-pushed from 240595c to 76476bc
Force-pushed from 76476bc to 8876a97
Looks great in general. Left some comments.
@@ -2,7 +2,6 @@ torchdata >= 0.8.0
 datasets >= 3.6.0
 tomli >= 1.1.0 ; python_version < "3.11"
 tensorboard
-tiktoken
please update https://github.com/pytorch/torchtitan/blob/main/pyproject.toml as well
@@ -3,3 +3,4 @@ pytest==7.3.2
 pytest-cov
 pre-commit
 tomli-w >= 1.1.0
+transformers
Is this for running the unit tests?
@@ -14,7 +14,7 @@ We actively welcome your pull requests.
 2. If you've added code that should be tested, add tests.
 3. If you've changed APIs, update the documentation.
 4. Ensure the test suite passes.
-5. Make sure your code lints (`pre-commit run --all-files`).
+5. Make sure your code lints (`pre-commit run --from-ref origin/main --to-ref HEAD`).
IIUC this restricts linting to the changes between the current main and the latest commit. Can I ask why?
What's the source of the files under tests/assets/tokenizer? Asking because I'm not sure about the legal side of things.
elif os.path.exists(vocab_json_path) or os.path.exists(vocab_txt_path):
    # Load vocabulary
    if os.path.exists(vocab_json_path):
        print("Loading vocabulary from vocab.json")
Let's use `logger.info` instead of `print`.
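A minimal sketch of the suggested change, using Python's standard logging module as a stand-in for the repository's own logger (whose import path is not shown in this hunk); the load_vocab wrapper is hypothetical, only to make the excerpt self-contained:

import logging
import os

logger = logging.getLogger(__name__)


def load_vocab(vocab_json_path: str, vocab_txt_path: str) -> None:
    # Hypothetical wrapper around the hunk above, only to show the logger usage.
    if os.path.exists(vocab_json_path) or os.path.exists(vocab_txt_path):
        # Load vocabulary
        if os.path.exists(vocab_json_path):
            logger.info("Loading vocabulary from vocab.json")  # was: print(...)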
            ) from e
        # Strategy 2b: Use WordLevel if no merges.txt
        else:
            print(f"Loading WordLevel tokenizer from {vocab_source}")
Last one: same here, use `logger.info` instead of `print`.
@@ -17,7 +17,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama4"
 flavor = "17bx128e"
-tokenizer_path = "./assets/tokenizer/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"
The llama4 tokenizer is different from the llama3 one.
See https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/README.md?plain=1#L14
Please also update the README.md
@@ -17,7 +17,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama4"
 flavor = "17bx16e"
-tokenizer_path = "./assets/tokenizer/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"
Same as above.
We probably should remove / refactor this file too.
But it's out of the scope of this PR.
Let's add a TODO.
@@ -18,7 +18,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama3"
 flavor = "405B"
-tokenizer_path = "./assets/tokenizer/original/tokenizer.model"
+tokenizer_path = "./assets/tokenizer/meta-llama/Llama-3.1-8B"
How come it's different among the three llama3 configs?
This causes breaking changes and users will need to redownload the tokenizer files (`python scripts/download_tokenizer.py ...`):
- Remove the `tiktoken` dependency and `tiktoken.py`
- Refactor the `Tokenizer` base class (now `BaseTokenizer`)
- `tokenizer.model` is no longer used for `tokenizer_path`
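Since the base-class refactor is the core of this PR, here is a minimal sketch of what an abstract tokenizer interface of this kind typically looks like; the method names, signatures, and docstrings below are illustrative assumptions, not code taken from the PR diff:

from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    """Illustrative abstract tokenizer interface; assumed, not copied from the PR."""

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Convert a string into a list of token ids."""

    @abstractmethod
    def decode(self, token_ids: list[int]) -> str:
        """Convert a list of token ids back into a string."""

    @property
    @abstractmethod
    def n_words(self) -> int:
        """Vocabulary size, including special tokens."""

Concrete tokenizers (for example, a HuggingFace tokenizers-backed implementation) would subclass this and fill in the three members.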