fix(tokenizer): add <eos> in tokenizer and sequences #63

shenxiangzhuang · 2025-11-25T11:49:59Z

No description provided.

codecov · 2025-11-25T11:52:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.30%. Comparing base (aa0526e) to head (623043d).

Additional details and impacted files

@@           Coverage Diff           @@
##           master      #63   +/-   ##
=======================================
  Coverage   95.30%   95.30%           
=======================================
  Files          10       10           
  Lines         618      618           
=======================================
  Hits          589      589           
  Misses         29       29


Copilot

Pull request overview

This PR adds end-of-sequence (<eos>) token support to the GPT tokenizer and ensures it's properly appended to tokenized sequences during dataset processing.

Key Changes:

- Added `<eos>` to the list of special tokens in the GPT configuration
- Modified the text chunking logic to append an `<eos>` token after each document and pad incomplete chunks
- Added comprehensive tests to verify `<eos>` token insertion and padding behavior
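
The chunking behavior described in the list above might look roughly like the following. This is a minimal sketch, not the PR's actual code: the function name `chunk_with_eos`, its parameters, and the token ids are hypothetical, and it assumes all documents are concatenated into one token stream before being split into fixed-size chunks.

```python
from typing import Iterable


def chunk_with_eos(
    docs: Iterable[list[int]],
    chunk_size: int,
    eos_id: int,
    pad_id: int,
) -> list[list[int]]:
    """Hypothetical helper: append eos after each document, chunk, pad the tail."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # Concatenate all documents, marking each document boundary with eos.
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    # Split the stream into consecutive fixed-size chunks.
    chunks = [stream[i : i + chunk_size] for i in range(0, len(stream), chunk_size)]

    # Pad the final chunk if it is shorter than chunk_size.
    if chunks and len(chunks[-1]) < chunk_size:
        chunks[-1] = chunks[-1] + [pad_id] * (chunk_size - len(chunks[-1]))
    return chunks
```

With this sketch, two documents `[1, 2, 3]` and `[4, 5]`, a chunk size of 4, an eos id of 99, and a pad id of 0 would yield `[[1, 2, 3, 99], [4, 5, 99, 0]]`.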

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| toynlp/gpt/config.py | Added `<eos>` to the special_tokens list to register it with the tokenizer |
| toynlp/gpt/dataset.py | Refactored chunking logic to append eos tokens after each text, pad incomplete chunks, and validate required special tokens exist |
| tests/test_gpt_dataset.py | Added new test file with tests verifying eos token insertion and padding behavior for chunked sequences |
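
The special-token validation mentioned for `toynlp/gpt/dataset.py` could be sketched as below. This is an illustration under stated assumptions rather than the repository's implementation: the helper name `require_special_tokens` is hypothetical, the vocabulary is assumed to be a plain mapping from token string to id, and `<pad>` is an assumed name for the padding token.

```python
def require_special_tokens(
    vocab: dict[str, int],
    required: tuple[str, ...] = ("<eos>", "<pad>"),  # "<pad>" is an assumed name
) -> dict[str, int]:
    """Hypothetical check: fail early if a required special token is missing."""
    missing = [tok for tok in required if tok not in vocab]
    if missing:
        raise ValueError(f"tokenizer is missing required special tokens: {missing}")
    # Return the ids of the required tokens for use during chunking.
    return {tok: vocab[tok] for tok in required}
```

Failing before any chunking starts keeps a misconfigured tokenizer from silently producing sequences without document boundaries.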


fix(tokenizer): add <eos> in tokenizer and sequences

05390ee

shenxiangzhuang marked this pull request as draft November 25, 2025 11:50

shenxiangzhuang marked this pull request as ready for review December 8, 2025 05:03

shenxiangzhuang requested a review from Copilot December 8, 2025 05:03

shenxiangzhuang self-assigned this Dec 8, 2025

shenxiangzhuang added the enhancement New feature or request label Dec 8, 2025

Copilot started reviewing on behalf of shenxiangzhuang December 8, 2025 05:03

Copilot AI reviewed Dec 8, 2025

shenxiangzhuang added 2 commits December 8, 2025 13:08

Merge branch 'master' into fix/gpt_tokenize

bcbcbc0

update: new model result

623043d

shenxiangzhuang marked this pull request as draft December 8, 2025 05:12
