Conversation

@shenxiangzhuang
Collaborator

No description provided.

@shenxiangzhuang shenxiangzhuang marked this pull request as draft November 25, 2025 11:50
@codecov
codecov bot commented Nov 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.30%. Comparing base (aa0526e) to head (623043d).

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #63   +/-   ##
=======================================
  Coverage   95.30%   95.30%           
=======================================
  Files          10       10           
  Lines         618      618           
=======================================
  Hits          589      589           
  Misses         29       29           

@shenxiangzhuang shenxiangzhuang marked this pull request as ready for review December 8, 2025 05:03
@shenxiangzhuang shenxiangzhuang self-assigned this Dec 8, 2025
@shenxiangzhuang shenxiangzhuang added the enhancement New feature or request label Dec 8, 2025
Copilot AI left a comment

Pull request overview

This PR adds end-of-sequence (<eos>) token support to the GPT tokenizer and ensures it's properly appended to tokenized sequences during dataset processing.

Key Changes:

  • Added <eos> to the list of special tokens in GPT configuration
  • Modified text chunking logic to append an <eos> token after each document and to pad incomplete chunks
  • Added comprehensive tests to verify eos token insertion and padding behavior
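The chunking behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code in toynlp/gpt/dataset.py: the function name `chunk_with_eos`, its parameters, the use of a separate pad id, and the choice to chunk each document independently are all assumptions made for the example.

```python
from typing import Iterable


def chunk_with_eos(
    docs: Iterable[list[int]],
    eos_id: int,
    pad_id: int,
    chunk_size: int,
) -> list[list[int]]:
    """Append eos_id to each document, then split it into fixed-size chunks.

    Hypothetical helper: names and per-document chunking are assumptions,
    not the PR's actual implementation.
    """
    chunks: list[list[int]] = []
    for token_ids in docs:
        # Mark the end of the document before chunking.
        ids = list(token_ids) + [eos_id]
        for start in range(0, len(ids), chunk_size):
            chunk = ids[start : start + chunk_size]
            # Pad only when the last chunk is shorter than chunk_size.
            chunk += [pad_id] * (chunk_size - len(chunk))
            chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    # Two toy documents, eos id 9, pad id 0, chunks of length 3.
    print(chunk_with_eos([[1, 2, 3], [4]], eos_id=9, pad_id=0, chunk_size=3))
```

Under these assumptions every chunk has exactly `chunk_size` ids, and the `<eos>` id always precedes any padding in the final chunk of a document.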

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| toynlp/gpt/config.py | Added <eos> to the special_tokens list to register it with the tokenizer |
| toynlp/gpt/dataset.py | Refactored chunking logic to append eos tokens after each text, pad incomplete chunks, and validate required special tokens exist |
| tests/test_gpt_dataset.py | Added new test file with tests verifying eos token insertion and padding behavior for chunked sequences |
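The "validate required special tokens exist" step mentioned for toynlp/gpt/dataset.py might look something like the sketch below. The helper name `require_special_tokens`, the plain `dict` vocabulary, and the assumption that the pad token is spelled `<pad>` are illustrative only; the PR's real check may be structured differently.

```python
def require_special_tokens(
    vocab: dict[str, int],
    required: tuple[str, ...] = ("<eos>", "<pad>"),
) -> dict[str, int]:
    """Return the ids of the required special tokens, or fail loudly.

    Hypothetical helper: `vocab` stands in for whatever token-to-id
    mapping the tokenizer exposes, and "<pad>" is an assumed name.
    """
    missing = [token for token in required if token not in vocab]
    if missing:
        raise ValueError(f"tokenizer is missing required special tokens: {missing}")
    return {token: vocab[token] for token in required}


if __name__ == "__main__":
    ids = require_special_tokens({"<pad>": 0, "<eos>": 1, "hello": 2})
    print(ids["<eos>"], ids["<pad>"])
```

Failing early with a clear error keeps a tokenizer trained without `<eos>` from silently producing chunks that lack document boundaries.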

@shenxiangzhuang shenxiangzhuang marked this pull request as draft December 8, 2025 05:12
Labels

enhancement New feature or request

Projects

None yet

2 participants