Conversation

austinleedavis
Implement Append Normalizer

Description

This pull request introduces a new Append normalizer to the HuggingFace Tokenizers library. The Append normalizer adds a specified string to the end of input sequences. Its functionality mirrors the existing Prepend normalizer, except that it appends text rather than prepending it.

Motivation

There are use-cases where appending a token or specific character to the end of token sequences is beneficial, particularly when working with special formatting or language modeling tasks. This addition complements the existing functionality and extends the flexibility of the normalization utilities.

Changes Implemented

  • Created a new struct Append analogous to the existing Prepend.
  • Implemented the normalize method to append text to the end of the input.
  • Added relevant serialization/deserialization logic.
  • Included unit tests demonstrating the correct functionality.
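The intended behavior can be sketched in plain Python. This is illustrative only: the actual change is a Rust struct mirroring the existing Prepend normalizer and operating on the library's NormalizedString type, not this stand-in class.

```python
# Plain-Python sketch of the Append normalizer's behavior (not the
# library's real implementation, which is written in Rust).
class Append:
    def __init__(self, append: str):
        self.append = append

    def normalize_str(self, text: str) -> str:
        # Add the configured string to the end of the input sequence.
        return text + self.append
```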

Testing

Unit tests have been added, verifying:

  • Correct text appending behavior.
  • Serialization and deserialization consistency.

Example Usage

>>> from tokenizers.normalizers import Append
>>> Append(append="▁").normalize_str("test")
'test▁'

Please let me know if there are additional requirements or improvements needed!

@ArthurZucker (Collaborator) left a comment

I don't mind adding a new normalizer, but do you have examples of real use cases? Like papers or blogs or something?

austinleedavis (Author) commented Jun 20, 2025

This normalizer is very useful when used in conjunction with the Strip normalizer. The Strip normalizer removes leading and trailing whitespace, while the (proposed) Append normalizer lets you add any string to the end of the text. When combined, they replace any trailing whitespace in the input with a specific, structured ending.
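As a plain-Python sketch (not the library API), the combined effect of stripping the trailing side and then appending a fixed suffix looks like this; the helper name is hypothetical:

```python
def strip_then_append(text: str, suffix: str) -> str:
    """Sketch of Strip followed by the proposed Append: trailing
    whitespace is removed, then a fixed suffix is added, so every
    input ends with the same structured token."""
    return text.rstrip() + suffix
```

For example, `strip_then_append("test \n", "▁")` yields `'test▁'` regardless of how much trailing whitespace the input carried.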

> I don't mind adding a new normalizer, but do you have examples of real use cases? Like papers or blogs or something?

@ArthurZucker Yes—this feature comes directly from my PhD research. I use the tokenizers library to encode chess game transcripts and train a GPT to play the game. Most chess moves can be represented using exactly two tokens: one for the origin square and one for the destination. However, pawn promotions require a third token (e.g., indicating promotion to queen, rook, etc.), which shifts the GPT’s positional alignment and degrades next-move prediction accuracy for tokens that follow promotions.

To address this, I developed a tokenization scheme that encodes the whitespace following each move as a sort of non-promotion token. The promotion-related token set becomes {" ", "q ", "b ", "r ", "n "}, maintaining positional consistency across moves. This fixes the mid-game misalignment issue, but introduces a new edge case: the final move in a transcript lacks a trailing whitespace token. This issue is resolved by the Append normalizer, since I can add a space at the end of each input sequence.
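The scheme above can be sketched as follows. This is a hypothetical illustration of the tokenization described in the comment, not code from the PR; the helper name and token layout are assumptions:

```python
# Hypothetical sketch of the chess tokenization scheme: every move
# becomes exactly three tokens -- origin square, destination square,
# and a "promotion-or-space" token.
PROMOTION_TOKENS = {" ", "q ", "b ", "r ", "n "}

def tokenize_transcript(transcript: str) -> list[str]:
    # Appending a trailing space (the Append normalizer's role) ensures
    # the final move also carries its promotion-or-space token.
    text = transcript if transcript.endswith(" ") else transcript + " "
    tokens = []
    i = 0
    while i < len(text):
        tokens.append(text[i:i + 2])      # origin square, e.g. "e2"
        tokens.append(text[i + 2:i + 4])  # destination square, e.g. "e4"
        i += 4
        if text[i:i + 1] != " ":
            # promotion letter plus its trailing space, e.g. "q "
            tokens.append(text[i:i + 2])
            i += 2
        else:
            # ordinary move: a lone space keeps positions aligned
            tokens.append(" ")
            i += 1
    return tokens
```

With this layout, `"e2e4 e7e8q"` tokenizes to `["e2", "e4", " ", "e7", "e8", "q "]`: the promotion move occupies the same three slots as any other move.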

Publications for reference:

@ArthurZucker (Collaborator) left a comment


Wow, sorry for the delay! Just missing a .py test and it's good for me.
