Conversation

austinleedavis
Implement Append Normalizer

Description

This pull request introduces a new Append normalizer to the HuggingFace Tokenizers library. The Append normalizer adds a specified string to the end of input sequences. Its functionality mirrors the existing Prepend normalizer, except that it appends text rather than prepending it.

Motivation

There are use-cases where appending a token or specific character to the end of token sequences is beneficial, particularly when working with special formatting or language modeling tasks. This addition complements the existing functionality and extends the flexibility of the normalization utilities.

Changes Implemented

  • Created a new struct Append analogous to the existing Prepend.
  • Implemented the normalize method to append text to the end of the input.
  • Added relevant serialization/deserialization logic.
  • Included unit tests demonstrating the correct functionality.
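The intended behavior can be sketched in plain Python. This is illustrative only: the actual change is a Rust struct mirroring the existing Prepend normalizer and operating on the library's NormalizedString type, not this stand-in class.

```python
# Plain-Python sketch of the Append normalizer's behavior (not the
# library's real implementation, which is written in Rust).
class Append:
    def __init__(self, append: str):
        self.append = append

    def normalize_str(self, text: str) -> str:
        # Add the configured string to the end of the input sequence.
        return text + self.append
```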

Testing

Unit tests have been added, verifying:

  • Correct text appending behavior.
  • Serialization and deserialization consistency.

Example Usage

>>> from tokenizers.normalizers import Append
>>> Append(append="▁").normalize_str("test")
'test▁'

Please let me know if there are additional requirements or improvements needed!

@ArthurZucker (Collaborator) left a comment

I don't mind adding a new normalizer, but do you have examples of real use cases? Like papers or blogs or something?

austinleedavis (Author) commented Jun 20, 2025

This normalizer is very useful when used in conjunction with the Strip normalizer. The Strip normalizer removes leading and trailing whitespace, while the (proposed) Append normalizer lets you add any string to the end of the text. When combined, they replace any trailing whitespace in the input with a specific, structured ending.
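As a plain-Python sketch (not the library API), the combined effect of stripping the trailing side and then appending a fixed suffix looks like this; the helper name is hypothetical:

```python
def strip_then_append(text: str, suffix: str) -> str:
    """Sketch of Strip followed by the proposed Append: trailing
    whitespace is removed, then a fixed suffix is added, so every
    input ends with the same structured token."""
    return text.rstrip() + suffix
```

For example, `strip_then_append("test \n", "▁")` yields `'test▁'` regardless of how much trailing whitespace the input carried.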

> I don't mind adding a new normalizer, but do you have examples of real use cases? Like papers or blogs or something?

@ArthurZucker Yes—this feature comes directly from my PhD research. I use the tokenizers library to encode chess game transcripts and train a GPT to play the game. Most chess moves can be represented using exactly two tokens: one for the origin square and one for the destination. However, pawn promotions require a third token (e.g., indicating promotion to queen, rook, etc.), which shifts the GPT’s positional alignment and degrades next-move prediction accuracy for tokens that follow promotions.

To address this, I developed a tokenization scheme that encodes the whitespace following each move as a sort of non-promotion token. The promotion-related token set becomes {" ", "q ", "b ", "r ", "n "}, maintaining positional consistency across moves. This fixes the mid-game misalignment issue, but introduces a new edge case: the final move in a transcript lacks a trailing whitespace token. This issue is resolved by the Append normalizer, since I can add a space at the end of each input sequence.
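The scheme above can be sketched as follows. This is a hypothetical illustration of the tokenization described in the comment, not code from the PR; the helper name and token layout are assumptions:

```python
# Hypothetical sketch of the chess tokenization scheme: every move
# becomes exactly three tokens -- origin square, destination square,
# and a "promotion-or-space" token.
PROMOTION_TOKENS = {" ", "q ", "b ", "r ", "n "}

def tokenize_transcript(transcript: str) -> list[str]:
    # Appending a trailing space (the Append normalizer's role) ensures
    # the final move also carries its promotion-or-space token.
    text = transcript if transcript.endswith(" ") else transcript + " "
    tokens = []
    i = 0
    while i < len(text):
        tokens.append(text[i:i + 2])      # origin square, e.g. "e2"
        tokens.append(text[i + 2:i + 4])  # destination square, e.g. "e4"
        i += 4
        if text[i:i + 1] != " ":
            # promotion letter plus its trailing space, e.g. "q "
            tokens.append(text[i:i + 2])
            i += 2
        else:
            # ordinary move: a lone space keeps positions aligned
            tokens.append(" ")
            i += 1
    return tokens
```

With this layout, `"e2e4 e7e8q"` tokenizes to `["e2", "e4", " ", "e7", "e8", "q "]`: the promotion move occupies the same three slots as any other move.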

Publications for reference:

@ArthurZucker (Collaborator) left a comment


Wow, sorry for the delay! Just missing a .py test and it's good for me.
