Polly-LYP/TOSS

[ICLR'26] Token-level Data Selection for Safe LLM Fine-tuning

Official PyTorch implementation of our ICLR 2026 paper:

Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li · Zhening Liu · Zijian Li · Zehong Lin · Jun Zhang

[ArXiv Preprint]

🔥 News

  • March 29, 2026: 🔥 We release our code and checkpoints for TOSS as presented in the paper. Give it a try!

  • Jan 23, 2026: 🌟 Our paper has been accepted by ICLR 2026! 🎉

⭐ Overview

Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods.
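The token-level quantity compared above is the per-token loss each reference model assigns to the fine-tuning data. As a minimal sketch (not the repository's actual implementation), the per-token cross-entropy of a causal LM can be computed from its logits, and the risk score then taken as the loss difference between the two reference models; the sign convention shown in the comment is our assumption:

```python
import torch
import torch.nn.functional as F

def per_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token cross-entropy under a causal LM: the logits at
    position t predict the token at position t+1."""
    shift_logits = logits[:, :-1, :]          # drop last position
    shift_labels = labels[:, 1:]              # drop first token
    losses = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",                     # keep one loss per token
    )
    return losses.view(shift_labels.shape)

# Toy example with random logits (batch=2, seq_len=5, vocab=10).
logits = torch.randn(2, 5, 10)
labels = torch.randint(0, 10, (2, 5))
out = per_token_loss(logits, labels)          # shape (2, 4), non-negative

# A token's risk score would then be (our assumed sign convention):
# risk = per_token_loss(utility_logits, labels) - per_token_loss(safety_logits, labels)
```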

Pipeline

✨Performance Illustration

We illustrate the safety and utility performance of our method below. For more details, please refer to the paper.

performance on safety and utility benchmarks

Visualization of models' output

✨ Quick Start

After cloning the repository, follow these steps to train and run inference.

Requirements

Install dependencies with `pip install -r requirements.txt`, then install the local packages:

```
pip install -e ./train/OpenRLHFBase
pip install -e ./train
```

If README figures are not displayed, place image files under assets/ (e.g., assets/pipeline.png).

Train

  • Train reference model

Run `./scripts/train_safety_model.sh` to train the safety-degraded model.

Run `./scripts/train_utility_model.sh` to train the utility-oriented model.

  • Token selection

Run `./scripts/safety_model_calculate_loss.sh` and `./scripts/utility_model_calculate_loss.sh` to compute the per-token loss files.

Run `./scripts/generate_token_mask.sh` to generate the token masking matrix.
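The mask-generation step above can be sketched as follows. This is an illustration, not the repository's script: we assume the loss files are `.npy` arrays of shape `(num_samples, seq_len)` with matching layouts, and that a fixed fraction of lowest-risk tokens is kept per sample (the file format, function name, and `keep_ratio` parameter are all hypothetical):

```python
import numpy as np

def build_token_mask(safety_loss_file: str, utility_loss_file: str,
                     keep_ratio: float = 0.8) -> np.ndarray:
    """Build a binary token mask from two per-token loss files.

    risk = utility_loss - safety_loss (our assumed sign convention):
    tokens the safety-degraded model fits unusually well score high
    and are masked out; 1 means 'train on this token'."""
    safety = np.load(safety_loss_file)
    utility = np.load(utility_loss_file)
    risk = utility - safety
    # Per-sample threshold: keep the k lowest-risk tokens in each row.
    k = max(1, int(keep_ratio * risk.shape[1]))
    thresholds = np.sort(risk, axis=1)[:, k - 1:k]   # (num_samples, 1)
    return (risk <= thresholds).astype(np.int8)

# Hypothetical usage:
# mask = build_token_mask("safety_losses.npy", "utility_losses.npy")
# np.save("token_mask.npy", mask)
```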

  • Token-level selective training

Run `./scripts/selective_finetuning.sh` to perform token-level selective fine-tuning.
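One common way to realize token-level selective training, shown here as a sketch rather than the repository's exact mechanism, is to set the labels of masked-out tokens to `-100`, which `torch.nn.CrossEntropyLoss` ignores by default, so those tokens contribute no gradient:

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def apply_token_mask(labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Replace labels of masked-out (unsafe) tokens with IGNORE_INDEX
    so they are excluded from the fine-tuning loss.
    `mask` is 1/True for tokens to keep."""
    return labels.masked_fill(~mask.bool(), IGNORE_INDEX)

labels = torch.tensor([[11, 12, 13, 14]])
mask = torch.tensor([[1, 1, 0, 1]])
masked = apply_token_mask(labels, mask)
# masked → tensor([[ 11,  12, -100,  14]])
```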

Evaluation

Our evaluation method follows the protocol of SEAL.

Models

We also provide the models trained with our method on Hugging Face.

| Model | Link |
| --- | --- |
| safety-degraded model | https://huggingface.co/Polly1231/safety_degrading_model |
| utility-oriented model | https://huggingface.co/Polly1231/utility_oriented_model |
| TOSS | https://huggingface.co/Polly1231/TOSS |

🤝 Acknowledgments

We sincerely thank the fantastic open-source projects whose work contributed to this implementation.

✏️ Citation

If you find TOSS useful, please cite our paper:

@article{li2026token,
  title={Token-level Data Selection for Safe LLM Fine-tuning},
  author={Li, Yanping and Liu, Zhening and Li, Zijian and Lin, Zehong and Zhang, Jun},
  journal={arXiv preprint arXiv:2603.01185},
  year={2026}
}

About

[ICLR26] code for paper "Token-level Data Selection for Safe LLM Fine-tuning"
