Polly-LYP/TOSS

[ICLR'26] Token-level Data Selection for Safe LLM Fine-tuning

Official PyTorch implementation of our ICLR 2026 paper:

Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li · Zhening Liu · Zijian Li · Zehong Lin · Jun Zhang

[ArXiv Preprint]

🔥 News

  • March 29, 2026: 🔥 We release our code and checkpoints for TOSS as presented in the paper. Give it a try!

  • Jan 23, 2026: 🌟 Our paper has been accepted by ICLR 2026! 🎉

⭐ Overview

Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods.
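The token-level quantity compared above is the per-token loss each reference model assigns to the fine-tuning data. As a minimal sketch (not the repository's actual implementation), the per-token cross-entropy of a causal LM can be computed from its logits, and the risk score then taken as the loss difference between the two reference models; the sign convention shown in the comment is our assumption:

```python
import torch
import torch.nn.functional as F

def per_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token cross-entropy under a causal LM: the logits at
    position t predict the token at position t+1."""
    shift_logits = logits[:, :-1, :]          # drop last position
    shift_labels = labels[:, 1:]              # drop first token
    losses = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",                     # keep one loss per token
    )
    return losses.view(shift_labels.shape)

# Toy example with random logits (batch=2, seq_len=5, vocab=10).
logits = torch.randn(2, 5, 10)
labels = torch.randint(0, 10, (2, 5))
out = per_token_loss(logits, labels)          # shape (2, 4), non-negative

# A token's risk score would then be (our assumed sign convention):
# risk = per_token_loss(utility_logits, labels) - per_token_loss(safety_logits, labels)
```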

Pipeline

✨Performance Illustration

We illustrate the safety and utility performance of our method below. For more details, please refer to the paper.

performance on safety and utility benchmarks

Visualization of models' output

✨ Quick Start

After cloning the repository, follow these steps to train and run inference.

Requirements

Install dependencies with `pip install -r requirements.txt`, then install the local packages:

```
pip install -e ./train/OpenRLHFBase
pip install -e ./train
```

If README figures are not displayed, place image files under assets/ (e.g., assets/pipeline.png).

Train

  • Train reference model

Run `./scripts/train_safety_model.sh` to train the safety-degraded model.

Run `./scripts/train_utility_model.sh` to train the utility-oriented model.

  • Token selection

Run `./scripts/safety_model_calculate_loss.sh` and `./scripts/utility_model_calculate_loss.sh` to compute the per-token loss files.

Run `./scripts/generate_token_mask.sh` to generate the token masking matrix.
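The mask-generation step above can be sketched as follows. This is an illustration, not the repository's script: we assume the loss files are `.npy` arrays of shape `(num_samples, seq_len)` with matching layouts, and that a fixed fraction of lowest-risk tokens is kept per sample (the file format, function name, and `keep_ratio` parameter are all hypothetical):

```python
import numpy as np

def build_token_mask(safety_loss_file: str, utility_loss_file: str,
                     keep_ratio: float = 0.8) -> np.ndarray:
    """Build a binary token mask from two per-token loss files.

    risk = utility_loss - safety_loss (our assumed sign convention):
    tokens the safety-degraded model fits unusually well score high
    and are masked out; 1 means 'train on this token'."""
    safety = np.load(safety_loss_file)
    utility = np.load(utility_loss_file)
    risk = utility - safety
    # Per-sample threshold: keep the k lowest-risk tokens in each row.
    k = max(1, int(keep_ratio * risk.shape[1]))
    thresholds = np.sort(risk, axis=1)[:, k - 1:k]   # (num_samples, 1)
    return (risk <= thresholds).astype(np.int8)

# Hypothetical usage:
# mask = build_token_mask("safety_losses.npy", "utility_losses.npy")
# np.save("token_mask.npy", mask)
```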

  • Token-level selective training

Run `./scripts/selective_finetuning.sh` to perform token-level selective fine-tuning.
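One common way to realize token-level selective training, shown here as a sketch rather than the repository's exact mechanism, is to set the labels of masked-out tokens to `-100`, which `torch.nn.CrossEntropyLoss` ignores by default, so those tokens contribute no gradient:

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def apply_token_mask(labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Replace labels of masked-out (unsafe) tokens with IGNORE_INDEX
    so they are excluded from the fine-tuning loss.
    `mask` is 1/True for tokens to keep."""
    return labels.masked_fill(~mask.bool(), IGNORE_INDEX)

labels = torch.tensor([[11, 12, 13, 14]])
mask = torch.tensor([[1, 1, 0, 1]])
masked = apply_token_mask(labels, mask)
# masked → tensor([[ 11,  12, -100,  14]])
```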

Evaluation

Our evaluation method follows the protocol of SEAL.

Models

We also provide the models trained with our method on Hugging Face.

| Model | Link |
| --- | --- |
| safety-degraded model | https://huggingface.co/Polly1231/safety_degrading_model |
| utility-oriented model | https://huggingface.co/Polly1231/utility_oriented_model |
| TOSS | https://huggingface.co/Polly1231/TOSS |

🤝 Acknowledgments

We sincerely thank the fantastic open-source projects whose work contributed to this implementation.

✏️ Citation

If you find TOSS useful, please cite our paper:

@article{li2026token,
  title={Token-level Data Selection for Safe LLM Fine-tuning},
  author={Li, Yanping and Liu, Zhening and Li, Zijian and Lin, Zehong and Zhang, Jun},
  journal={arXiv preprint arXiv:2603.01185},
  year={2026}
}

About

[ICLR26] code for paper "Token-level Data Selection for Safe LLM Fine-tuning"
