Example Notebook for Advanced AI Safety Training (SFT + GRPO) #3407
Replies: 2 comments 1 reply
Thank you, this is great!
Hey everyone,

Just wanted to post an update and share some key learnings from this project. After getting some great feedback and digging deeper, I've refactored the notebook to align with Hugging Face's best practices for chat templates and to correctly handle the data requirements for GRPO. The original version worked, but it hardcoded the chat format into the dataset. The new version is much more robust and reusable. Here's the updated, runnable notebook:

Key Learnings & Changes: This was a fantastic learning experience, and I wanted to document the subtle but critical distinction between preparing data for SFT versus preparing data for GRPO.

I hope these details are helpful for anyone else who might run into the same subtle issues.

Cheers
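To make the SFT-vs-GRPO data distinction concrete, here is a minimal plain-Python sketch. The example rows and the `to_grpo_row` helper are my own illustration (not code from the notebook); the `"messages"` and `"prompt"` column names follow TRL's dataset conventions.

```python
# Hypothetical example rows illustrating the data-prep difference.
# SFT: the trainer needs the FULL conversation, including the target
# assistant reply, so it can compute a supervised loss on that reply.
sft_row = {
    "messages": [
        {"role": "user", "content": "Is this dosage safe?"},
        {"role": "assistant",
         "content": "<analysis>Checking the guideline...</analysis>"
                    "<final>Consult a clinician.</final>"},
    ]
}

# GRPO: the trainer only needs the PROMPT; the model generates several
# candidate completions itself, and a reward function scores them.
grpo_row = {
    "prompt": [
        {"role": "user", "content": "Is this dosage safe?"},
    ]
}

def to_grpo_row(sft_row):
    """Strip the assistant turn so an SFT-style row can be reused for GRPO."""
    prompt = [m for m in sft_row["messages"] if m["role"] != "assistant"]
    return {"prompt": prompt}

assert to_grpo_row(sft_row) == grpo_row
```

The point is that hardcoding a chat format (special tokens, role markers) into these rows couples the dataset to one model family; keeping structured message dicts and letting the tokenizer's chat template do the formatting keeps the data reusable.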
Hi Unsloth Team,
First, I just want to say thank you for creating such a powerful and efficient library. It's been instrumental in my work.
I've put together a comprehensive, end-to-end example notebook that demonstrates a full SFT-then-GRPO pipeline for a high-stakes AI safety task. The notebook is fully runnable on Kaggle and is completely self-contained, as it synthetically generates its own dataset.
You can view and run the notebook here:
https://www.kaggle.com/code/surfiniaburger/dipg-gemma-grpo-3
What the notebook demonstrates:
- How to fine-tune a model with `SFTTrainer` and then harden its behavior with custom reward functions using the `GRPOTrainer`.
- How to enforce a structured output format (`analysis` -> `final`), which is a common requirement for building reliable agents.

Why this might be a good example for the Unsloth community:
- It showcases `unsloth`'s seamless compatibility with the more advanced features of `trl`, like `GRPOTrainer`.

I wanted to share this with you and the community.
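The notebook's actual reward functions aren't reproduced here, but as a rough sketch of the kind of format-enforcing reward `GRPOTrainer` can use: the regex and the 1.0/0.0 scoring below are my own simplification, and the `<analysis>`/`<final>` tag names are assumed for illustration.

```python
import re

# Hypothetical structure check: reward completions that follow an
# "analysis first, then final answer" format, penalize everything else.
STRUCTURE = re.compile(
    r"<analysis>.+?</analysis>\s*<final>.+?</final>\s*$", re.DOTALL
)

def format_reward(completions, **kwargs):
    """Return 1.0 for well-structured completions, 0.0 otherwise.

    TRL's GRPOTrainer accepts plain callables of this shape
    (completions plus keyword args, returning a list of floats) when
    completions are strings; conversational datasets pass message
    dicts instead, so adjust the unpacking accordingly.
    """
    return [1.0 if STRUCTURE.search(c) else 0.0 for c in completions]

good = ("<analysis>Dose exceeds the limit.</analysis>"
        "<final>Do not administer.</final>")
bad = "Sure, that looks fine!"
print(format_reward([good, bad]))  # -> [1.0, 0.0]
```

Because the reward is computed from the completion text alone, GRPO needs no labeled responses, which is why the prompt-only data format above suffices.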
Thanks again for your incredible work on this library.