Example Notebook for Advanced AI Safety Training (SFT + GRPO) #3407
Replies: 2 comments 1 reply
Thank you, this is great!
Hey everyone,

Just wanted to post an update and share some key learnings from this project. After getting some great feedback and digging deeper, I've refactored the notebook to align with Hugging Face's best practices for chat templates and to correctly handle the data requirements for GRPO. The original version worked, but it hardcoded the chat format into the dataset. The new version is much more robust and reusable. Here's the updated, runnable notebook:

Key Learnings & Changes: This was a fantastic learning experience, and I wanted to document the subtle but critical distinction between preparing data for SFT versus preparing data for GRPO.

I hope these details are helpful for anyone else who might run into the same subtle issues.

Cheers
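To make the SFT-vs-GRPO data distinction concrete, here is a minimal plain-Python sketch. The example rows and the `to_grpo_row` helper are my own illustration (not code from the notebook); the `"messages"` and `"prompt"` column names follow TRL's dataset conventions.

```python
# Hypothetical example rows illustrating the data-prep difference.
# SFT: the trainer needs the FULL conversation, including the target
# assistant reply, so it can compute a supervised loss on that reply.
sft_row = {
    "messages": [
        {"role": "user", "content": "Is this dosage safe?"},
        {"role": "assistant",
         "content": "<analysis>Checking the guideline...</analysis>"
                    "<final>Consult a clinician.</final>"},
    ]
}

# GRPO: the trainer only needs the PROMPT; the model generates several
# candidate completions itself, and a reward function scores them.
grpo_row = {
    "prompt": [
        {"role": "user", "content": "Is this dosage safe?"},
    ]
}

def to_grpo_row(sft_row):
    """Strip the assistant turn so an SFT-style row can be reused for GRPO."""
    prompt = [m for m in sft_row["messages"] if m["role"] != "assistant"]
    return {"prompt": prompt}

assert to_grpo_row(sft_row) == grpo_row
```

The point is that hardcoding a chat format (special tokens, role markers) into these rows couples the dataset to one model family; keeping structured message dicts and letting the tokenizer's chat template do the formatting keeps the data reusable.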
Hi Unsloth Team,
First, I just want to say thank you for creating such a powerful and efficient library. It's been instrumental in my work.
I've put together a comprehensive, end-to-end example notebook that demonstrates a full SFT-then-GRPO pipeline for a high-stakes AI safety task. The notebook is fully runnable on Kaggle and is completely self-contained, as it synthetically generates its own dataset.
You can view and run the notebook here:
https://www.kaggle.com/code/surfiniaburger/dipg-gemma-grpo-3
What the notebook demonstrates:
- How to fine-tune a model with `SFTTrainer` and then harden its behavior with custom reward functions using the `GRPOTrainer`.
- How to enforce a structured output format (`analysis` -> `final`), which is a common requirement for building reliable agents.

Why this might be a good example for the Unsloth community:
- It showcases `unsloth`'s seamless compatibility with the more advanced features of `trl`, like `GRPOTrainer`.

I wanted to share this with you and the community.
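The notebook's actual reward functions aren't reproduced here, but as a rough sketch of the kind of format-enforcing reward `GRPOTrainer` can use: the regex and the 1.0/0.0 scoring below are my own simplification, and the `<analysis>`/`<final>` tag names are assumed for illustration.

```python
import re

# Hypothetical structure check: reward completions that follow an
# "analysis first, then final answer" format, penalize everything else.
STRUCTURE = re.compile(
    r"<analysis>.+?</analysis>\s*<final>.+?</final>\s*$", re.DOTALL
)

def format_reward(completions, **kwargs):
    """Return 1.0 for well-structured completions, 0.0 otherwise.

    TRL's GRPOTrainer accepts plain callables of this shape
    (completions plus keyword args, returning a list of floats) when
    completions are strings; conversational datasets pass message
    dicts instead, so adjust the unpacking accordingly.
    """
    return [1.0 if STRUCTURE.search(c) else 0.0 for c in completions]

good = ("<analysis>Dose exceeds the limit.</analysis>"
        "<final>Do not administer.</final>")
bad = "Sure, that looks fine!"
print(format_reward([good, bad]))  # -> [1.0, 0.0]
```

Because the reward is computed from the completion text alone, GRPO needs no labeled responses, which is why the prompt-only data format above suffices.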
Thanks again for your incredible work on this library.