[issue] Surprising Performance Drop When Using <think> Instead of <reasoning> as Custom Tags for Fine-tuning #3039
Replies: 3 comments
-
Ok, this is a very odd issue, since the instruct versions with reasoning use `<think>`, so theoretically `<think>` should perform better than `<reasoning>`. But I do know you're using Qwen3-Base, which shouldn't have that much impact. Honestly, we aren't exactly sure what the issue is, since your results showcase the opposite of what Qwen3 uses, so the solution might just be to use the `<reasoning>` tag.
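One hedged way to probe this: Qwen3's tokenizer is believed to reserve `<think>` and `</think>` as dedicated tokens in its vocabulary, while `<reasoning>` is split into ordinary subword pieces. A minimal sketch to check this yourself, assuming the `unsloth/Qwen3-4B-Base` repo id mentioned in this thread:

```python
# Minimal sketch: compare how the Qwen3 tokenizer encodes each candidate tag.
# Assumption: the model id matches the one used in this thread; any Qwen3
# tokenizer is expected to behave the same way.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Qwen3-4B-Base")

for tag in ["<think>", "</think>", "<reasoning>", "</reasoning>"]:
    ids = tok.encode(tag, add_special_tokens=False)
    # A single id suggests the tag is a dedicated vocabulary token;
    # several ids mean it is split into ordinary subword pieces.
    print(f"{tag!r} -> {ids} {tok.convert_ids_to_tokens(ids)}")
```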
-
Thanks for the quick response! I'm using the qwen3-4b-base model, not an instruct-tuned version, which is consistent with the official Unsloth example I'm following.
-
For now I will be moving this issue to a discussion, but if you have any more questions please feel free to ask, or if anyone wants to add to the discussion!
-
Hello Unsloth team!
Please excuse this beginner question. I'm new to the world of fine-tuning, and your library has been a fantastic and accessible starting point for me. While experimenting, I've encountered some model behavior that I don't understand and was hoping to get some clarification on what feels like a fundamental concept.
1. Did you update?
Yes, `pip install --upgrade unsloth` is up to date.
2. Colab or Kaggle or local / cloud
Local.
3. Number GPUs used
1x NVIDIA GeForce RTX 4090
4. Which notebook? Please link!
I only modified the custom tags in the official qwen3-4b GRPO example and removed some unnecessary output checks. Below is the link to the online notebook: https://colab.research.google.com/drive/1id4WqGn3yDZ4uOEmQI5HCR8UM1S64H07?usp=sharing
5. Which Unsloth version, TRL version, etc.?
Transformers: 4.53.2. vLLM: 0.9.2.
NVIDIA GeForce RTX 4090. Num GPUs = 2. Max memory: 23.514 GB. Platform: Linux.
Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
6. Which trainer?
`GRPOTrainer` (but the same issue is observable with `SFTTrainer`).

Problem Description
I am trying to fine-tune the `unsloth/Qwen3-8B-Base` model for mathematical reasoning. My goal is to teach the model to first "think" about the problem and then provide a final answer, using a specific format.

I conducted an experiment with two scenarios. The only difference between them was the custom tags I used in my data formatting.
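To make the setup concrete, here is a minimal sketch of how the tags enter the training data in this kind of setup (names like `reasoning_start` are my own, not necessarily the notebook's):

```python
# Minimal sketch (assumed names, not the notebook's exact code): the custom
# tags are injected into the system prompt so the model is asked to emit
# <tag>working</tag><answer>result</answer>.
reasoning_start, reasoning_end = "<reasoning>", "</reasoning>"  # Scenario A
# reasoning_start, reasoning_end = "<think>", "</think>"        # Scenario B
answer_start, answer_end = "<answer>", "</answer>"

system_prompt = (
    "You are given a problem. Think about the problem and show your working. "
    f"Place your working between {reasoning_start} and {reasoning_end}, "
    f"then give your final answer between {answer_start} and {answer_end}."
)
```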
Scenario A: This works perfectly.
I used `<reasoning>` and `<answer>` as my custom tags. The model learns the format very well and generates responses that follow the `assistant: <reasoning>...</reasoning><answer>...</answer>` structure.

Scenario B: This performs very poorly.
I changed the tags from `<reasoning>` to `<think>`, so the target format became `assistant: <think>...</think><answer>...</answer>`. To my surprise, the model completely fails to learn this format. The output is often incoherent, and it doesn't follow the desired structure at all.

Is there something wrong with my code? How should I fix it? Thank you for your time!
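For reference, a minimal sketch of the kind of tag-parameterized format check I mean (names are my own, and the notebook's actual reward code may differ):

```python
import re

# Sketch of a tag-parameterized format reward for TRL's GRPOTrainer.
# Assumptions: completions arrive as plain strings, and full credit is given
# when the reasoning and answer blocks appear in order.
reasoning_start, reasoning_end = "<think>", "</think>"  # swap the tags here
answer_start, answer_end = "<answer>", "</answer>"

format_pattern = re.compile(
    re.escape(reasoning_start) + r".+?" + re.escape(reasoning_end)
    + r"\s*" + re.escape(answer_start) + r".+?" + re.escape(answer_end),
    re.DOTALL,
)

def format_reward(completions, **kwargs):
    """Return 1.0 per completion that matches the target tag structure."""
    return [1.0 if format_pattern.search(c) else 0.0 for c in completions]
```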