
models/templates: add mistralai/Mistral-Small-3.1-24B-Instruct-2503 template with tool calling support #14148

Open · wants to merge 2 commits into master from add-mistral-small-chat-template

Conversation

@bretello commented on Jun 12, 2025

Summary

This PR adds a tool-calling chat template for Mistral-Small-3.1-24B-Instruct-2503 and fixes a bug that leaves Mistral Small models with a broken chat template.

Details

Trying to run Mistral AI's Mistral-Small-3.1-24B-Instruct-2503 with no chat template currently results in failure when using tool calling.

Starting llama-server like so:

./build/bin/llama-server -m mistral-small-3.1-24b-instruct-2503.gguf \
    --n-gpu-layers -1 \
    --host 0.0.0.0 --port 8000 \
    --ctx-size 0 --temp 0.15 \
    --jinja \
    --verbose

Executing a query with a tool call (for example, a request along the lines sketched below) then results in the prompt being set to the literal string mistral-v7-tekken, due to how the tool-calling template is currently prepared when it is not present in the gguf.
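For reference, the failing request is an ordinary OpenAI-style chat completion carrying a tools array; the sketch below shows the shape of such a request against the server started above (the endpoint path follows llama-server's OpenAI-compatible API, while the get_weather tool name and schema are made up for illustration):

# Hypothetical tool-calling request; the tool name/schema are illustrative,
# any request carrying a "tools" array exercises the same code path.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "What is the weather in Paris?"}
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }
      ]
    }'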

Looking at verbose logs, one can see that the prompt is broken:

Log from the above command, with the prompt section emphasized:
slot launch_slot_: id  0 | task 1 | launching slot : {"id":0,"id_task":1,"n_ctx":32768,"speculative":false,"is_processing":false,"non_causal":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.15000000596046448,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma"
:-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.1
0000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \"{\" space alternative-0-tool-call-name-kv \",\" space alternative-0-tool-call-arguments-kv \"}
\" space\nalternative-0-tool-call-arguments ::= \"{\" space alternative-0-tool-call-arguments-name-kv \"}\" space\nalternative-0-tool-call-arguments-kv ::= \"\\\"arguments\\\"\" space \":\" space alternative-0-tool-call-arguments\nalternative-0-tool-call-arguments-name-kv ::= \"\\\"name\\\"\" space \":\" space string\nalternative-0-tool-call-kv :
:= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-0-tool-call-name ::= \"\\\"Man\\\"\" space\nalternative-0-tool-call-name-kv ::= \"\\\"name\\\"\" space \":\" space alternative-0-tool-call-name\nalternative-1 ::= \"{\" space alternative-1-response-kv \"}\" space\nalternative-1-response-kv ::= \"\\\"response\\\"\" spa
ce \":\" space string\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nroot ::= alternative-0 | alternative-1\nspace ::= | \" \" | \"\\n\"{1,2} [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Generic","reasoning_format":"d
eepseek","reasoning_in_content":false,"thinking_forced_open":false,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},

"prompt":"<s>mistral-v7-tekken",

"next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}

Providing a template with --chat-template-file solves the issue:

./build/bin/llama-server -m mistral-small-3.1-24b-instruct-2503.gguf \
    --n-gpu-layers -1 \
    --host 0.0.0.0 --port 8000 \
    --ctx-size 0 --temp 0.15 \
    --jinja --chat-template-file models/templates/mistralai-Mistral-Small-3.1-24B-Instruct-2503.jinja
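As a quick sanity check that the file was actually picked up (instead of the mistral-v7-tekken fallback name), the resolved template shows up in the "main: chat template, ..." startup line quoted below; assuming the build also exposes it through the server's /props endpoint (the endpoint and field name are an assumption here, not something this PR touches), it can be queried at runtime as well:

# Sanity check (sketch): ask the running server which chat template it resolved.
# The /props endpoint and its "chat_template" field are assumed; if unavailable,
# the "main: chat template, ..." line in the startup log carries the same information.
curl -s http://localhost:8000/props | python3 -m json.tool | grep -i chat_template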

Related: #13398

cc @ngxson

bretello force-pushed the add-mistral-small-chat-template branch from 25c463c to e539831 on June 12, 2025 at 14:52
// Mistral-Small-2503 does not have built-in chat template
llama_vocab_pre_type pre_type = model->vocab.get_pre_type();
if (pre_type == LLAMA_VOCAB_PRE_TYPE_TEKKEN && model->layers.size() == 40) {
return "mistral-v7-tekken";
@bretello (Author)

The problem with this one-off fix is that there's no logic to expand this string into a template. For example, when using llama-server, this will always cause the prompt to be set to <s>mistral-v7-tekken if the gguf doesn't have a chat template.

In my specific case (tool calling), I had a chat template but not a tool-calling chat template, resulting in this line always executing and breaking generation.

@ngxson (Collaborator) commented on Jun 12, 2025

I don't see why this should be removed. Many users run Mistral Small without --chat-template, and this will now break most use cases.

Even with this removed, you still need --jinja --chat-template-file to make it work correctly.

And worst of all, someone will do --jinja --chat-template mistral-v7-tekken, which brings back exactly the same issue.

In short, I'm against this removal, as it makes the UX even worse.

@bretello (Author) commented on Jun 13, 2025

Thanks @ngxson, perhaps I'm missing something, but with this patch (the gguf I'm using does have a chat template):

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index c64bf9de..a3b6c41b 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -13788,13 +13788,15 @@ const char * llama_model_chat_template(const llama_model * model, const char * n
         // Mistral-Small-2503 does not have built-in chat template
         llama_vocab_pre_type pre_type = model->vocab.get_pre_type();
         if (pre_type == LLAMA_VOCAB_PRE_TYPE_TEKKEN && model->layers.size() == 40) {
+            LLAMA_LOG_WARN("FORCING mistral-v7-tekken because the vocab matches, key=%s\n", key.c_str());
             return "mistral-v7-tekken";
         }
 
         return nullptr;
     }
-
-    return it->second.c_str();
+    LLAMA_LOG_WARN("FORCING mistral-v7-tekken because I'm debugging, but key=%s was found\n", key.c_str());
+    return "mistral-v7-tekken";
+    // return it->second.c_str();
 }
 
 uint64_t llama_model_n_params(const llama_model * model) {
diff --git a/tools/server/server.cpp b/tools/server/server.cpp
index 1b1cf439..e1e74db6 100644
--- a/tools/server/server.cpp
+++ b/tools/server/server.cpp
@@ -4191,7 +4191,7 @@ int main(int argc, char ** argv) {
 
             const auto & prompt = data.at("prompt");
             // TODO: this log can become very long, put it behind a flag or think about a more compact format
-            //SRV_DBG("Prompt: %s\n", prompt.is_string() ? prompt.get<std::string>().c_str() : prompt.dump(2).c_str());
+            SRV_INF("Prompt: %s\n", prompt.is_string() ? prompt.get<std::string>().c_str() : prompt.dump(2).c_str());
 
             // process files
             mtmd::bitmaps bitmaps;

I get the following logs:

...
FORCING mistral-v7-tekken because I'm debugging, but key=tokenizer.chat_template was found
FORCING mistral-v7-tekken because the vocab matches, key=tokenizer.chat_template.tool_use
Failed to infer a tool call example (possible template bug)
Failed to infer a tool call example (possible template bug)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, chat_template: mistral-v7-tekken, example_format: 'mistral-v7-tekken'
...

Note that the chat template is set to mistral-v7-tekken, which is wrong.

And if I query the model, I get nonsensical output about the Tekken game:

> What is 2+2?

    Joined: Fri Apr 26, 2019 10:28 am

### Re: [WIP] Tekken 7 Modding Tools

> *Ryochan7 wrote: ↑ Mon May 06, 2019 12:07 pm* I'm not sure if this is the right place to ask this, but I was wondering if there is a way^C
Aborted!

From the logs, since I force-enabled prompt logging:

...
main: model loaded
main: chat template, chat_template: mistral-v7-tekken, example_format: 'mistral-v7-tekken'
main: server is listening on http://0.0.0.0:8000 - starting the main loop
srv  update_slots: all slots are idle
srv  update_slots: all slots are idle
srv    operator(): Prompt: mistral-v7-tekken
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 1 | processing task
slot update_slots: id  0 | task 1 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 8
slot update_slots: id  0 | task 1 | kv cache rm [0, end)
slot update_slots: id  0 | task 1 | prompt processing progress, n_past = 8, n_tokens = 8, progress = 1.000000
slot update_slots: id  0 | task 1 | prompt done, n_past = 8, n_tokens = 8
srv    operator(): Prompt: mistral-v7-tekken  <---- the prompt should be "What is 2+2?"
...

You can see that after evaluating the (wrong) template, the prompt is set to mistral-v7-tekken.
