
llama-model : add dots.llm1 architecture support (#14044) #14118


Open · wants to merge 1 commit into master from dots1_squished_squashy
Conversation

@Noeda (Contributor) commented Jun 11, 2025

Adds support for the "dots.llm1" architecture. I decided to shorten that to dots1/DOTS1 in the code.

Tracking issue: #14044


These are the only models that currently use this architecture:

* https://huggingface.co/rednote-hilab/dots.llm1.inst
* https://huggingface.co/rednote-hilab/dots.llm1.base

There is also a paper: https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf (the link can be found on their Hugging Face page).

And RedNote appears to have a GitHub page for this model as well: https://github.com/rednote-hilab/dots.llm1

The architecture is a bit of a mix: DeepseekV2-style MoE code but Qwen3-style attention:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

The model is a 32k-context MoE model with 142B total parameters and 14B activated parameters. It has its own new chat template with dedicated tokens.
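To illustrate the routed-experts-plus-shared-expert idea behind DeepseekV2-style MoE blocks (the attention side is the Qwen3-like part), here is a minimal standalone numpy sketch. The sizes, activation, and gating details are made up for illustration and are not the actual dots.llm1 configuration or the llama.cpp graph code:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 64, 128, 8, 2   # made-up sizes, NOT dots.llm1's

# One tiny 2-layer MLP per routed expert, plus one always-on shared expert.
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(n_experts)]
shared  = (rng.standard_normal((d_model, d_ff)) * 0.02,
           rng.standard_normal((d_ff, d_model)) * 0.02)
router  = rng.standard_normal((d_model, n_experts)) * 0.02

def mlp(x, weights):
    w_in, w_out = weights
    return np.maximum(x @ w_in, 0.0) @ w_out      # ReLU stands in for the real activation

def moe_layer(x):
    scores = x @ router                           # routing logits for one token, shape (n_experts,)
    top = np.argsort(scores)[-top_k:]             # pick the top-k routed experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                            # softmax over the selected experts
    routed = sum(g * mlp(x, experts[i]) for g, i in zip(gate, top))
    return routed + mlp(x, shared)                # shared expert output is always added

x = rng.standard_normal(d_model)
print(moe_layer(x).shape)                         # (64,)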

I think this may be the lab's very first model; I see no other history from them and had never heard of them before. The model itself seems fairly okay, with similar smarts to other recent local models of this size, but I don't dare make strong claims about whether it is good or not, since my experience is purely anecdotal.


This PR has:

  1. The various _DOTS1 constants added across the codebase wherever a new architecture is expected.
  2. A DotsModel class added to convert_hf_to_gguf.py to convert the models.
  3. The chat template added to llama-chat.cpp so llama-server can use it, following the Hugging Face transformers code.

The extent of my testing is that I've checked the model doesn't break into gibberish even on long contexts, and that the chat template is applied correctly (based on this testing, the rednote team fixed their HF safetensors tokenization config files, which had the wrong EOS token).

Some examples of prompting here: #14044 (comment)

I was planning to do better verification before opening this PR than "prompt it and check it doesn't respond in gibberish", but did not have the time. I'm about to travel for two weeks and will lose access to the Mac Studio I usually develop on, so I might have to ask someone to run perplexity tests or double-check the computation graph (or maybe I'll later rent a big-memory server and do that checking myself). I think the computation graph is likely correct, or almost correct, computation-wise, because it hasn't visibly broken even on long prompts (the longest I've tried is about ~27k context). Comparing against the HF implementation would be nice too; I did that for GLM-4 when it came out recently. But I can do this verification myself later on.

The conversion code (convert_hf_to_gguf.py) and the computation graph code (llama-model.cpp) were created by following the Qwen3 and Deepseek code, since the model architecture uses parts from them almost as-is. IMO the parts that could use more scrutiny in this PR are llm_build_dots1 and the case LLM_ARCH_DOTS1: block that decides which tensors to load, plus checking that I didn't forget anything that is usually added with a new architecture.

I also wanted to run it under valgrind or some other leak detection in case the graph code is leaking something (not sure it would visibly warn regardless; at least I didn't get compiler warnings about unused variables).

The transformers reference code from the rednote team is still in a PR, so this model is not yet part of the transformers Python library: huggingface/transformers#38143 (it is still open as of opening this PR).

@Noeda (Contributor, Author) commented Jun 11, 2025

Forgot to mention: @ddh0 made some quants, although I think right now you should run these with --override-kv tokenizer.ggml.eos_token_id=int:151649, because when I checked the metadata they had the wrong EOS token (I presume the source safetensor files predate the upstream team's EOS token fix on the Hugging Face side):

https://huggingface.co/ddh0/dots.llm1.inst-GGUF-Q4_0-EXPERIMENTAL

github-actions bot added the python (python script changes) label on Jun 11, 2025

This commit adds support for "dots.llm1" (I decided to shorten it to
dots1 or DOTS1 in the code generally) architecture.

The only models that exist as of the writing of this commit that follow this
architecture are "dots.llm1.inst" and "dots.llm1.base", from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

---

Parts in this commit:

Adding various "_DOTS1" constants around the codebase where a new
architecture is expected.

DotsModel in convert_hf_to_gguf.py to be used on Dots1ForCausalLM, to
convert the model to .ggufs. It was made by following the Qwen and
DeepseekV2 converters (mostly the Deepseek one was relevant).

I added the graph code and architecture code in llama-model.cpp; it too
was made by following the qwen3 and deepseek codepaths, with some trial
and error until coherent text came out.

I added detection for the dots chat template so that it can pick it up.

As of writing this (10 June 2025) I have not had the opportunity to do
more thorough testing than "prompt it and check whether it responds with
gibberish".
@Noeda force-pushed the dots1_squished_squashy branch from 1c1517b to 16dc0f4 on June 11, 2025 06:16
@Noeda (Contributor, Author) commented Jun 11, 2025

Force-pushed a tiny fix for a linter error ^

@jacekpoplawski commented Jun 11, 2025

I was able to run it:

~/git/llama.cpp/build_2025.06.11_dots$ ./bin/llama-cli -ngl 50 -m /mnt/models3/dots.llm1.inst-Q4_0.gguf --override-kv tokenizer.ggml.eos_token_id=int:151649 -p "who are you?" 2>/dev/null
You are a helpful assistant.who are you?I'm dots, your AI assistant created by rednote-hilab! 🌟I'm here to help you with all kinds of questions—whether you need information, advice, or just someone to chat with. I can analyze documents, summarize text, explain concepts, and even brainstorm ideas. How can I assist you today? 😊
~/git/llama.cpp/build_2025.06.11_dots$ ./bin/llama-cli -ngl 50 -m /mnt/models3/dots.llm1.inst-Q4_0.gguf --override-kv tokenizer.ggml.eos_token_id=int:151649 -p "list 10 AI companies" 2>/dev/null
You are a helpful assistant.list 10 AI companiesHere’s a list of **10 notable AI companies** (as of mid-2**2024**), spanning both well-established giants and innovative startups:

### **1. Big Tech (AI Leaders)**
1. **Google (Alphabet)** – DeepMind, Google AI, TensorFlow
2. **Microsoft** – Azure AI, Copilot, OpenAI partnership
3. **Meta (Facebook)** – FAIR, Llama models, AI research
4. **Amazon** – AWS AI/ML, Alexa, Bedrock (foundation models)
5. **Apple** – Core ML, Siri advancements, AI in AR/VR

### **2. OpenAI & AI Pioneers**
6. **OpenAI** – ChatGPT, GPT-4, DALL·E
7. **Anthropic** – Claude AI, safety-focused AI

### **3. AI Infrastructure & Tools**
8. **NVIDIA** – GPUs for AI, Omniverse, DGX systems
9. **Hugging Face** – Leader in open-source ML models (Transformers library)

### **4. Emerging/Vertical AI Startups**
10. **Runway** – GenAI for video/creative tools (used in Hollywood)

### **Honorable Mentions:**
- **Tesla** (Autopilot, Dojo supercomputing)
- **DeepMind** (separate from Google, but now integrated)
- **Cohere** (enterprise NLP)
- **Inflection AI** (Pi chatbot)

Would you like a focus on a specific niche (e.g., healthcare AI, autonomous systems)?
load_tensors: offloaded 50/63 layers to GPU
load_tensors:        CUDA0 model buffer size = 21158.12 MiB
load_tensors:        CUDA1 model buffer size = 21158.12 MiB
load_tensors:        CUDA2 model buffer size =  9956.76 MiB
load_tensors:        CUDA3 model buffer size =  9956.76 MiB
load_tensors:   CPU_Mapped model buffer size = 14620.13 MiB

(...)

llama_perf_sampler_print:    sampling time =      25.94 ms /   306 runs   (    0.08 ms per token, 11796.45 tokens per second)
llama_perf_context_print:        load time =   17196.71 ms
llama_perf_context_print: prompt eval time =     521.42 ms /    13 tokens (   40.11 ms per token,    24.93 tokens per second)
llama_perf_context_print:        eval time =   17150.21 ms /   292 runs   (   58.73 ms per token,    17.03 tokens per second)
llama_perf_context_print:       total time =   18636.09 ms /   305 tokens

@@ -5262,6 +5262,108 @@ def prepare_tensors(self):
raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register("Dots1ForCausalLM")
class DotsModel(TextModel):
Comment on lines +186 to +190
} else if (tmpl_contains("<|userprompt|>") &&
tmpl_contains("<|endofuserprompt|>") &&
tmpl_contains("<|response|>") &&
tmpl_contains("<|endofresponse|>")) {
return LLM_CHAT_TEMPLATE_DOTS1;

Suggested change:
-    } else if (tmpl_contains("<|userprompt|>") &&
-               tmpl_contains("<|endofuserprompt|>") &&
-               tmpl_contains("<|response|>") &&
-               tmpl_contains("<|endofresponse|>")) {
-        return LLM_CHAT_TEMPLATE_DOTS1;
+    } else if (tmpl_contains("<|endofuserprompt|>")) {
+        return LLM_CHAT_TEMPLATE_DOTS1;

We don't need to check exhaustively, since this is the only model that uses the <|endofuserprompt|> marker.

ss << "<|system|>" << message->content << "<|endofsystem|>";
} else if (role == "user") {
ss << "<|userprompt|>" << message->content << "<|endofuserprompt|>";
} else if (role == "assistant") {

Suggested change:
-    } else if (role == "assistant") {
+    } else {

To make it consistent with the other code blocks.
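For reference, here is a minimal standalone Python sketch of how a conversation renders with the markers shown in the snippets above. It mirrors the llama-chat.cpp logic only informally; appending a trailing <|response|> to prompt the assistant turn is my assumption here, not something verified against the actual template.

# Informal sketch of the dots1 chat format using the markers from the snippet
# above. The trailing "<|response|>" for the assistant turn is an assumption.
def render_dots1(messages, add_assistant_prefix=True):
    out = []
    for m in messages:
        if m["role"] == "system":
            out.append("<|system|>" + m["content"] + "<|endofsystem|>")
        elif m["role"] == "user":
            out.append("<|userprompt|>" + m["content"] + "<|endofuserprompt|>")
        else:  # assistant
            out.append("<|response|>" + m["content"] + "<|endofresponse|>")
    if add_assistant_prefix:
        out.append("<|response|>")
    return "".join(out)

print(render_dots1([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "who are you?"},
]))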

Comment on lines +2129 to +2132
MODEL_ARCH.DOTS1: [
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],

I think this is not necessary

MODEL_TENSOR.ATTN_K_NORM,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,

Suggested change:
-    MODEL_TENSOR.ATTN_ROT_EMBD,

@DocShotgun (Contributor) commented Jun 12, 2025

Doing some local testing at the moment on ddh0's q4_0 quant. Text seems coherent so far, and I'm getting decent speed on an RTX PRO 6000 96 GB with 32k ctx allocated, q8_0 cache, and flash attention:

load_tensors:        CUDA0 model buffer size = 76515.78 MiB
load_tensors:   CPU_Mapped model buffer size =   334.12 MiB
...
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size = 16864.00 MiB
llama_kv_cache_unified: size = 16864.00 MiB ( 32768 cells,  62 layers,  1 seqs), K (q8_0): 8432.00 MiB, V (q8_0): 8432.00 MiB
llama_context:      CUDA0 compute buffer size =   321.00 MiB
llama_context:  CUDA_Host compute buffer size =    72.01 MiB
...
prompt eval time =     241.08 ms /   416 tokens (    0.58 ms per token,  1725.58 tokens per second)
       eval time =     484.00 ms /    38 tokens (   12.74 ms per token,    78.51 tokens per second)
      total time =     725.08 ms /   454 tokens

I noticed the occasional Chinese character appearing mid-text, or the occasional typo/nonsense word, with sampling settings of temp 1 and min-p 0.1. I'm not sure whether to attribute this to the model itself, the q4_0 quantization, the q8_0 cache, or the arch implementation. For example, the model invented the word "smirpsilon" (tokenized as sm + ir + psilon), and the logprobs look very strange at that position:
[screenshot: logprobs at the "smirpsilon" position]
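For context on those sampler settings: min-p keeps only tokens whose probability is at least min_p times that of the most likely token, so a flat, "baffled" distribution like the one in the screenshot leaves many implausible candidates in play. A rough standalone sketch of the idea, not llama.cpp's actual sampler code:

import numpy as np

def min_p_filter(logits, min_p=0.1, temp=1.0):
    # Keep tokens with prob >= min_p * max_prob, then renormalize.
    z = logits / temp
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# A flat ("baffled") distribution keeps almost every candidate...
print(int((min_p_filter(np.zeros(50)) > 0).sum()))    # 50
# ...while a confident distribution keeps only the plausible few.
peaked = np.array([5.0, 4.0, 3.5] + [0.0] * 47)
print(int((min_p_filter(peaked) > 0).sum()))          # 3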

@jukofyork (Collaborator) commented Jun 12, 2025

> Doing some local testing at the moment on ddh0's q4_0 quant. [...] the logprobs look very strange at that position.

That does look pretty odd - when qwen-2 and other Chinese models do this, the Chinese word they insert usually makes sense when you translate it, but this just looks garbled.

It could be an overflow, maybe? IIRC, the qwen-2 architecture suffered really badly from overflows, both here in llama.cpp and for people trying to generate exllamav2 quants (usually the activations in the last couple of layers would grow larger than the range of FP16).
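For reference, FP16 can represent values only up to about 65504, so an activation past that collapses to infinity and then poisons everything downstream. A tiny numpy illustration of that failure mode (not a diagnosis of this model):

import numpy as np

print(np.finfo(np.float16).max)               # 65504.0, the largest finite FP16 value
acts = np.array([60000.0, 66000.0], dtype=np.float32)
print(acts.astype(np.float16))                # [60000.    inf] -- the larger value overflows
# Once an activation is inf, a softmax's max-subtraction produces inf - inf = nan,
# and NaN logits make the sampler's token choices effectively random/garbled.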

@DocShotgun (Contributor) commented:
It would be interesting to see if the vllm/transformers implementation has any issues like this. The logprobs make it look like the model is absolutely baffled at that token position - as none of the options that show up there are sane for what should follow "smir" lol.

@jukofyork (Collaborator) commented Jun 12, 2025

@DocShotgun I'm jealous of the RTX PRO 6000 96gb!

I had a Max-Q version on order since March, but after Scan delayed it for the 4th time:

https://www.scan.co.uk/shop/computer-hardware/gpu-nvidia-workstation/nvidia-workstation-visualisation-graphics-cards

I cancelled it, because it will be sod's law that the Max-Q OEM will be the very last one they get :/

(just noticed the dates have all moved again to mid/late July now!)
