
llama-model : add dots.llm1 architecture support (#14044) #14118


Open · wants to merge 1 commit into master from dots1_squished_squashy
Conversation

@Noeda (Contributor) commented Jun 11, 2025

Adds support for the "dots.llm1" architecture. I decided to shorten that to dots1/DOTS1 in the code.

Tracking issue: #14044


These are the only models that currently use this architecture:

* https://huggingface.co/rednote-hilab/dots.llm1.inst
* https://huggingface.co/rednote-hilab/dots.llm1.base

There is also a paper: https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf (the link can be found on their Hugging Face page).

And RedNote appears to have a GitHub page for this model as well: https://github.com/rednote-hilab/dots.llm1

The architecture is a bit of a mix: DeepseekV2-style MoE code but Qwen3-style attention:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

The model is a 32k-context MoE model with 142B total parameters and 14B activated parameters. It has its own new chat template with dedicated tokens.
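To illustrate the routed-experts-plus-shared-expert idea behind DeepseekV2-style MoE blocks (the attention side is the Qwen3-like part), here is a minimal standalone numpy sketch. The sizes, activation, and gating details are made up for illustration and are not the actual dots.llm1 configuration or the llama.cpp graph code:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 64, 128, 8, 2   # made-up sizes, NOT dots.llm1's

# One tiny 2-layer MLP per routed expert, plus one always-on shared expert.
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(n_experts)]
shared  = (rng.standard_normal((d_model, d_ff)) * 0.02,
           rng.standard_normal((d_ff, d_model)) * 0.02)
router  = rng.standard_normal((d_model, n_experts)) * 0.02

def mlp(x, weights):
    w_in, w_out = weights
    return np.maximum(x @ w_in, 0.0) @ w_out      # ReLU stands in for the real activation

def moe_layer(x):
    scores = x @ router                           # routing logits for one token, shape (n_experts,)
    top = np.argsort(scores)[-top_k:]             # pick the top-k routed experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                            # softmax over the selected experts
    routed = sum(g * mlp(x, experts[i]) for g, i in zip(gate, top))
    return routed + mlp(x, shared)                # shared expert output is always added

x = rng.standard_normal(d_model)
print(moe_layer(x).shape)                         # (64,)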

I think this may be the lab's very first model; I see no other history from them and had never heard of them before. The model itself seems fairly okay, with similar smarts to other recent local models of this size, but I don't dare make strong claims about whether it is good or not, since my experience is purely anecdotal.


This PR has:

  1. The various _DOTS1 constants added across the codebase wherever a new architecture is expected.
  2. A DotsModel class added to convert_hf_to_gguf.py to convert the models.
  3. The chat template added to llama-chat.cpp so llama-server can use it, following the Hugging Face transformers code.

The extent of my testing is that I've checked the model doesn't break into gibberish even on long contexts, and that the chat template is applied correctly (based on this testing, the rednote team fixed their HF safetensors tokenization config files, which had the wrong EOS token).

Some examples of prompting here: #14044 (comment)

I was planning to do better verification before opening this PR than "prompt it and check it doesn't respond in gibberish", but did not have the time. I'm about to travel for two weeks and will lose access to the Mac Studio I usually develop on, so I might have to ask someone to run perplexity tests or double-check the computation graph (or maybe I'll later rent a big-memory server and do that checking myself). I think the computation graph is likely correct, or almost correct, computation-wise, because it hasn't visibly broken even on long prompts (the longest I've tried is about ~27k context). Comparing against the HF implementation would be nice too; I did that for GLM-4 when it came out recently. But I can do this verification myself later on.

The conversion code (convert_hf_to_gguf.py) and the computation graph code (llama-model.cpp) were created by following the Qwen3 and Deepseek code, since the model architecture uses parts from them almost as-is. IMO the parts that could use more scrutiny in this PR are llm_build_dots1 and the case LLM_ARCH_DOTS1: block that decides which tensors to load, plus checking that I didn't forget anything that is usually added with a new architecture.

I also wanted to run it under valgrind or some other leak detection in case the graph code is leaking something (not sure it would visibly warn regardless; at least I didn't get compiler warnings about unused variables).

The transformers reference code from the rednote team is still in a PR, so this model is not yet part of the transformers Python library: huggingface/transformers#38143 (it is still open as of opening this PR).

@Noeda (Contributor, Author) commented Jun 11, 2025

Forgot to mention: @ddh0 made some quants, although I think right now you should run these with --override-kv tokenizer.ggml.eos_token_id=int:151649, because when I checked the metadata they had the wrong EOS token (I presume the source safetensor files predate the upstream team's EOS token fix on the Hugging Face side):

https://huggingface.co/ddh0/dots.llm1.inst-GGUF-Q4_0-EXPERIMENTAL

github-actions bot added the python (python script changes) label on Jun 11, 2025

This commit adds support for "dots.llm1" (I decided to shorten it to
dots1 or DOTS1 in the code generally) architecture.

The only models that exist as of the writing of this commit that follow this
architecture are "dots.llm1.inst" and "dots.llm1.base", from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

---

Parts in this commit:

Adding various "_DOTS1" constants around the codebase where a new
architecture is expected.

DotsModel in convert_hf_to_gguf.py to be used on Dots1ForCausalLM, to
convert the model to .ggufs. It was made by following the Qwen and
DeepseekV2 converters (mostly the Deepseek one was relevant).

I added the graph code and architecture code in llama-model.cpp; it too
was made by following the qwen3 and deepseek codepaths, with some trial
and error until coherent text came out.

I added detection for the dots chat template so that it can pick it up.

As of writing this (10 June 2025) I have not had the opportunity to do
more thorough testing than "prompt it and check whether it responds with
gibberish".
@Noeda force-pushed the dots1_squished_squashy branch from 1c1517b to 16dc0f4 on June 11, 2025 06:16
@Noeda (Contributor, Author) commented Jun 11, 2025

Force-pushed a tiny fix for a linter error ^

@jacekpoplawski commented Jun 11, 2025

I was able to run it:

~/git/llama.cpp/build_2025.06.11_dots$ ./bin/llama-cli -ngl 50 -m /mnt/models3/dots.llm1.inst-Q4_0.gguf --override-kv tokenizer.ggml.eos_token_id=int:151649 -p "who are you?" 2>/dev/null
You are a helpful assistant.who are you?I'm dots, your AI assistant created by rednote-hilab! 🌟I'm here to help you with all kinds of questions—whether you need information, advice, or just someone to chat with. I can analyze documents, summarize text, explain concepts, and even brainstorm ideas. How can I assist you today? 😊
~/git/llama.cpp/build_2025.06.11_dots$ ./bin/llama-cli -ngl 50 -m /mnt/models3/dots.llm1.inst-Q4_0.gguf --override-kv tokenizer.ggml.eos_token_id=int:151649 -p "list 10 AI companies" 2>/dev/null
You are a helpful assistant.list 10 AI companiesHere’s a list of **10 notable AI companies** (as of mid-2**2024**), spanning both well-established giants and innovative startups:

### **1. Big Tech (AI Leaders)**
1. **Google (Alphabet)** – DeepMind, Google AI, TensorFlow
2. **Microsoft** – Azure AI, Copilot, OpenAI partnership
3. **Meta (Facebook)** – FAIR, Llama models, AI research
4. **Amazon** – AWS AI/ML, Alexa, Bedrock (foundation models)
5. **Apple** – Core ML, Siri advancements, AI in AR/VR

### **2. OpenAI & AI Pioneers**
6. **OpenAI** – ChatGPT, GPT-4, DALL·E
7. **Anthropic** – Claude AI, safety-focused AI

### **3. AI Infrastructure & Tools**
8. **NVIDIA** – GPUs for AI, Omniverse, DGX systems
9. **Hugging Face** – Leader in open-source ML models (Transformers library)

### **4. Emerging/Vertical AI Startups**
10. **Runway** – GenAI for video/creative tools (used in Hollywood)

### **Honorable Mentions:**
- **Tesla** (Autopilot, Dojo supercomputing)
- **DeepMind** (separate from Google, but now integrated)
- **Cohere** (enterprise NLP)
- **Inflection AI** (Pi chatbot)

Would you like a focus on a specific niche (e.g., healthcare AI, autonomous systems)?
load_tensors: offloaded 50/63 layers to GPU
load_tensors:        CUDA0 model buffer size = 21158.12 MiB
load_tensors:        CUDA1 model buffer size = 21158.12 MiB
load_tensors:        CUDA2 model buffer size =  9956.76 MiB
load_tensors:        CUDA3 model buffer size =  9956.76 MiB
load_tensors:   CPU_Mapped model buffer size = 14620.13 MiB

(...)

llama_perf_sampler_print:    sampling time =      25.94 ms /   306 runs   (    0.08 ms per token, 11796.45 tokens per second)
llama_perf_context_print:        load time =   17196.71 ms
llama_perf_context_print: prompt eval time =     521.42 ms /    13 tokens (   40.11 ms per token,    24.93 tokens per second)
llama_perf_context_print:        eval time =   17150.21 ms /   292 runs   (   58.73 ms per token,    17.03 tokens per second)
llama_perf_context_print:       total time =   18636.09 ms /   305 tokens

@@ -5262,6 +5262,108 @@ def prepare_tensors(self):
raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register("Dots1ForCausalLM")
class DotsModel(TextModel):
Comment on lines +186 to +190
} else if (tmpl_contains("<|userprompt|>") &&
tmpl_contains("<|endofuserprompt|>") &&
tmpl_contains("<|response|>") &&
tmpl_contains("<|endofresponse|>")) {
return LLM_CHAT_TEMPLATE_DOTS1;

Suggested change:
-    } else if (tmpl_contains("<|userprompt|>") &&
-               tmpl_contains("<|endofuserprompt|>") &&
-               tmpl_contains("<|response|>") &&
-               tmpl_contains("<|endofresponse|>")) {
-        return LLM_CHAT_TEMPLATE_DOTS1;
+    } else if (tmpl_contains("<|endofuserprompt|>")) {
+        return LLM_CHAT_TEMPLATE_DOTS1;

We don't need to check exhaustively, since this is the only model that uses the <|endofuserprompt|> marker.

ss << "<|system|>" << message->content << "<|endofsystem|>";
} else if (role == "user") {
ss << "<|userprompt|>" << message->content << "<|endofuserprompt|>";
} else if (role == "assistant") {

Suggested change:
-    } else if (role == "assistant") {
+    } else {

To make it consistent with the other code blocks.
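For reference, here is a minimal standalone Python sketch of how a conversation renders with the markers shown in the snippets above. It mirrors the llama-chat.cpp logic only informally; appending a trailing <|response|> to prompt the assistant turn is my assumption here, not something verified against the actual template.

# Informal sketch of the dots1 chat format using the markers from the snippet
# above. The trailing "<|response|>" for the assistant turn is an assumption.
def render_dots1(messages, add_assistant_prefix=True):
    out = []
    for m in messages:
        if m["role"] == "system":
            out.append("<|system|>" + m["content"] + "<|endofsystem|>")
        elif m["role"] == "user":
            out.append("<|userprompt|>" + m["content"] + "<|endofuserprompt|>")
        else:  # assistant
            out.append("<|response|>" + m["content"] + "<|endofresponse|>")
    if add_assistant_prefix:
        out.append("<|response|>")
    return "".join(out)

print(render_dots1([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "who are you?"},
]))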

Comment on lines +2129 to +2132
MODEL_ARCH.DOTS1: [
MODEL_TENSOR.ROPE_FREQS,
MODEL_TENSOR.ATTN_ROT_EMBD,
],

I think this is not necessary

MODEL_TENSOR.ATTN_K_NORM,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.ATTN_ROT_EMBD,

Suggested change:
-    MODEL_TENSOR.ATTN_ROT_EMBD,

@DocShotgun (Contributor) commented Jun 12, 2025

Doing some local testing at the moment on ddh0's q4_0 quant. Text seems coherent so far, and I'm getting decent speed on an RTX PRO 6000 96 GB with 32k ctx allocated, q8_0 cache, and flash attention:

load_tensors:        CUDA0 model buffer size = 76515.78 MiB
load_tensors:   CPU_Mapped model buffer size =   334.12 MiB
...
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size = 16864.00 MiB
llama_kv_cache_unified: size = 16864.00 MiB ( 32768 cells,  62 layers,  1 seqs), K (q8_0): 8432.00 MiB, V (q8_0): 8432.00 MiB
llama_context:      CUDA0 compute buffer size =   321.00 MiB
llama_context:  CUDA_Host compute buffer size =    72.01 MiB
...
prompt eval time =     241.08 ms /   416 tokens (    0.58 ms per token,  1725.58 tokens per second)
       eval time =     484.00 ms /    38 tokens (   12.74 ms per token,    78.51 tokens per second)
      total time =     725.08 ms /   454 tokens

I noticed the occasional Chinese character appearing mid-text, or the occasional typo/nonsense word, with sampling settings of temp 1 and min-p 0.1. I'm not sure whether to attribute this to the model itself, the q4_0 quantization, the q8_0 cache, or the arch implementation. For example, the model invented the word "smirpsilon" (tokenized as sm + ir + psilon), and the logprobs look very strange at that position:
[screenshot: logprobs at the "smirpsilon" position]
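For context on those sampler settings: min-p keeps only tokens whose probability is at least min_p times that of the most likely token, so a flat, "baffled" distribution like the one in the screenshot leaves many implausible candidates in play. A rough standalone sketch of the idea, not llama.cpp's actual sampler code:

import numpy as np

def min_p_filter(logits, min_p=0.1, temp=1.0):
    # Keep tokens with prob >= min_p * max_prob, then renormalize.
    z = logits / temp
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# A flat ("baffled") distribution keeps almost every candidate...
print(int((min_p_filter(np.zeros(50)) > 0).sum()))    # 50
# ...while a confident distribution keeps only the plausible few.
peaked = np.array([5.0, 4.0, 3.5] + [0.0] * 47)
print(int((min_p_filter(peaked) > 0).sum()))          # 3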

@jukofyork (Collaborator) commented Jun 12, 2025

> Doing some local testing at the moment on ddh0's q4_0 quant. [...] the logprobs look very strange at that position.

That does look pretty odd - when qwen-2 and other Chinese models do this, the Chinese word they insert usually makes sense when you translate it, but this just looks garbled.

It could be an overflow, maybe? IIRC, the qwen-2 architecture suffered really badly from overflows, both here in llama.cpp and for people trying to generate exllamav2 quants (usually the activations in the last couple of layers would grow larger than the range of FP16).
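For reference, FP16 can represent values only up to about 65504, so an activation past that collapses to infinity and then poisons everything downstream. A tiny numpy illustration of that failure mode (not a diagnosis of this model):

import numpy as np

print(np.finfo(np.float16).max)               # 65504.0, the largest finite FP16 value
acts = np.array([60000.0, 66000.0], dtype=np.float32)
print(acts.astype(np.float16))                # [60000.    inf] -- the larger value overflows
# Once an activation is inf, a softmax's max-subtraction produces inf - inf = nan,
# and NaN logits make the sampler's token choices effectively random/garbled.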

@DocShotgun (Contributor) commented:
It would be interesting to see if the vllm/transformers implementation has any issues like this. The logprobs make it look like the model is absolutely baffled at that token position - as none of the options that show up there are sane for what should follow "smir" lol.

@jukofyork (Collaborator) commented Jun 12, 2025

@DocShotgun I'm jealous of the RTX PRO 6000 96gb!

I had a Max-Q version on order since March, but after Scan delayed it for the 4th time:

https://www.scan.co.uk/shop/computer-hardware/gpu-nvidia-workstation/nvidia-workstation-visualisation-graphics-cards

I cancelled it, because it will be sod's law that the Max-Q OEM will be the very last one they get :/

(just noticed the dates have all moved again to mid/late July now!)
