llama-model : add dots.llm1 architecture support (#14044) #14118
Conversation
Forgot to mention, @ddh0 made some quants, although I think right now you should run these with https://huggingface.co/ddh0/dots.llm1.inst-GGUF-Q4_0-EXPERIMENTAL
This commit adds support for the "dots.llm1" architecture (I decided to shorten it to dots1 or DOTS1 in the code generally). The only models that exist as of writing of this commit that follow this architecture are "dots.llm1.inst" and "dots.llm1.base" from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst
* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as seen here: https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

Parts in this commit:

* Various "_DOTS1" constants around the codebase where a new architecture is expected.
* DotsModel in convert_hf_to_gguf.py, used for Dots1ForCausalLM, to convert the model to .ggufs. It was made by following the Qwen and DeepseekV2 converters (mostly the Deepseek one was relevant).
* Graph code and architecture code in llama-model.cpp, also made by following the qwen3 and deepseek codepaths and doing some trial and error until coherent text came out.
* Detection for the dots chat template so that it gets picked up.

As of writing this (10 June 2025) I did not have the opportunity to do more thorough testing than "prompt it and check whether it responds with gibberish".
Force-pushed 1c1517b to 16dc0f4: a tiny fix for a linter error.
I was able to run it.
@@ -5262,6 +5262,108 @@ def prepare_tensors(self):
                raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register("Dots1ForCausalLM")
class DotsModel(TextModel):
There is a simpler way to write the conversion code: https://github.com/ngxson/llama.cpp/blob/b469d9b86e148c4d7538ad27f817cf83bc2fb339/convert_hf_to_gguf.py#L3070-L3087
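For readers who don't have that file open: the simpler style presumably amounts to subclassing an existing MoE converter rather than writing DotsModel from scratch. A rough sketch of that shape, meant to live inside convert_hf_to_gguf.py where ModelBase, Qwen2MoeModel and gguf are already defined; the config keys used below (first_k_dense_replace, n_shared_experts, routed_scaling_factor) are assumptions borrowed from Deepseek-style configs, not verified against the linked code:

```python
# Sketch only, not the code behind the link above. Assumes Qwen2MoeModel's
# conversion logic (expert tensor stacking, generic hparams) can be reused,
# and that dots.llm1's config.json uses Deepseek-style MoE keys (assumption).
@ModelBase.register("Dots1ForCausalLM")
class Dots1Model(Qwen2MoeModel):
    model_arch = gguf.MODEL_ARCH.DOTS1

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        hp = self.hparams
        # Deepseek-style MoE hyperparameters (key names are assumptions)
        self.gguf_writer.add_leading_dense_block_count(hp["first_k_dense_replace"])
        self.gguf_writer.add_expert_shared_count(hp["n_shared_experts"])
        self.gguf_writer.add_expert_weights_scale(hp["routed_scaling_factor"])
```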
    } else if (tmpl_contains("<|userprompt|>") &&
               tmpl_contains("<|endofuserprompt|>") &&
               tmpl_contains("<|response|>") &&
               tmpl_contains("<|endofresponse|>")) {
        return LLM_CHAT_TEMPLATE_DOTS1;
Suggested change:
-    } else if (tmpl_contains("<|userprompt|>") &&
-               tmpl_contains("<|endofuserprompt|>") &&
-               tmpl_contains("<|response|>") &&
-               tmpl_contains("<|endofresponse|>")) {
-        return LLM_CHAT_TEMPLATE_DOTS1;
+    } else if (tmpl_contains("<|endofuserprompt|>")) {
+        return LLM_CHAT_TEMPLATE_DOTS1;
We don't need to check exhaustively, since this is the only model that uses the <|endofuserprompt|> marker.
            ss << "<|system|>" << message->content << "<|endofsystem|>";
        } else if (role == "user") {
            ss << "<|userprompt|>" << message->content << "<|endofuserprompt|>";
        } else if (role == "assistant") {
Suggested change:
-        } else if (role == "assistant") {
+        } else {
To make it the same as other code blocks
    MODEL_ARCH.DOTS1: [
        MODEL_TENSOR.ROPE_FREQS,
        MODEL_TENSOR.ATTN_ROT_EMBD,
    ],
I think this is not necessary
        MODEL_TENSOR.ATTN_K_NORM,
        MODEL_TENSOR.ATTN_V,
        MODEL_TENSOR.ATTN_OUT,
        MODEL_TENSOR.ATTN_ROT_EMBD,
Suggested change (remove the line):
-        MODEL_TENSOR.ATTN_ROT_EMBD,
It would be interesting to see if the vllm/transformers implementation has any issues like this. The logprobs make it look like the model is absolutely baffled at that token position, as none of the options that show up there are sane for what should follow "smir" lol.
@DocShotgun I'm jealous of the RTX PRO 6000 96GB! I had a I cancelled it because it will be sod's law the (just noticed the dates have all moved again to mid/late July now!)
Add support for the "dots.llm1" architecture. I decided to shorten that to dots1/DOTS1 in the code.
Tracking issue: #14044
These are the models that currently use this architecture:
https://huggingface.co/rednote-hilab/dots.llm1.inst
https://huggingface.co/rednote-hilab/dots.llm1.base
There is also a paper: https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf (the link can also be found on their Hugging Face page).
And RedNote appears to have a GitHub page for this model as well: https://github.com/rednote-hilab/dots.llm1
The architecture has DeepseekV2+ MoE code but Qwen3 attention, kind of a mix:
https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
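To make the Deepseek-style MoE half of that mix concrete: the FFN block routes each token to a small number of experts and additionally runs "shared" experts that are always active. A toy numpy sketch of that idea (not the actual implementation; expert shapes, counts, and the gating/renormalization details are illustrative placeholders, and dots.llm1's real gating may differ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_ffn(x, gate_w, experts, shared_experts, top_k=2):
    # x: (n_tokens, d). Route each token to its top_k experts, mix their
    # outputs with renormalized gate scores, then always add shared experts.
    scores = softmax(x @ gate_w)                    # (n_tokens, n_experts)
    top = np.argsort(-scores, axis=-1)[:, :top_k]   # chosen expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = scores[t, top[t]]
        w = w / w.sum()                             # renormalize over selection
        for k, e in enumerate(top[t]):
            out[t] += w[k] * experts[e](x[t])
    for s in shared_experts:                        # shared experts are always on
        out += s(x)
    return out

# Tiny usage example with random linear "experts".
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)) * 0.1: v @ W for _ in range(n_experts)]
shared  = [lambda v, W=rng.normal(size=(d, d)) * 0.1: v @ W]
x = rng.normal(size=(3, d))
print(moe_ffn(x, rng.normal(size=(d, n_experts)), experts, shared).shape)  # (3, 8)
```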
The model is a 32k-context MoE model with 142B total parameters and 14B activated parameters. It has its own new chat template and special tokens.
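Since the chat template and its tokens are new, here is a small sketch of how a conversation would render with the role markers shown in the llama-chat.cpp hunk above; the handling of the assistant turn and the trailing generation prompt are my assumptions, not copied from the PR:

```python
# Toy renderer mirroring the role markers from this PR's llama-chat.cpp code.
# The placement of <|response|>/<|endofresponse|> around assistant turns and
# the generation prompt are assumptions here.
def render_dots1(messages, add_generation_prompt=True):
    out = ""
    for m in messages:
        if m["role"] == "system":
            out += "<|system|>" + m["content"] + "<|endofsystem|>"
        elif m["role"] == "user":
            out += "<|userprompt|>" + m["content"] + "<|endofuserprompt|>"
        else:  # assistant
            out += "<|response|>" + m["content"] + "<|endofresponse|>"
    if add_generation_prompt:
        out += "<|response|>"  # the model continues from here
    return out

print(render_dots1([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]))
```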
I think this may be the lab's very first model; I see no other history from them and I had never heard of them before. The model itself seems fairly okay, with similar smarts to other recent local models of this size, but I don't dare make strong claims about whether it is good or not, since my experience is purely anecdotal.
This PR has:

* _DOTS1 constants across wherever new architecture code is added.
* DotsModel introduced to convert_hf_to_gguf.py to convert the models.

The extent of my testing is that I've been checking that the model doesn't break into gibberish even on long contexts, and that the chat templates are used correctly (based on this testing, the rednote team fixed their HF safetensors tokenization config files, which had the wrong EOS token).
Some examples of prompting here: #14044 (comment)
I was planning to do better verification before opening this PR than "prompt it and check it doesn't respond in gibberish", but did not have the time. I'm about to travel away for two weeks and I'm losing access to the Mac Studio I usually develop on, so I might have to ask someone to run perplexity tests or do some double-checking on the computation graph. (That, or maybe later I'll try renting a big-memory server to do that checking myself.) I think the computation graph is likely correct computation-wise, or almost correct, because it hasn't visibly broken even for long prompts (the longest I've tried is about ~27k context length). Comparing with the HF implementation would be nice too; I've done that before for GLM-4 when it came out recently. But I can do this verification myself later on.
The conversion code (convert_hf_to_gguf.py) and the computation graph code (llama-model.cpp) were created by following the Qwen3 and Deepseek code, since the model architecture uses parts from them almost as-is. IMO the parts that could use more scrutiny in this PR are llm_build_dots1 and the case LLM_ARCH_DOTS1: block where it reads which tensors to get, plus checking that I didn't forget some parts that are usually added with a new architecture.

I also wanted to check running under valgrind or some other leak detection in case the graph code is leaking something (not sure if it would visibly warn regardless; I didn't get compiler warnings about unused variables at least).
The transformers code from the rednote team that I used as reference is still in a PR, so this model is not yet part of the transformers Python library: huggingface/transformers#38143 (it's still open as of me opening this PR here).