Hunyuan tokenizer #26
Conversation
@coderabbitai pause
✅ Actions performed: Reviews paused.
if (arch == LLM_ARCH_HUNYUAN_MOE) {
    weights = ggml_reshape_2d(ctx0, weights, n_expert_used, n_tokens);    // [n_expert_used, n_tokens]
    weights = ggml_div(ctx0, weights, ggml_sum_rows(ctx0, weights));      // normalize per token: sum_rows is [1, n_tokens], result stays [n_expert_used, n_tokens]
    weights = ggml_reshape_3d(ctx0, weights, 1, n_expert_used, n_tokens); // [1, n_expert_used, n_tokens]
    cb(weights, "ffn_moe_weights_scaled", il);
}
haha good catch, I didn't notice that I reinvented the norm_w code block
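
For readers following along: the branch above simply renormalizes each token's selected expert weights so they sum to 1, which is what the existing norm_w option of the MoE FFN builder already does. A minimal NumPy sketch of the same idea (illustrative names and shapes, not the ggml API):

import numpy as np

def normalize_expert_weights(weights: np.ndarray) -> np.ndarray:
    # Rescale each token's expert weights so they sum to 1,
    # mirroring ggml_div(weights, ggml_sum_rows(weights)) above.
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: 2 tokens, 3 selected experts each (made-up numbers).
w = np.array([[0.5, 0.3, 0.2],
              [2.0, 1.0, 1.0]])
print(normalize_expert_weights(w))  # each row now sums to 1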
gguf-py/gguf/gguf_writer.py
Outdated
def add_qk_norm(self, value: bool) -> None:
    self.add_bool(Keys.Attention.QK_NORM.format(arch=self.arch), value)
This is redundant because we can just check the existence of the k_norm.weight tensor. I will remove this after the PR is merged.
removed
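
For context on why the key is redundant: the presence of the norm tensors already signals QK-norm, so a loader can check for them instead of reading metadata. A rough sketch with the gguf-py reader (the tensor name suffix is an assumption based on the usual attn_k_norm naming; the real check in llama.cpp happens in C++ at load time):

from gguf import GGUFReader  # gguf-py package from this repo

def model_uses_qk_norm(gguf_path: str) -> bool:
    # Heuristic: QK-norm is enabled iff the k_norm tensors exist in the file.
    reader = GGUFReader(gguf_path)
    tensor_names = {t.name for t in reader.tensors}
    # assumed llama.cpp-style names, e.g. "blk.0.attn_k_norm.weight"
    return any(name.endswith("attn_k_norm.weight") for name in tensor_names)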
gguf-py/gguf/constants.py
Outdated
@@ -148,6 +148,7 @@ class Attention:
    VALUE_LENGTH_MLA       = "{arch}.attention.value_length_mla"
    SHARED_KV_LAYERS       = "{arch}.attention.shared_kv_layers"
    SLIDING_WINDOW_PATTERN = "{arch}.attention.sliding_window_pattern"
    QK_NORM                = "{arch}.attention.qk_norm"
rm this
removed
convert_hf_to_gguf.py
Outdated
assert all(n == moe_shared_expert[0] for n in moe_shared_expert)
self.gguf_writer.add_expert_shared_count(moe_shared_expert[0])

self.gguf_writer.add_qk_norm(hparams.get("use_qk_norm", True))
rm this
removed
convert_hf_to_gguf.py
Outdated
rope_scaling = hparams.get("rope_scaling", {})
if rope_scaling.get("type") == "dynamic":
    self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
    self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
maybe add an else: raise Error here
This one I'm not really sure about. It effectively does nothing right now because the factor is 1 in the config. I would need to dig deeper to see if their "dynamic" really is YARN, and I really want to extend it to their claimed 256k. That will require more research after the logits line up.
As for the raise Error, it seems other implementations just let it continue if rope is not defined, so I'll leave it as is for now.
I did more research and refined this to generically handle their NTK-aware alpha scaling.
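
For reference, NTK-aware alpha scaling is usually folded into the RoPE base frequency rather than applied as a runtime scaling factor; the commonly used rule of thumb is base' = base * alpha^(d / (d - 2)), where d is the head dimension. A hedged sketch of that conversion (whether the converter uses exactly this form is an assumption; the numbers are illustrative, not taken from the Hunyuan config):

def ntk_alpha_to_rope_base(rope_theta: float, alpha: float, head_dim: int) -> float:
    # Community NTK-aware rule of thumb: fold alpha into the RoPE base.
    return rope_theta * alpha ** (head_dim / (head_dim - 2))

print(ntk_alpha_to_rope_base(10000.0, 1000.0, 128))  # illustrative values only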
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=False)
special_vocab.add_to_gguf(self.gguf_writer)
# FIX for BOS token: manually set the correct BOS token ID.
self.gguf_writer.add_bos_token_id(127959)  # <|bos|>
We can overwrite this in hparams["bos_token_id"]
I'm not sure if setting it in hparams would override the id that gguf.SpecialVocab reads from the config. I've left this as is for now, but that can be tested later.
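
If someone does test the hparams route later, the two options look roughly like this in the conversion code (option 2 is the untested one, and whether SpecialVocab actually consults hparams is exactly the open question; the id 127959 comes from the snippet above):

# Option 1 (what the PR does): force the id on the writer after SpecialVocab runs.
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=False)
special_vocab.add_to_gguf(self.gguf_writer)
self.gguf_writer.add_bos_token_id(127959)  # <|bos|>

# Option 2 (untested assumption): patch hparams before vocab handling and let the
# normal code path pick it up -- needs verification that the vocab code reads
# hparams rather than only the tokenizer config.
self.hparams["bos_token_id"] = 127959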
Alright, I cleaned it up a bit, and it can be merged. Future considerations:
I just tested this PR and re-converted the safetensors to bf16, and it seems to be working now in very limited testing. Commands and logs:

## convert
python \
convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/ \
/mnt/raid/models/tencent/Hunyuan-A13B-Instruct/
## run server
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Hunyuan-A13B-Instruct-bf16 \
-fa \
-ctk q8_0 -ctv q8_0 \
-c 8192 \
--jinja \
--temp 0.6 \
--presence-penalty 0.7 \
--min-p 0.1 \
-b 1024 \
-ts 48,48 \
-ngl 18 \
--threads 24 \
--host 127.0.0.1 \
--port 8080