Add Qwen3 Moe #2260
Conversation
Thanks! Took an initial pass. Let's try to clean up the config and state passing: avoid passing an index down the layer stack, and avoid data structures that apply to the whole layer stack.
/gemini review
Code Review
This pull request adds support for the Qwen3 MoE model. The implementation looks solid, covering the backbone, attention, decoder, tokenizer, and conversion scripts. I've identified several high-severity issues related to incomplete `get_config` methods in various new layers, which will prevent model serialization from working correctly. There are also some medium-severity issues like unused parameters and a critical issue in the checkpoint conversion test script where an incorrect preprocessor is used. I've provided suggestions to fix these issues. Once addressed, the PR should be in great shape.
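For context, Keras can only serialize a layer whose `get_config` returns every constructor argument. A minimal sketch of the pattern, using a hypothetical layer rather than any of the PR's actual classes:

```python
import keras


class HypotheticalMoeLayer(keras.layers.Layer):
    """Illustrative layer showing a complete `get_config`."""

    def __init__(self, num_experts, top_k=2, dropout=0.0, **kwargs):
        super().__init__(**kwargs)
        self.num_experts = num_experts
        self.top_k = top_k
        self.dropout = dropout

    def get_config(self):
        # Every __init__ argument must appear here, otherwise
        # keras.saving.load_model() cannot rebuild the layer.
        config = super().get_config()
        config.update(
            {
                "num_experts": self.num_experts,
                "top_k": self.top_k,
                "dropout": self.dropout,
            }
        )
        return config
```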
/gemini review
Code Review
This pull request introduces support for the Qwen3 MoE model, including its backbone, causal language model, tokenizer, and conversion scripts.
I've identified a few issues that need attention:
- A **critical** bug in `Qwen3MoeAttention.get_config()` that will cause an `AttributeError`.
- A couple of **high** severity issues where invalid parameters are used, which will lead to `TypeError` exceptions.
- Some **medium** severity issues in the test scripts and opportunities for code cleanup.
Thanks, added a few small comments.
Can you provide a Colab demo with numerics verification, example usage code, and matching generated outputs?
```python
init_kwargs=self.init_kwargs,
input_data=self.input_data,
expected_output_shape=(2, 7, 16),
run_quantization_check=False,
```
Can you enable this test?
Also add the missing test files for `causal_lm_test` and `causal_lm_preprocessor_test`.
Any updates here?
/gemini review
Code Review
This pull request introduces the Qwen3-MoE model, including its backbone, causal language model, preprocessor, tokenizer, and associated conversion scripts. The implementation is comprehensive and well-structured. My review focuses on improving documentation to align with the project's style guide, addressing a potential bug related to a mutable default argument, and clarifying a preset mapping in the conversion script to prevent confusion.
```python
tie_word_embeddings=False,
sliding_window_size=32768,
router_aux_loss_coefficient=0.001,
mlp_only_layers=[],
```
The `mlp_only_layers` parameter uses a mutable list `[]` as its default value. This is a dangerous practice, as the default list can be modified across different instantiations of the class, leading to unexpected behavior. It's better to use `None` as the default and initialize an empty list inside the `__init__` method if the argument is `None`.
You should add the following check at the beginning of the `__init__` method body (line 119):
```python
if mlp_only_layers is None:
    mlp_only_layers = []
```
```diff
-mlp_only_layers=[],
+mlp_only_layers=None,
```
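For anyone unfamiliar with the pitfall, here is a small standalone demonstration (hypothetical classes, not the PR's code) showing that a mutable default is shared across instances, while the `None` pattern is not:

```python
class Config:
    def __init__(self, mlp_only_layers=[]):  # one list shared by all defaulted instances
        self.mlp_only_layers = mlp_only_layers


a = Config()
b = Config()
a.mlp_only_layers.append(0)
print(b.mlp_only_layers)  # [0] -- b sees a's mutation


class SafeConfig:
    def __init__(self, mlp_only_layers=None):
        # Each instance gets its own fresh list.
        self.mlp_only_layers = [] if mlp_only_layers is None else mlp_only_layers


c = SafeConfig()
d = SafeConfig()
c.mlp_only_layers.append(0)
print(d.mlp_only_layers)  # []
```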
```python
    Args:
        num_query_heads: Number of query heads.
        num_key_value_heads: Number of key/value heads (for GQA).
        rope_max_wavelength: Maximum wavelength for RoPE (Rotary Position
            Embedding).
        rope_scaling_factor: Scaling factor for RoPE, used for extending
            context length.
        kernel_initializer: Initializer for the kernel weights.
        dropout: Dropout rate for attention weights.
        sliding_window_size: Size of the sliding window for attention.
        **kwargs: Additional keyword arguments to pass to the Layer.
    """
```
The docstring for `__init__` is missing type hints for its arguments and is also missing documentation for the `head_dim` and `layer_norm_epsilon` parameters. According to the style guide, all public methods should have a complete docstring with type hints in the `Args` section.
A similar issue exists for the `call` method's docstring.
```python
    Args:
        num_query_heads: int. Number of query heads.
        num_key_value_heads: int. Number of key/value heads (for GQA).
        head_dim: int. The dimension of each attention head.
        rope_max_wavelength: int. Maximum wavelength for RoPE (Rotary Position
            Embedding).
        rope_scaling_factor: float. Scaling factor for RoPE, used for extending
            context length.
        kernel_initializer: Initializer for the kernel weights.
        dropout: float. Dropout rate for attention weights.
        layer_norm_epsilon: float. The epsilon value for layer normalization.
        sliding_window_size: int. Size of the sliding window for attention.
        **kwargs: Additional keyword arguments to pass to the Layer.
```
```python
        return out, router_logits


class Qwen3MoeTransformerDecoder(keras.layers.Layer):
```
The class `Qwen3MoeTransformerDecoder` is missing a docstring. The style guide requires all public classes to have a complete Google-style docstring, including a summary, an `Args` section, and examples.
This also applies to other new public classes and functions in this file (`compute_load_balancing_loss`, `Qwen3MoeMLP`, `Qwen3MoeExperts`, `Qwen3SparseMoeBlock`).
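For illustration, a Google-style class docstring along the lines the style guide asks for might look like the sketch below; the argument names are placeholders, since the decoder's actual signature isn't shown in this thread:

```python
import keras


class Qwen3MoeTransformerDecoder(keras.layers.Layer):
    """A Qwen3 MoE transformer decoder layer.

    Applies self-attention followed by a sparse mixture-of-experts
    feedforward block.

    Args:
        intermediate_dim: int. Hidden size of the dense feedforward path.
        num_experts: int. Total number of experts in the MoE block.
        top_k: int. Number of experts each token is routed to.
        **kwargs: Additional keyword arguments passed to `keras.layers.Layer`.
    """
```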
"""A normalization layer for Qwen that implements RMS normalization.""" | ||
|
The class docstring is missing an `Args` section to document the `__init__` parameters (`head_dim`, `epsilon`). The style guide requires all public classes to have a complete docstring.
"""A normalization layer for Qwen that implements RMS normalization.
Args:
head_dim: int. The dimension of each attention head, used for per-head
normalization. Defaults to `None`.
epsilon: float. A small float added to variance to avoid dividing by
zero. Defaults to `1e-6`.
"""
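For context on what that docstring describes, RMS normalization rescales activations by their reciprocal root mean square. A minimal sketch, an illustration only and not the PR's implementation (the per-head variant would additionally reshape by `head_dim`, which is omitted here):

```python
import keras
from keras import ops


class RMSNormSketch(keras.layers.Layer):
    """Minimal RMS normalization for illustration."""

    def __init__(self, epsilon=1e-6, **kwargs):
        super().__init__(**kwargs)
        self.epsilon = epsilon

    def build(self, input_shape):
        # One learnable scale per feature dimension.
        self.scale = self.add_weight(
            name="scale", shape=(input_shape[-1],), initializer="ones"
        )

    def call(self, x):
        # x / sqrt(mean(x^2) + epsilon), then rescale.
        variance = ops.mean(ops.square(x), axis=-1, keepdims=True)
        return x * ops.rsqrt(variance + self.epsilon) * self.scale
```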
```python
import keras_hub  # noqa: E402

PRESET_MAP = {
    "qwen3_moe_3b_en": "Qwen/Qwen3-30B-A3B",
```
There seems to be a mismatch in the `PRESET_MAP`. The preset name `"qwen3_moe_3b_en"` suggests a 3B parameter model, but it's mapped to the Hugging Face model `"Qwen/Qwen3-30B-A3B"`, which is a 30B parameter model. This could be confusing. Please verify whether this is intentional or a typo.
"qwen3_moe_3b_en": "Qwen/Qwen3-30B-A3B", | |
"qwen3_moe_30b_a3b_en": "Qwen/Qwen3-30B-A3B", |
Qwen3 MoE backbone output matching with atol 1e-3!
On generate output matching, we are doing okay here: the generated token distribution is close to the Hugging Face one. We saw a similar issue in the Qwen3 base models. Random seed used: 123.
Keras generated text - What is Keras? Keras is a deep learning framework that is used for building and training neural networks. It is written in Python and can run on top

```
Keras token output tensor([[ 3838,  374,  730, 9247,   30,  730, 9247,  374,  264, 5538,
          6832, 12626,  429,  374, 1483,  369, 4752,  323, 4862, 29728,
         14155,    13, 1084,  374, 5326,  304, 13027,  323,  646, 1598,
           389, 1909]], dtype=torch.int32)
HF Token outputs = tensor([[ 3838,  374,  730, 9247,   30, 3555,  374,  279, 6672, 1948,
           730, 9247,  323, 94986,   30, 3555,  525,  279, 22146,  315,
           730, 9247,  916, 94986,   30,  730, 9247,  374,  264, 1550,
         11591, 29728]])
```
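A rough sketch of the kind of backbone numerics check described above (how the hidden states are obtained from the two models is left out; only the comparison step is shown):

```python
import numpy as np


def check_backbone_match(keras_outputs, hf_outputs, atol=1e-3):
    """Compare final hidden states from the KerasHub and HF models.

    Both arguments are assumed to be NumPy arrays of identical shape,
    produced from the same tokenized input.
    """
    if not np.allclose(keras_outputs, hf_outputs, atol=atol):
        max_diff = np.max(np.abs(keras_outputs - hf_outputs))
        raise AssertionError(f"Backbone mismatch, max abs diff {max_diff:.2e}")
    print(f"Backbone outputs match within atol={atol}")
```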