Qwen3.5 MoE support #120
Open
howard0su wants to merge 5 commits into Luce-Org:main from
Conversation
5 issues found across 14 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/test/smoke_moe_forward.cpp">
<violation number="1" location="dflash/test/smoke_moe_forward.cpp:188">
P1: NaN/Inf only breaks the loop; the smoke test still exits 0, hiding forward-pass corruption.</violation>
</file>
<file name="dflash/src/qwen35moe_target_graph.cpp">
<violation number="1" location="dflash/src/qwen35moe_target_graph.cpp:30">
P2: Mutable global debug/timing state is updated from the forward path without synchronization, so concurrent inference can race and trigger UB.</violation>
<violation number="2" location="dflash/src/qwen35moe_target_graph.cpp:309">
P2: Graph A CUDA-debug error path returns without ggml_free(ctx_a), causing a context leak</violation>
</file>
<file name="dflash/src/gguf_target_loader.cpp">
<violation number="1" location="dflash/src/gguf_target_loader.cpp:681">
P2: MoE expert-source setup should validate tensor shapes/strides against the metadata before trusting `nb[2]` and expert counts.</violation>
</file>
<file name="dflash/test/test_dflash.cpp">
<violation number="1" location="dflash/test/test_dflash.cpp:3121">
P1: MoE path ignores the draft backend and computes the draft graph on the target backend, breaking split target/draft GPU setups.</violation>
</file>
- Add MoE expert weight management (moe_experts.h/cpp): MoeExpertSource for mmap layout, PinnedExperts for bulk H2D VRAM pinning
- Add fused mega-graph MoE forward pass (qwen35moe_target_graph.cpp): single CUDA graph per (n_tokens, n_pinned) tuple, all 40 layers + lm_head
- Add shared attention block builders (qwen35_blocks.h): full-attention, deltanet, and SwiGLU FFN used by both dense and MoE graph paths
- Extend GGUF loader for MoE architecture fields (n_experts, expert_ffn_dim)
- Add MoE generation harness (test_dflash_moe.cpp): budget=1 pure autoregressive, budget>1 speculative decode with DDTree support
- Extract shared test helpers (test_helpers.h): DDTree, causal mask, top-K extraction reused by both dense and MoE test paths
- No expert swapping — all weights pinned in VRAM (fits 22.5GB card)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the per-expert view+add loop (14 nodes/layer) with a single ggml_repeat_back operation that sums along the expert dimension. Reduces CUDA graph nodes from 3358 to 2838 (-520 nodes, -15.5%). Steady-state decode: 12ms → 10.5ms/token (83 → 86+ tok/s). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove unnecessary ggml_cont and reorder graph nodes to enable ggml-cuda's built-in topk_moe fusion pattern: softmax → reshape → argsort → view → get_rows → norm chain. Also enables mul_mat_id_glu fusion (gate+up+swiglu → 1 kernel).

Results:
- Fusion confirmed: 90+ tok/s with fusion vs 79 tok/s without
- Steady-state: ~9.8ms/token (was 10.5ms)
- CUDA graph nodes: 2798 (reduced from 2838)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The draft model (z-lab/Qwen3.6-35B-A3B-DFlash) was trained with YaRN rope scaling (factor=64, beta_fast=32, beta_slow=1) but we were using vanilla RoPE (ext_factor=0). This caused 81% of frequency dimensions to have incorrect position encodings, resulting in a 0% acceptance rate.

Fix: compute YaRN freq_factors per the HuggingFace config and pass them to ggml_rope_ext via a new optional rope_freq_factors field in DraftGraphInputs.

Results:
- Acceptance rate: 1/16 → 12.25/16 (76.6% avg, 100% on most steps)
- Spec-decode throughput: 5 tok/s → 24 tok/s
- Autoregressive baseline unchanged at 91 tok/s

Also removes debug instrumentation from previous investigation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Implements full MoE inference with expert weight swapping for the Qwen3.6-35B-A3B model (256 experts/layer, 8 active) on a single 2080 Ti. The model's expert weights (~18.6 GB) far exceed VRAM, so we use a two-graph-per-layer execution pattern with a dynamic LRU cache and optional layer pinning.
Key Changes
Performance (64 tokens, budget=6, 2080 Ti 22GB)
Usage