feat(dflash): support Qwen3.6-27B-DFlash draft (SWA layers) — 106 t/s on RTX 4090#94
feat(dflash): support Qwen3.6-27B-DFlash draft (SWA layers) — 106 t/s on RTX 4090#94Quitetall wants to merge 2 commits intoLuce-Org:mainfrom
Conversation
The z-lab/Qwen3.6-27B-DFlash draft model uses 4 sliding_attention + 1
full_attention layers (vs all full_attention in the 3.5 draft). This
caused the binary to silently misinterpret the model.
Changes:
- internal.h: add `is_swa` flag to DraftLayer, `swa_window` to DraftWeights
- safetensors_draft.cpp: parse config.json for `layer_types` and
`sliding_window`, set per-layer SWA flags. Prints diagnostic on load.
- qwen3_dflash_graph.cpp: plumb SWA mask through flash_attn_ext
(currently a no-op for short contexts where window > total_k,
which covers all typical DFlash use cases)
Results on RTX 4090 (Qwen3.6-27B Q4_K_M target):
3.5 draft (mismatch): 67.4 t/s, 3.76 tok/step, 23.5% acceptance
3.6 draft (matched): 106.1 t/s, 5.12 tok/step, 32.0% acceptance
^^^^^^^^
57% speedup from matched draft
Budget sweep (3.6 matched draft):
budget=6: 93.9 t/s, 4.27 tok/step
budget=10: 106.1 t/s, 5.12 tok/step ← sweet spot
budget=18: 102.7 t/s, 5.82 tok/step
budget=22: 103.5 t/s, 5.82 tok/step
Tested with: z-lab/Qwen3.6-27B-DFlash model.safetensors,
DFLASH27B_KV_K=q4_0, DFLASH27B_KV_V=q4_0, max-ctx=4096.
|
I noticed the llama.cpp qwen 3.6 draft implementation was complete, so I simply ported that over to lucebox with the help of claude opus. Code should be clean, reproducible. |
|
Note: This was benchmarked on Qwen-3.6-27B Heretic, not standard weights. Tok/s may be even higher on the standard model that the draft model was trained on. |
There was a problem hiding this comment.
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/safetensors_draft.cpp">
<violation number="1" location="dflash/src/safetensors_draft.cpp:442">
P2: Config path derivation fails for bare filenames, so sibling `config.json` is never found and SWA metadata silently falls back to non-SWA defaults.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
When `path` is a bare filename like "model.safetensors" (no directory
prefix), `find_last_of('/')` returns npos and the dir variable kept
the full filename, producing "model.safetensors/config.json".
Fix: fall back to "." (CWD) when no slash is found, so config.json
is looked up in the same directory the user invoked from.
Thanks for testing that version; I heard it may be better at coding, so was looking at testing it myself on 3090. |
howard0su
left a comment
There was a problem hiding this comment.
Please double check the code.
| // For now, we pass nullptr and let full attention run — the mask | ||
| // setup requires knowing absolute positions which are in `in.positions_k`. | ||
| // TODO: implement mask fill in the caller or use ggml_diag_mask_inf | ||
| attn_mask = nullptr; // fallback to full attention until mask fill is wired |
There was a problem hiding this comment.
allocate something then immediately null it. I don't understand how this helps the performance?
The current safetensors loader for DFlash drafters accepts BF16
weights and F32 norms only. In practice some drafters for
Qwen3.6-27B are published as FP16 safetensors, with F16 norms.
This change extends `safetensors_draft.cpp` to:
* Accept tensors with dtype `F16` in addition to `BF16`. The new
F16 path goes through the same staged code that BF16 already
uses; the BF16 fast path is unchanged.
* Convert F16 norms to F32 on load when the consumer (graph)
requires F32 norm tensors. Avoids a separate copy step for
callers that already deal with F16-norm safetensors files.
* Keep the existing strictness around shape and missing tensors.
Mismatched dtype or missing keys still fail with the same
honest error; no silent fallback.
This is a correctness / compatibility change, not a kernel perf
change: the goal is to load existing FP16 drafter checkpoints
without conversion, not to claim a speedup over BF16. Numerical
parity is the validation goal here.
Validation:
* Built and ran with `models/dflash-drafter-fp16/model.safetensors`
on RTX 6000 Ada (sm_89), Heretic Q4_K_M target. The drafter
loads, prefill and decode both proceed without failure, and
output ids are identical to the matched BF16 drafter on a
fixed seed for the same prompt set (parity check).
* Existing BF16 drafter checkpoints continue to load through the
untouched fast path.
Verification vs existing community PRs:
DECIDIR. PR Luce-Org#94 (open, "support Qwen3.6-27B-DFlash draft (SWA
layers)", Quitetall) also touches `safetensors_draft.cpp` for
config.json parsing of layer_types and window. The dtype branch
here is orthogonal to the SWA parsing. If Luce-Org#94 lands first this
commit can be rebased on top of the unified F16/BF16 path; if
this lands first, Luce-Org#94 can rebase its config.json hunks on top
without touching the dtype branch. Maintainers, happy to
coordinate ordering.
Author: Javier Pazo <xabicasa@gmail.com>
Summary
Adds support for
z-lab/Qwen3.6-27B-DFlashwhich uses sliding window attention layers. The 3.6 matched draft gives 57% speedup over the mismatched 3.5 draft.Results (RTX 4090, Qwen3.6-27B Q4_K_M target)
Budget sweep (3.6 matched):
What changed
internal.h: Addedis_swaper-layer flag andswa_windowtoDraftWeights.safetensors_draft.cpp: Reads siblingconfig.jsonforlayer_typesandsliding_window. Setsis_swa=truefor"sliding_attention"layers. Prints diagnostic:qwen3_dflash_graph.cpp: Plumbs SWA mask throughggml_flash_attn_ext. Currently a no-op for short contexts (window=2048 > typical DFlash ctx_len), but the flag is ready for when longer contexts are used.How the 3.6 draft differs from 3.5
Same tensor names, same shapes, same architecture otherwise. The key difference is the weights are trained with sliding window attention patterns, giving better acceptance on the Qwen3.6 target.
Test
Builds on #89 (conv_input_cache fix, already merged).
🤖 Generated with Claude Code