
feat(dflash): support Qwen3.6-27B-DFlash draft (SWA layers) — 106 t/s on RTX 4090 #94

Open
Quitetall wants to merge 2 commits into Luce-Org:main from Quitetall:feat/dflash-qwen36-swa-draft

Conversation

@Quitetall
Contributor

Summary

Adds support for z-lab/Qwen3.6-27B-DFlash, which uses sliding window attention (SWA) layers. The matched 3.6 draft gives a 57% speedup over the mismatched 3.5 draft (67.4 → 106.1 t/s).

Results (RTX 4090, Qwen3.6-27B Q4_K_M target)

| Draft          | Speed     | Tok/step | Acceptance |
|----------------|-----------|----------|------------|
| 3.5 (mismatch) | 67.4 t/s  | 3.76     | 23.5%      |
| 3.6 (matched)  | 106.1 t/s | 5.12     | 32.0%      |

Budget sweep (3.6 matched):

budget=6:   93.9 t/s, 4.27 tok/step
budget=10: 106.1 t/s, 5.12 tok/step  ← sweet spot
budget=14:  96.5 t/s, 4.74 tok/step
budget=18: 102.7 t/s, 5.82 tok/step
budget=22: 103.5 t/s, 5.82 tok/step

What changed

internal.h: Added is_swa per-layer flag and swa_window to DraftWeights.

safetensors_draft.cpp: Reads the sibling config.json for layer_types and sliding_window, and sets is_swa=true for "sliding_attention" layers (a minimal parsing sketch follows this section). Prints a diagnostic:

[draft] SWA layers: 4/5 (window=2048)

qwen3_dflash_graph.cpp: Plumbs SWA mask through ggml_flash_attn_ext. Currently a no-op for short contexts (window=2048 > typical DFlash ctx_len), but the flag is ready for when longer contexts are used.
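
A minimal sketch of the config.json parsing described above, assuming nlohmann::json is available (the repo's actual JSON facility may differ). The `DraftLayer`/`DraftWeights` fields and the `load_swa_config` name are illustrative, matching the names used in this PR description rather than the actual loader code:

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

// Illustrative struct fields, named after the flags described in this PR.
struct DraftLayer   { bool is_swa = false; };
struct DraftWeights { std::vector<DraftLayer> layers; int32_t swa_window = 0; };

// Parse the sibling config.json and set per-layer SWA flags.
static void load_swa_config(DraftWeights & w, const std::string & config_path) {
    std::ifstream f(config_path);
    if (!f.is_open()) {
        return; // no config.json next to the weights: keep non-SWA defaults
    }
    const nlohmann::json j = nlohmann::json::parse(f);
    if (j.contains("sliding_window") && j["sliding_window"].is_number_integer()) {
        w.swa_window = j["sliding_window"].get<int32_t>();
    }
    if (j.contains("layer_types") && j["layer_types"].is_array()) {
        size_t n_swa = 0;
        const auto & types = j["layer_types"];
        for (size_t il = 0; il < types.size() && il < w.layers.size(); ++il) {
            w.layers[il].is_swa = (types[il] == "sliding_attention");
            n_swa += w.layers[il].is_swa ? 1 : 0;
        }
        fprintf(stderr, "[draft] SWA layers: %zu/%zu (window=%d)\n",
                n_swa, w.layers.size(), w.swa_window);
    }
}
```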

How the 3.6 draft differs from 3.5

3.5: layer_types = [full, full, full, full, full]
3.6: layer_types = [swa,  swa,  swa,  swa,  full]  sliding_window=2048

Same tensor names, same shapes, same architecture otherwise. The key difference is the weights are trained with sliding window attention patterns, giving better acceptance on the Qwen3.6 target.

Test

# Download matched 3.6 draft
huggingface-cli download z-lab/Qwen3.6-27B-DFlash model.safetensors config.json --local-dir models/draft-36

# Run (copy config.json next to model.safetensors for SWA detection)
DFLASH27B_KV_K=q4_0 DFLASH27B_KV_V=q4_0 DFLASH27B_FA_WINDOW=0 \
  python scripts/run.py \
  --prompt "Write Python quicksort" \
  --n-gen 256 --budget 10 --max-ctx 4096 \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft-36

Builds on #89 (conv_input_cache fix, already merged).

🤖 Generated with Claude Code

The z-lab/Qwen3.6-27B-DFlash draft model uses 4 sliding_attention + 1
full_attention layers (vs all full_attention in the 3.5 draft). This
caused the binary to silently misinterpret the model.

Changes:
- internal.h: add `is_swa` flag to DraftLayer, `swa_window` to DraftWeights
- safetensors_draft.cpp: parse config.json for `layer_types` and
  `sliding_window`, set per-layer SWA flags. Prints diagnostic on load.
- qwen3_dflash_graph.cpp: plumb SWA mask through flash_attn_ext
  (currently a no-op for short contexts where window > total_k,
  which covers all typical DFlash use cases)

Results on RTX 4090 (Qwen3.6-27B Q4_K_M target):

  3.5 draft (mismatch):  67.4 t/s, 3.76 tok/step, 23.5% acceptance
  3.6 draft (matched):  106.1 t/s, 5.12 tok/step, 32.0% acceptance
                         ^^^^^^^^
                         57% speedup from matched draft

Budget sweep (3.6 matched draft):
  budget=6:   93.9 t/s, 4.27 tok/step
  budget=10: 106.1 t/s, 5.12 tok/step  ← sweet spot
  budget=18: 102.7 t/s, 5.82 tok/step
  budget=22: 103.5 t/s, 5.82 tok/step

Tested with: z-lab/Qwen3.6-27B-DFlash model.safetensors,
DFLASH27B_KV_K=q4_0, DFLASH27B_KV_V=q4_0, max-ctx=4096.
@Quitetall
Contributor Author

I noticed the llama.cpp Qwen 3.6 draft implementation was complete, so I ported it over to lucebox with the help of Claude Opus. The code should be clean and reproducible.

@Quitetall
Contributor Author

Note: This was benchmarked on Qwen-3.6-27B Heretic, not standard weights. Tok/s may be even higher on the standard model that the draft model was trained on.


@cubic-dev-ai (bot) left a comment


1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/safetensors_draft.cpp">

<violation number="1" location="dflash/src/safetensors_draft.cpp:442">
P2: Config path derivation fails for bare filenames, so sibling `config.json` is never found and SWA metadata silently falls back to non-SWA defaults.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread dflash/src/safetensors_draft.cpp
When `path` is a bare filename like "model.safetensors" (no directory
prefix), `find_last_of('/')` returns npos and the dir variable keeps
the full filename, producing "model.safetensors/config.json".

Fix: fall back to "." (CWD) when no slash is found, so config.json
is looked up in the same directory the user invoked from.
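
A hypothetical sketch of that fix; `sibling_config_path` is an illustrative helper name, not the actual function in safetensors_draft.cpp:

```cpp
#include <string>

// Derive the config.json path from the safetensors path, falling back
// to "." (the current working directory) for bare filenames.
static std::string sibling_config_path(const std::string & path) {
    // find_last_of returns npos when `path` has no directory component,
    // e.g. "model.safetensors" passed from the invoking directory.
    const size_t slash = path.find_last_of("/\\");
    const std::string dir =
        (slash == std::string::npos) ? "." : path.substr(0, slash);
    return dir + "/config.json";
}
```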
@tomByrer

tomByrer commented May 5, 2026

Heretic

Thanks for testing that version; I heard it may be better at coding, so I was looking at testing it myself on a 3090.

Contributor

@howard0su left a comment


Please double check the code.

// For now, we pass nullptr and let full attention run — the mask
// setup requires knowing absolute positions which are in `in.positions_k`.
// TODO: implement mask fill in the caller or use ggml_diag_mask_inf
attn_mask = nullptr; // fallback to full attention until mask fill is wired
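
For reference, a minimal sketch of what that mask fill could look like, assuming `positions_q`/`positions_k` hold absolute token positions (as the TODO suggests via `in.positions_k`) and the backend takes an additive mask (0 for visible, -inf for masked). `fill_swa_mask` is an illustrative name, and the tensor type/layout that ggml_flash_attn_ext expects is not handled here:

```cpp
#include <cmath>
#include <cstdint>

// Fill an additive attention mask, row per query token.
static void fill_swa_mask(float * mask,                 // [n_q * n_k]
                          const int32_t * positions_q, int n_q,
                          const int32_t * positions_k, int n_k,
                          bool is_swa, int32_t window) {
    for (int i = 0; i < n_q; ++i) {
        for (int j = 0; j < n_k; ++j) {
            const int32_t d = positions_q[i] - positions_k[j];
            const bool causal = d >= 0;                 // never attend to the future
            const bool in_win = !is_swa || d < window;  // SWA: only the last `window` tokens
            mask[i * n_k + j] = (causal && in_win) ? 0.0f : -INFINITY;
        }
    }
}
```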
Contributor


This allocates something and then immediately nulls it. I don't understand how this helps performance?

javierpazo added a commit to javierpazo/lucebox-hub that referenced this pull request May 9, 2026
The current safetensors loader for DFlash drafters accepts BF16
weights and F32 norms only. In practice some drafters for
Qwen3.6-27B are published as FP16 safetensors, with F16 norms.

This change extends `safetensors_draft.cpp` to:

  * Accept tensors with dtype `F16` in addition to `BF16`. The new
    F16 path goes through the same staged code that BF16 already
    uses; the BF16 fast path is unchanged.
  * Convert F16 norms to F32 on load when the consumer (graph)
    requires F32 norm tensors. Avoids a separate copy step for
    callers that already deal with F16-norm safetensors files.
  * Keep the existing strictness around shape and missing tensors.
    Mismatched dtype or missing keys still fail with the same
    honest error; no silent fallback.

This is a correctness / compatibility change, not a kernel perf
change: the goal is to load existing FP16 drafter checkpoints
without conversion, not to claim a speedup over BF16. Numerical
parity is the validation goal here.
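
A minimal sketch of the F16-to-F32 norm conversion under those assumptions; `ggml_fp16_to_fp32` is ggml's scalar conversion helper, while `convert_f16_norm_to_f32` is an illustrative name rather than the function in this commit:

```cpp
#include <cstddef>
#include <vector>
#include "ggml.h"

// Widen an F16 norm tensor (already mapped from the safetensors file)
// to the F32 layout the draft graph expects.
static std::vector<float> convert_f16_norm_to_f32(const ggml_fp16_t * src, size_t n) {
    std::vector<float> dst(n);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = ggml_fp16_to_fp32(src[i]); // per-element half -> float
    }
    return dst;
}
```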

Validation:

  * Built and ran with `models/dflash-drafter-fp16/model.safetensors`
    on RTX 6000 Ada (sm_89), Heretic Q4_K_M target. The drafter
    loads, prefill and decode both proceed without failure, and
    output ids are identical to the matched BF16 drafter on a
    fixed seed for the same prompt set (parity check).
  * Existing BF16 drafter checkpoints continue to load through the
    untouched fast path.

Verification vs existing community PRs:

  PR Luce-Org#94 (open, "support Qwen3.6-27B-DFlash draft (SWA
  layers)", Quitetall) also touches `safetensors_draft.cpp` for
  config.json parsing of layer_types and window. The dtype branch
  here is orthogonal to the SWA parsing. If Luce-Org#94 lands first this
  commit can be rebased on top of the unified F16/BF16 path; if
  this lands first, Luce-Org#94 can rebase its config.json hunks on top
  without touching the dtype branch. Maintainers, happy to
  coordinate ordering.

Author: Javier Pazo <xabicasa@gmail.com>
