
feat(dflash): support Qwen3.6-27B-DFlash draft (SWA layers) — 106 t/s on RTX 4090 #94

Open
Quitetall wants to merge 2 commits into Luce-Org:main from Quitetall:feat/dflash-qwen36-swa-draft

Conversation

@Quitetall
Contributor

Summary

Adds support for z-lab/Qwen3.6-27B-DFlash, which uses sliding window attention (SWA) layers. The matched 3.6 draft gives a 57% speedup over the mismatched 3.5 draft (67.4 → 106.1 t/s).

Results (RTX 4090, Qwen3.6-27B Q4_K_M target)

| Draft          | Speed     | Tok/step | Acceptance |
|----------------|-----------|----------|------------|
| 3.5 (mismatch) | 67.4 t/s  | 3.76     | 23.5%      |
| 3.6 (matched)  | 106.1 t/s | 5.12     | 32.0%      |

Budget sweep (3.6 matched):

budget=6:   93.9 t/s, 4.27 tok/step
budget=10: 106.1 t/s, 5.12 tok/step  ← sweet spot
budget=14:  96.5 t/s, 4.74 tok/step
budget=18: 102.7 t/s, 5.82 tok/step
budget=22: 103.5 t/s, 5.82 tok/step

What changed

internal.h: Added is_swa per-layer flag and swa_window to DraftWeights.

safetensors_draft.cpp: Reads the sibling config.json for layer_types and sliding_window, and sets is_swa=true for "sliding_attention" layers (a minimal parsing sketch follows this section). Prints a diagnostic:

[draft] SWA layers: 4/5 (window=2048)

qwen3_dflash_graph.cpp: Plumbs SWA mask through ggml_flash_attn_ext. Currently a no-op for short contexts (window=2048 > typical DFlash ctx_len), but the flag is ready for when longer contexts are used.
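
A minimal sketch of the config.json parsing described above, assuming nlohmann::json is available (the repo's actual JSON facility may differ). The `DraftLayer`/`DraftWeights` fields and the `load_swa_config` name are illustrative, matching the names used in this PR description rather than the actual loader code:

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

// Illustrative struct fields, named after the flags described in this PR.
struct DraftLayer   { bool is_swa = false; };
struct DraftWeights { std::vector<DraftLayer> layers; int32_t swa_window = 0; };

// Parse the sibling config.json and set per-layer SWA flags.
static void load_swa_config(DraftWeights & w, const std::string & config_path) {
    std::ifstream f(config_path);
    if (!f.is_open()) {
        return; // no config.json next to the weights: keep non-SWA defaults
    }
    const nlohmann::json j = nlohmann::json::parse(f);
    if (j.contains("sliding_window") && j["sliding_window"].is_number_integer()) {
        w.swa_window = j["sliding_window"].get<int32_t>();
    }
    if (j.contains("layer_types") && j["layer_types"].is_array()) {
        size_t n_swa = 0;
        const auto & types = j["layer_types"];
        for (size_t il = 0; il < types.size() && il < w.layers.size(); ++il) {
            w.layers[il].is_swa = (types[il] == "sliding_attention");
            n_swa += w.layers[il].is_swa ? 1 : 0;
        }
        fprintf(stderr, "[draft] SWA layers: %zu/%zu (window=%d)\n",
                n_swa, w.layers.size(), w.swa_window);
    }
}
```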

How the 3.6 draft differs from 3.5

3.5: layer_types = [full, full, full, full, full]
3.6: layer_types = [swa,  swa,  swa,  swa,  full]  sliding_window=2048

Same tensor names, same shapes, same architecture otherwise. The key difference is the weights are trained with sliding window attention patterns, giving better acceptance on the Qwen3.6 target.

Test

# Download matched 3.6 draft
huggingface-cli download z-lab/Qwen3.6-27B-DFlash model.safetensors config.json --local-dir models/draft-36

# Run (copy config.json next to model.safetensors for SWA detection)
DFLASH27B_KV_K=q4_0 DFLASH27B_KV_V=q4_0 DFLASH27B_FA_WINDOW=0 \
  python scripts/run.py \
  --prompt "Write Python quicksort" \
  --n-gen 256 --budget 10 --max-ctx 4096 \
  --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft-36

Builds on #89 (conv_input_cache fix, already merged).

🤖 Generated with Claude Code

The z-lab/Qwen3.6-27B-DFlash draft model uses 4 sliding_attention + 1
full_attention layers (vs all full_attention in the 3.5 draft). This
caused the binary to silently misinterpret the model.

Changes:
- internal.h: add `is_swa` flag to DraftLayer, `swa_window` to DraftWeights
- safetensors_draft.cpp: parse config.json for `layer_types` and
  `sliding_window`, set per-layer SWA flags. Prints diagnostic on load.
- qwen3_dflash_graph.cpp: plumb SWA mask through flash_attn_ext
  (currently a no-op for short contexts where window > total_k,
  which covers all typical DFlash use cases)

Results on RTX 4090 (Qwen3.6-27B Q4_K_M target):

  3.5 draft (mismatch):  67.4 t/s, 3.76 tok/step, 23.5% acceptance
  3.6 draft (matched):  106.1 t/s, 5.12 tok/step, 32.0% acceptance
                         ^^^^^^^^
                         57% speedup from matched draft

Budget sweep (3.6 matched draft):
  budget=6:   93.9 t/s, 4.27 tok/step
  budget=10: 106.1 t/s, 5.12 tok/step  ← sweet spot
  budget=18: 102.7 t/s, 5.82 tok/step
  budget=22: 103.5 t/s, 5.82 tok/step

Tested with: z-lab/Qwen3.6-27B-DFlash model.safetensors,
DFLASH27B_KV_K=q4_0, DFLASH27B_KV_V=q4_0, max-ctx=4096.
@Quitetall
Contributor Author

I noticed the llama.cpp Qwen 3.6 draft implementation was complete, so I ported it over to lucebox with the help of Claude Opus. The code should be clean and reproducible.

@Quitetall
Contributor Author

Note: This was benchmarked on Qwen-3.6-27B Heretic, not standard weights. Tok/s may be even higher on the standard model that the draft model was trained on.


@cubic-dev-ai (bot) left a comment


1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/safetensors_draft.cpp">

<violation number="1" location="dflash/src/safetensors_draft.cpp:442">
P2: Config path derivation fails for bare filenames, so sibling `config.json` is never found and SWA metadata silently falls back to non-SWA defaults.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread dflash/src/safetensors_draft.cpp
When `path` is a bare filename like "model.safetensors" (no directory
prefix), `find_last_of('/')` returns npos and the dir variable keeps
the full filename, producing "model.safetensors/config.json".

Fix: fall back to "." (CWD) when no slash is found, so config.json
is looked up in the same directory the user invoked from.
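
A hypothetical sketch of that fix; `sibling_config_path` is an illustrative helper name, not the actual function in safetensors_draft.cpp:

```cpp
#include <string>

// Derive the config.json path from the safetensors path, falling back
// to "." (the current working directory) for bare filenames.
static std::string sibling_config_path(const std::string & path) {
    // find_last_of returns npos when `path` has no directory component,
    // e.g. "model.safetensors" passed from the invoking directory.
    const size_t slash = path.find_last_of("/\\");
    const std::string dir =
        (slash == std::string::npos) ? "." : path.substr(0, slash);
    return dir + "/config.json";
}
```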
@tomByrer

tomByrer commented May 5, 2026

Heretic

Thanks for testing that version; I heard it may be better at coding, so I was looking at testing it myself on a 3090.

Contributor

@howard0su left a comment


Please double check the code.

// For now, we pass nullptr and let full attention run — the mask
// setup requires knowing absolute positions which are in `in.positions_k`.
// TODO: implement mask fill in the caller or use ggml_diag_mask_inf
attn_mask = nullptr; // fallback to full attention until mask fill is wired
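
For reference, a minimal sketch of what that mask fill could look like, assuming `positions_q`/`positions_k` hold absolute token positions (as the TODO suggests via `in.positions_k`) and the backend takes an additive mask (0 for visible, -inf for masked). `fill_swa_mask` is an illustrative name, and the tensor type/layout that ggml_flash_attn_ext expects is not handled here:

```cpp
#include <cmath>
#include <cstdint>

// Fill an additive attention mask, row per query token.
static void fill_swa_mask(float * mask,                 // [n_q * n_k]
                          const int32_t * positions_q, int n_q,
                          const int32_t * positions_k, int n_k,
                          bool is_swa, int32_t window) {
    for (int i = 0; i < n_q; ++i) {
        for (int j = 0; j < n_k; ++j) {
            const int32_t d = positions_q[i] - positions_k[j];
            const bool causal = d >= 0;                 // never attend to the future
            const bool in_win = !is_swa || d < window;  // SWA: only the last `window` tokens
            mask[i * n_k + j] = (causal && in_win) ? 0.0f : -INFINITY;
        }
    }
}
```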
Contributor


This allocates something and then immediately nulls it. I don't understand how this helps performance?

javierpazo added a commit to javierpazo/lucebox-hub that referenced this pull request May 9, 2026
The current safetensors loader for DFlash drafters accepts BF16
weights and F32 norms only. In practice some drafters for
Qwen3.6-27B are published as FP16 safetensors, with F16 norms.

This change extends `safetensors_draft.cpp` to:

  * Accept tensors with dtype `F16` in addition to `BF16`. The new
    F16 path goes through the same staged code that BF16 already
    uses; the BF16 fast path is unchanged.
  * Convert F16 norms to F32 on load when the consumer (graph)
    requires F32 norm tensors. Avoids a separate copy step for
    callers that already deal with F16-norm safetensors files.
  * Keep the existing strictness around shape and missing tensors.
    Mismatched dtype or missing keys still fail with the same
    honest error; no silent fallback.

This is a correctness / compatibility change, not a kernel perf
change: the goal is to load existing FP16 drafter checkpoints
without conversion, not to claim a speedup over BF16. Numerical
parity is the validation goal here.
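
A minimal sketch of the F16-to-F32 norm conversion under those assumptions; `ggml_fp16_to_fp32` is ggml's scalar conversion helper, while `convert_f16_norm_to_f32` is an illustrative name rather than the function in this commit:

```cpp
#include <cstddef>
#include <vector>
#include "ggml.h"

// Widen an F16 norm tensor (already mapped from the safetensors file)
// to the F32 layout the draft graph expects.
static std::vector<float> convert_f16_norm_to_f32(const ggml_fp16_t * src, size_t n) {
    std::vector<float> dst(n);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = ggml_fp16_to_fp32(src[i]); // per-element half -> float
    }
    return dst;
}
```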

Validation:

  * Built and ran with `models/dflash-drafter-fp16/model.safetensors`
    on RTX 6000 Ada (sm_89), Heretic Q4_K_M target. The drafter
    loads, prefill and decode both proceed without failure, and
    output ids are identical to the matched BF16 drafter on a
    fixed seed for the same prompt set (parity check).
  * Existing BF16 drafter checkpoints continue to load through the
    untouched fast path.

Verification vs existing community PRs:

  PR Luce-Org#94 (open, "support Qwen3.6-27B-DFlash draft (SWA
  layers)", Quitetall) also touches `safetensors_draft.cpp` for
  config.json parsing of layer_types and window. The dtype branch
  here is orthogonal to the SWA parsing. If Luce-Org#94 lands first this
  commit can be rebased on top of the unified F16/BF16 path; if
  this lands first, Luce-Org#94 can rebase its config.json hunks on top
  without touching the dtype branch. Maintainers, happy to
  coordinate ordering.

Author: Javier Pazo <xabicasa@gmail.com>
