feat(dflash): wire caller-provided SWA mask through draft graph #140
Open
javierpazo wants to merge 1 commit into Luce-Org:main from
Conversation
When the Qwen3.6 draft graph is built for a context that exceeds
the SWA window, the caller-provided sliding-window attention mask
must reach `ggml_flash_attn_ext` on the SWA layers.
Previously the mask was constructed and then nullified
post-construction in qwen3_dflash_graph, so SWA layers ran without
the intended visibility constraint at long contexts.
This change makes the wiring explicit and pinnable as a contract:
* `dflash_graph.h` — `DraftGraphInputs` gains an optional
`attn_mask` field. Documented as caller-owned, type F16, with
shape `[kv_len, q_len]` (or padded `[kv_pad, q_pad]`), values
`0` for visible positions and `-inf` for masked positions.
Two helpers are added so callers do not reimplement the same logic:
bool draft_graph_needs_swa_mask(const DraftWeights & w,
int ctx_len);
void build_draft_swa_mask(std::vector<uint16_t> & out,
int ctx_len, int q_len,
int swa_window);
`lm_head` is normalised to a default-null member at the same
time (small consistency fix; layout unchanged).
* `qwen3_dflash_graph.cpp` — when `total_k > swa_window` on a SWA
layer, the graph wires the caller-provided mask through to
`ggml_flash_attn_ext` and stops nullifying it. Layers that are
not SWA still ignore the mask, as before.
* `smoke_draft_graph.cpp` and `test_vs_oracle.cpp` — small
alignment so the existing tests can build and fill the draft SWA
mask when `ctx_len + q_len > swa_window`. No new test scaffolding
is added in this commit; the focused regression test lives in a
separate PR (`test(dflash): contract test for draft SWA mask
wiring`) so each PR keeps one concern.
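To make the contract above concrete, here is a minimal sketch of what the two helpers could look like. The `DraftWeights` fields, the exact window arithmetic, and the query placement are assumptions, not taken from the patch; only the declared signatures and the `0` / `-inf` F16 convention come from it. The padded `[kv_pad, q_pad]` variant is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real DraftWeights; field names are guesses.
struct DraftWeights {
    bool has_swa_layers = false;
    int  swa_window     = 0;
};

// IEEE-754 binary16 bit patterns, so the sketch needs no ggml dependency.
static const uint16_t F16_ZERO    = 0x0000; // +0.0 (visible)
static const uint16_t F16_NEG_INF = 0xFC00; // -inf (masked)

// A mask is only needed once the context can exceed the window on a SWA layer.
bool draft_graph_needs_swa_mask(const DraftWeights & w, int ctx_len) {
    return w.has_swa_layers && ctx_len > w.swa_window;
}

// Fills `out` with shape [kv_len, q_len], kv index fastest-varying
// (ggml's ne[0]-major layout); assumes the q_len query tokens sit
// immediately after ctx_len cached tokens.
void build_draft_swa_mask(std::vector<uint16_t> & out,
                          int ctx_len, int q_len, int swa_window) {
    const int kv_len = ctx_len + q_len;
    out.assign((size_t)kv_len * (size_t)q_len, F16_NEG_INF);
    for (int q = 0; q < q_len; ++q) {
        const int q_pos = ctx_len + q;              // absolute query position
        for (int k = 0; k < kv_len; ++k) {
            // visible: causal AND within the sliding window
            if (k <= q_pos && q_pos - k < swa_window) {
                out[(size_t)q * kv_len + k] = F16_ZERO;
            }
        }
    }
}
```

For example, with `ctx_len = 4`, `q_len = 1`, `swa_window = 2`, only the two most recent KV positions remain visible to the single query token.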
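The per-layer decision in `qwen3_dflash_graph.cpp` can be sketched as a small selector; `MaskTensor`, `pick_attn_mask`, and the parameter names are hypothetical illustrations, and the real graph passes the selected mask into `ggml_flash_attn_ext` rather than returning it.

```cpp
#include <cstddef>

// Stand-in for the F16 mask tensor handed to ggml_flash_attn_ext.
struct MaskTensor { /* opaque in this sketch */ };

// Returns the mask that should reach ggml_flash_attn_ext for one layer:
// the caller-provided SWA mask when this is a SWA layer whose K/V length
// exceeds the window, and null otherwise (dense layers ignore it, and
// short contexts never need it).
const MaskTensor * pick_attn_mask(bool is_swa_layer,
                                  int total_k, int swa_window,
                                  const MaskTensor * caller_mask) {
    if (is_swa_layer && total_k > swa_window) {
        return caller_mask;   // previously this path nullified the mask
    }
    return nullptr;
}
```

This also captures why behaviour at short contexts is unchanged: with `total_k <= swa_window` the selector never consumes the caller mask.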
Validation:
* Built and ran `smoke_draft_graph` and `test_vs_oracle` on
RTX 6000 Ada (sm_89), Heretic Q4_K_M target, FP16 safetensors
drafter, FA_WINDOW=0. Both tests pass before and after; the
behaviour at `ctx_len <= swa_window` is unchanged (mask not
needed and not consumed).
* At long context the SWA layers now respect the caller mask.
Verification vs existing community PRs:
Complementary with PR Luce-Org#94 (open, "support Qwen3.6-27B-DFlash draft
(SWA layers)", Quitetall) and PR Luce-Org#129 (open Draft, "sliding
window attention for Qwen3.6 draft model", howard0su).
* PR Luce-Org#94 wires SWA via masks (same family as this PR).
* PR Luce-Org#129 wires SWA via per-layer K/V truncation instead.
The interface added here (caller-mask field + helpers) is small
enough that it can survive either approach landing first. If
PR Luce-Org#94 lands first, this commit should rebase cleanly because it
formalises the mask path Luce-Org#94 already needs internally; if
PR Luce-Org#129 lands first, the mask path here remains useful for
callers that prefer mask semantics. Maintainers, happy to
coordinate ordering.
Author: Javier Pazo <xabicasa@gmail.com>