feat(dflash): wire caller-provided SWA mask through draft graph #140
Open
javierpazo wants to merge 1 commit into Luce-Org:main from
Conversation
When the Qwen3.6 draft graph is built for a context that exceeds
the SWA window, the caller-provided sliding-window attention mask
must reach `ggml_flash_attn_ext` on the SWA layers.
Previously the mask was constructed and then nullified
post-construction in qwen3_dflash_graph, so SWA layers ran without
the intended visibility constraint at long contexts.
This change makes the wiring explicit and pinnable as a contract:
* `dflash_graph.h` — `DraftGraphInputs` gains an optional
`attn_mask` field. Documented as caller-owned, type F16, with
shape `[kv_len, q_len]` (or padded `[kv_pad, q_pad]`), values
`0` for visible positions and `-inf` for masked positions.
Two helpers are added so callers do not reimplement the same logic:
bool draft_graph_needs_swa_mask(const DraftWeights & w,
int ctx_len);
void build_draft_swa_mask(std::vector<uint16_t> & out,
int ctx_len, int q_len,
int swa_window);
`lm_head` is normalised to a default-null member at the same
time (small consistency fix; layout unchanged).
* `qwen3_dflash_graph.cpp` — when `total_k > swa_window` on a SWA
layer, the graph wires the caller-provided mask through to
`ggml_flash_attn_ext` and stops nullifying it. Layers that are
not SWA still ignore the mask, as before.
* `smoke_draft_graph.cpp` and `test_vs_oracle.cpp` — small
alignment so the existing tests can build and fill the draft SWA
mask when `ctx_len + q_len > swa_window`. No new test scaffolding
is added in this commit; the focused regression test lives in a
separate PR (`test(dflash): contract test for draft SWA mask
wiring`) so each PR keeps one concern.
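To make the contract above concrete, here is a minimal sketch of what the two helpers could look like. The `DraftWeights` fields, the exact window arithmetic, and the query placement are assumptions, not taken from the patch; only the declared signatures and the `0` / `-inf` F16 convention come from it. The padded `[kv_pad, q_pad]` variant is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real DraftWeights; field names are guesses.
struct DraftWeights {
    bool has_swa_layers = false;
    int  swa_window     = 0;
};

// IEEE-754 binary16 bit patterns, so the sketch needs no ggml dependency.
static const uint16_t F16_ZERO    = 0x0000; // +0.0 (visible)
static const uint16_t F16_NEG_INF = 0xFC00; // -inf (masked)

// A mask is only needed once the context can exceed the window on a SWA layer.
bool draft_graph_needs_swa_mask(const DraftWeights & w, int ctx_len) {
    return w.has_swa_layers && ctx_len > w.swa_window;
}

// Fills `out` with shape [kv_len, q_len], kv index fastest-varying
// (ggml's ne[0]-major layout); assumes the q_len query tokens sit
// immediately after ctx_len cached tokens.
void build_draft_swa_mask(std::vector<uint16_t> & out,
                          int ctx_len, int q_len, int swa_window) {
    const int kv_len = ctx_len + q_len;
    out.assign((size_t)kv_len * (size_t)q_len, F16_NEG_INF);
    for (int q = 0; q < q_len; ++q) {
        const int q_pos = ctx_len + q;              // absolute query position
        for (int k = 0; k < kv_len; ++k) {
            // visible: causal AND within the sliding window
            if (k <= q_pos && q_pos - k < swa_window) {
                out[(size_t)q * kv_len + k] = F16_ZERO;
            }
        }
    }
}
```

For example, with `ctx_len = 4`, `q_len = 1`, `swa_window = 2`, only the two most recent KV positions remain visible to the single query token.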
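The per-layer decision in `qwen3_dflash_graph.cpp` can be sketched as a small selector; `MaskTensor`, `pick_attn_mask`, and the parameter names are hypothetical illustrations, and the real graph passes the selected mask into `ggml_flash_attn_ext` rather than returning it.

```cpp
#include <cstddef>

// Stand-in for the F16 mask tensor handed to ggml_flash_attn_ext.
struct MaskTensor { /* opaque in this sketch */ };

// Returns the mask that should reach ggml_flash_attn_ext for one layer:
// the caller-provided SWA mask when this is a SWA layer whose K/V length
// exceeds the window, and null otherwise (dense layers ignore it, and
// short contexts never need it).
const MaskTensor * pick_attn_mask(bool is_swa_layer,
                                  int total_k, int swa_window,
                                  const MaskTensor * caller_mask) {
    if (is_swa_layer && total_k > swa_window) {
        return caller_mask;   // previously this path nullified the mask
    }
    return nullptr;
}
```

This also captures why behaviour at short contexts is unchanged: with `total_k <= swa_window` the selector never consumes the caller mask.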
Validation:
* Built and ran `smoke_draft_graph` and `test_vs_oracle` on
RTX 6000 Ada (sm_89), Heretic Q4_K_M target, FP16 safetensors
drafter, FA_WINDOW=0. Both tests pass before and after; the
behaviour at `ctx_len <= swa_window` is unchanged (mask not
needed and not consumed).
* At long context the SWA layers now respect the caller mask.
Verification vs existing community PRs:
Complementary with PR Luce-Org#94 (open, "support Qwen3.6-27B-DFlash draft
(SWA layers)", Quitetall) and PR Luce-Org#129 (open Draft, "sliding
window attention for Qwen3.6 draft model", howard0su).
* PR Luce-Org#94 wires SWA via masks (same family as this PR).
* PR Luce-Org#129 wires SWA via per-layer K/V truncation instead.
The interface added here (caller-mask field + helpers) is small
enough that it can survive either approach landing first. If
PR Luce-Org#94 lands first, this commit should rebase cleanly because it
formalises the mask path Luce-Org#94 already needs internally; if
PR Luce-Org#129 lands first, the mask path here remains useful for
callers that prefer mask semantics. Maintainers, happy to
coordinate ordering.
Author: Javier Pazo <xabicasa@gmail.com>