feat(dflash): wire caller-provided SWA mask through draft graph #140

Open

javierpazo wants to merge 1 commit into Luce-Org:main from javierpazo:xabicasa/dflash-qwen36-swa-mask-wiring

Conversation

@javierpazo
Contributor

feat(dflash): wire caller-provided SWA mask through draft graph

When the Qwen3.6 draft graph is built for a context that exceeds
the SWA window, the caller-provided sliding-window attention mask
must reach `ggml_flash_attn_ext` on the SWA layers.

Previously the mask was constructed but then nullified inside
`qwen3_dflash_graph`, so the SWA layers ran without the intended
visibility constraint at long contexts.
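
For concreteness, a minimal sketch of the call-site behaviour being described. The names (`layer_is_swa`, `total_k`, `caller_mask`, `kq_scale`) are hypothetical, not identifiers from `qwen3_dflash_graph.cpp`, and the trailing `ggml_flash_attn_ext` arguments follow the current upstream ggml signature, which may differ in the revision vendored here:

```cpp
#include "ggml.h"

// Illustrative only: mask selection for one draft attention layer. The point of
// the fix is that the caller-provided mask now survives to the attention op;
// previously an unconditional `mask = nullptr;` after construction discarded it.
static struct ggml_tensor * draft_layer_attn(
        struct ggml_context * ctx,
        struct ggml_tensor * q, struct ggml_tensor * k, struct ggml_tensor * v,
        struct ggml_tensor * caller_mask,   // DraftGraphInputs::attn_mask (may be null)
        bool layer_is_swa, int64_t total_k, int swa_window, float kq_scale) {
    struct ggml_tensor * mask = nullptr;
    if (layer_is_swa && total_k > swa_window) {
        mask = caller_mask;                 // wire the caller-provided F16 SWA mask through
    }
    // Non-SWA layers (and short contexts) keep mask == nullptr, as before.
    return ggml_flash_attn_ext(ctx, q, k, v, mask, kq_scale,
                               /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
}
```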

This change makes the wiring explicit and pins it down as a contract:

  * `dflash_graph.h` — `DraftGraphInputs` gains an optional
    `attn_mask` field. Documented as caller-owned, type F16, with
    shape `[kv_len, q_len]` (or padded `[kv_pad, q_pad]`), values
    `0` for visible positions and `-inf` for masked positions.
    Two helpers are added so callers do not reimplement the same logic
    (a caller-side sketch follows this list):

      bool draft_graph_needs_swa_mask(const DraftWeights & w,
                                      int ctx_len);
      void build_draft_swa_mask(std::vector<uint16_t> & out,
                                int ctx_len, int q_len,
                                int swa_window);

    `lm_head` is normalised to a default-null member at the same
    time (small consistency fix; layout unchanged).

  * `qwen3_dflash_graph.cpp` — when `total_k > swa_window` on a SWA
    layer, the graph wires the caller-provided mask through to
    `ggml_flash_attn_ext` and stops nullifying it. Layers that are
    not SWA still ignore the mask, as before.

  * `smoke_draft_graph.cpp` and `test_vs_oracle.cpp` — small
    adjustments so the existing tests can build and fill the draft SWA
    mask when `ctx_len + q_len > swa_window`. No new test scaffolding
    is added in this commit; the focused regression test lives in a
    separate PR (`test(dflash): contract test for draft SWA mask
    wiring`) so each PR keeps one concern.
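
A hedged caller-side sketch of the new contract, referenced from the first bullet above. The struct below is a stand-in, not the real `DraftGraphInputs`; in particular, whether `attn_mask` is a raw F16 buffer or a ggml tensor is an assumption, not confirmed by this PR text:

```cpp
#include <cstdint>
#include <vector>

// Stand-in declarations mirroring the documented contract; the real ones live
// in dflash_graph.h and may differ in detail.
struct DraftWeights;                          // opaque here
struct DraftGraphInputs {
    const uint16_t * attn_mask = nullptr;     // optional, caller-owned, F16, shape [kv_len, q_len]
    // ... remaining draft-graph inputs elided ...
};

bool draft_graph_needs_swa_mask(const DraftWeights & w, int ctx_len);
void build_draft_swa_mask(std::vector<uint16_t> & out,
                          int ctx_len, int q_len, int swa_window);

// Visibility rule the mask is expected to encode: a query at absolute position
// q_pos may attend to a key at k_pos only if the key is not in the future and
// lies within the last swa_window tokens. Visible entries are 0, masked entries
// -inf (stored as F16 bit patterns in the vector the helper fills).
bool swa_visible(int q_pos, int k_pos, int swa_window) {
    return k_pos <= q_pos && q_pos - k_pos < swa_window;
}

// Attach the mask only when the context actually exceeds the SWA window.
void attach_draft_swa_mask(DraftGraphInputs & inputs,
                           std::vector<uint16_t> & storage,  // must outlive graph evaluation
                           const DraftWeights & w,
                           int ctx_len, int q_len, int swa_window) {
    if (draft_graph_needs_swa_mask(w, ctx_len)) {
        build_draft_swa_mask(storage, ctx_len, q_len, swa_window);
        inputs.attn_mask = storage.data();
    } else {
        inputs.attn_mask = nullptr;           // short contexts: mask not needed, not consumed
    }
}
```

The main lifetime point is that the mask buffer is caller-owned, so it has to stay alive until the draft graph has been evaluated.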

Validation:

  * Built and ran `smoke_draft_graph` and `test_vs_oracle` on
    RTX 6000 Ada (sm_89), Heretic Q4_K_M target, FP16 safetensors
    drafter, `FA_WINDOW=0`. Both tests pass before and after; the
    behaviour at `ctx_len <= swa_window` is unchanged (mask not
    needed and not consumed).
  * At long context the SWA layers now respect the caller mask.

Relationship to existing community PRs:

  Complementary / compatible (COMP-COMPL) with PR Luce-Org#94 (open,
  "support Qwen3.6-27B-DFlash draft (SWA layers)", Quitetall) and
  PR Luce-Org#129 (open draft, "sliding window attention for Qwen3.6
  draft model", howard0su).

  * PR Luce-Org#94 wires SWA via masks (same family as this PR).
  * PR Luce-Org#129 wires SWA via per-layer K/V truncation instead.

  The interface added here (caller-mask field + helpers) is small
  enough that it can survive either approach landing first. If
  PR Luce-Org#94 lands first, this commit should rebase cleanly because it
  formalises the mask path Luce-Org#94 already needs internally; if
  PR Luce-Org#129 lands first, the mask path here remains useful for
  callers that prefer mask semantics over K/V truncation. Maintainers:
  happy to coordinate ordering.

Author: Javier Pazo <xabicasa@gmail.com>

@cubic-dev-ai (bot) left a comment

No issues found across 4 files
