draft: sliding window attention for Qwen3.6 draft model #129

Draft
howard0su wants to merge 1 commit into Luce-Org:main from howard0su:swa

Contributor

@howard0su howard0su commented May 8, 2026

Implement per-layer K/V truncation in the draft graph for SWA layers (Qwen3.6 pattern: layers 0..n-2 use a sliding window and the last layer uses full attention, whereas the Qwen3.5 draft model uses full attention in every layer).
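
The layer pattern described above can be sketched as a small predicate. This is only an illustration: the `DraftPattern` enum and `layer_uses_swa` helper are hypothetical names, not identifiers from this PR.

```cpp
#include <cassert>

// Hypothetical names for illustration; they do not come from the PR.
enum class DraftPattern {
    AllFull,        // Qwen3.5-style draft: every layer uses full attention
    SwaExceptLast   // Qwen3.6-style draft: layers 0..n-2 use SWA, last layer is full
};

// Returns true when the given layer should attend over a sliding window.
bool layer_uses_swa(DraftPattern pattern, int layer, int n_layers) {
    assert(n_layers > 0 && layer >= 0 && layer < n_layers);
    if (pattern == DraftPattern::AllFull) {
        return false;
    }
    return layer < n_layers - 1;  // only the final layer keeps full attention
}
```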

When --draft-swa=<window> is set and ctx_len > window, SWA layers:

  • Window target_feat to the last <window> positions via ggml_view_3d
  • Adjust positions_k to start from the window offset
  • Use eff_total_k = window + q_len for K/V reshape
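
The index arithmetic behind the three steps above might look roughly like the following. It is a sketch in plain C++ without ggml, under stated assumptions: the `SwaWindow` struct and both helpers are hypothetical, a window of 0 is assumed to mean "disabled", and the key positions are assumed to run contiguously from the window offset through the query tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type; field names are illustrative, not from the PR.
struct SwaWindow {
    int64_t offset;       // first context position an SWA layer keeps
    int64_t kept_ctx;     // number of context positions kept
    int64_t eff_total_k;  // kept context plus q_len, used for the K/V reshape
};

// Assumption: window == 0 disables SWA; truncation only applies when ctx_len > window.
SwaWindow plan_swa_window(int64_t ctx_len, int64_t q_len, int64_t window) {
    assert(ctx_len >= 0 && q_len >= 0 && window >= 0);
    const bool truncate = window > 0 && ctx_len > window;
    const int64_t kept = truncate ? window : ctx_len;
    return SwaWindow{ctx_len - kept, kept, kept + q_len};
}

// Assumption: key positions are contiguous, starting at the window offset.
std::vector<int32_t> make_positions_k(const SwaWindow& w) {
    std::vector<int32_t> pos(static_cast<std::size_t>(w.eff_total_k));
    for (int64_t i = 0; i < w.eff_total_k; ++i) {
        pos[static_cast<std::size_t>(i)] = static_cast<int32_t>(w.offset + i);
    }
    return pos;
}
```

With ctx_len = 6000, q_len = 16, and window = 2048, this gives an offset of 3952 and eff_total_k = 2064, so the last key position is still ctx_len + q_len - 1.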

Also raises the effective DRAFT_CTX_MAX when SWA is enabled so that ctx_len can exceed the window size and SWA truncation can activate.

Usage: --draft-swa=2048 or DFLASH27B_DRAFT_SWA=2048
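
One way the two configuration sources could be resolved is sketched below. The helper names are hypothetical, and the precedence (command-line flag over environment variable) and the "0 means disabled" convention are assumptions; the PR text does not state either.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: parse a positive window size, returning 0 on invalid input.
long parse_window(const char* text) {
    if (text == nullptr || *text == '\0') return 0;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    return (*end == '\0' && value > 0) ? value : 0;
}

// Assumption: the --draft-swa=<window> flag takes precedence over the env value.
// env_value would typically come from std::getenv("DFLASH27B_DRAFT_SWA").
long resolve_draft_swa(const std::vector<std::string>& args, const char* env_value) {
    const std::string prefix = "--draft-swa=";
    for (const std::string& arg : args) {
        if (arg.rfind(prefix, 0) == 0) {  // arg starts with the prefix
            return parse_window(arg.c_str() + prefix.size());
        }
    }
    return parse_window(env_value);
}
```

A caller would pass the program arguments and `std::getenv("DFLASH27B_DRAFT_SWA")`; a result of 0 leaves the draft model on full attention.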

Benchmark (RTX 2080 Ti, Qwen3.6-27B Q4_K_M + DFlash draft)

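| Context | Baseline AL | SWA=2048 AL | Δ AL | Baseline tok/s | SWA tok/s |
|---------|-------------|-------------|------|----------------|-----------|
| 6K      | 3.27        | 4.57        | +40% | 27.00          | 34.49     |
| 9K      | 4.92        | 5.57        | +13% | 41.16          | 42.01     |
| 16K     | 4.49        | 7.76        | +73% | 38.10          | 58.29     |
| 24K     | 4.34        | 4.41        | +2%  | 35.91          | 33.10     |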

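AL and throughput improve at 6K–16K context, most strongly at 16K (+73% AL, 38.10 → 58.29 tok/s). At 24K, AL is roughly flat (+2%) and throughput drops from 35.91 to 33.10 tok/s. No regression on short prompts (HumanEval bench: identical AL at 5.47).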

Implement per-layer K/V truncation in the draft graph for SWA layers
(Qwen3.6 pattern: layers 0..n-2 use sliding window, last layer full).

When --draft-swa=<window> is set and ctx_len > window, SWA layers:
- Window target_feat to last <window> positions via ggml_view_3d
- Adjust positions_k to start from the window offset
- Use eff_total_k = window + q_len for K/V reshape

This follows the same K/V truncation approach the target model uses
for fa_window, avoiding the broken attn_mask approach from PR Luce-Org#94.

Also raises the effective DRAFT_CTX_MAX when SWA is enabled so that
ctx_len can exceed the window size and SWA truncation can activate.

Usage: --draft-swa=2048 or DFLASH27B_DRAFT_SWA=2048

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 4 files

Contributor

davide221 commented May 8, 2026

[Screenshot, 2026-05-08 at 19:15:39]

Getting somewhat mixed results on an RTX 3090.

@howard0su howard0su marked this pull request as draft May 8, 2026 23:20