draft: sliding window attention for Qwen3.6 draft model by howard0su · Pull Request #129 · Luce-Org/lucebox-hub

howard0su · 2026-05-08T14:26:00Z

Implement per-layer K/V truncation in the draft graph for SWA layers. In the Qwen3.6 pattern, layers 0..n-2 use a sliding window and the last layer uses full attention, whereas the Qwen3.5 draft model uses full attention in every layer.
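The layer pattern described above can be sketched as a small selector. This is an illustration only: `DraftAttn`, `draft_layer_attn`, and the `qwen36_pattern` switch are hypothetical names, not identifiers from this PR.

```cpp
#include <cassert>

// Hypothetical names for illustration; not taken from the PR's code.
enum class DraftAttn { SlidingWindow, Full };

// Returns the attention type for one draft layer.
// qwen36_pattern == true : layers 0..n-2 use a sliding window, last layer is full.
// qwen36_pattern == false: every layer is full (the Qwen3.5 draft layout).
inline DraftAttn draft_layer_attn(int layer, int n_layers, bool qwen36_pattern) {
    assert(n_layers > 0 && layer >= 0 && layer < n_layers);
    if (!qwen36_pattern) {
        return DraftAttn::Full;
    }
    return (layer == n_layers - 1) ? DraftAttn::Full : DraftAttn::SlidingWindow;
}
```

With a hypothetical 5-layer draft model, layers 0 to 3 would be classified as sliding-window and layer 4 as full.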

When --draft-swa=<window> is set and ctx_len > window, SWA layers:

- Window target_feat to the last <window> positions via ggml_view_3d
- Adjust positions_k to start from the window offset
- Use eff_total_k = window + q_len for K/V reshape
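The index arithmetic behind these three steps can be sketched without ggml. The struct and function names below are hypothetical, and the sketch assumes that positions_k holds absolute positions covering the kept context followed by the query tokens; the actual graph code in the PR may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type; describes the slice of context an SWA layer keeps.
struct SwaWindow {
    int64_t offset;       // first context position kept (the "window offset")
    int64_t kept_ctx;     // number of context positions kept
    int64_t eff_total_k;  // K/V length after truncation: kept_ctx + q_len
};

inline SwaWindow compute_swa_window(int64_t ctx_len, int64_t q_len, int64_t window) {
    assert(ctx_len >= 0 && q_len >= 0);
    // Assumption of this sketch: window <= 0 means SWA is disabled.
    // Truncation only activates when ctx_len > window.
    if (window <= 0 || ctx_len <= window) {
        return {0, ctx_len, ctx_len + q_len};
    }
    return {ctx_len - window, window, window + q_len};
}

// Key positions for the truncated K/V. They start at the window offset
// instead of 0, so the kept keys retain their absolute positions.
inline std::vector<int32_t> make_positions_k(const SwaWindow& w) {
    std::vector<int32_t> pos(static_cast<std::size_t>(w.eff_total_k));
    for (int64_t i = 0; i < w.eff_total_k; ++i) {
        pos[static_cast<std::size_t>(i)] = static_cast<int32_t>(w.offset + i);
    }
    return pos;
}
```

For example, with ctx_len = 6000, window = 2048, and an illustrative q_len = 16, the offset is 3952 and eff_total_k is 2064, so the key positions run from 3952 to 6015. In the real graph, the offset would be what selects the start of the ggml_view_3d view over target_feat.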

Also raises the effective DRAFT_CTX_MAX when SWA is enabled so that ctx_len can exceed the window size and SWA truncation can activate.

Usage: --draft-swa=2048 or DFLASH27B_DRAFT_SWA=2048
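A minimal sketch of how the two configuration paths might be resolved into one window size. The function names are hypothetical, and two behaviors are assumptions rather than facts from the PR: that the command-line flag takes precedence over the environment variable, and that a missing or invalid value leaves SWA off (returned here as 0).

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Parse a positive decimal integer; returns 0 (treated as "SWA off") on bad input.
inline int parse_window(const char* s) {
    if (s == nullptr || *s == '\0') {
        return 0;
    }
    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (*end != '\0' || v <= 0 || v > (1L << 30)) {
        return 0;
    }
    return static_cast<int>(v);
}

// Hypothetical resolver: looks for --draft-swa=<window> in argv first,
// then falls back to the value of the environment variable passed in.
inline int resolve_draft_swa(int argc, const char* const* argv, const char* env_value) {
    static const char kFlag[] = "--draft-swa=";
    const std::size_t n = sizeof(kFlag) - 1;
    for (int i = 1; i < argc; ++i) {
        if (std::strncmp(argv[i], kFlag, n) == 0) {
            return parse_window(argv[i] + n);
        }
    }
    return parse_window(env_value);
}
```

A caller would pass `std::getenv("DFLASH27B_DRAFT_SWA")` as the last argument; taking the value as a parameter keeps the function easy to test.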

Benchmark (RTX 2080 Ti, Qwen3.6-27B Q4_K_M + DFlash draft)

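| Context | Baseline AL | SWA=2048 AL | Δ AL | Baseline tok/s | SWA=2048 tok/s |
|---|---|---|---|---|---|
| 6K | 3.27 | 4.57 | +40% | 27.00 | 34.49 |
| 9K | 4.92 | 5.57 | +13% | 41.16 | 42.01 |
| 16K | 4.49 | 7.76 | +73% | 38.10 | 58.29 |
| 24K | 4.34 | 4.41 | +2% | 35.91 | 33.10 |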

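AL improves at every tested context length, most at 6K (+40%) and 16K (+73%). Throughput improves from 6K to 16K but drops at 24K (35.91 to 33.10 tok/s), where AL is nearly unchanged. No regression on short prompts (HumanEval bench: identical AL at 5.47).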
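The Δ AL column appears to be the relative change against the baseline, rounded to a whole percent. That reading is an assumption, but the sketch below reproduces the column from the table values; `pct_change` is a hypothetical helper name.

```cpp
#include <cmath>

// Relative change of `value` against `baseline`, rounded to a whole percent.
inline long pct_change(double baseline, double value) {
    return std::lround((value - baseline) / baseline * 100.0);
}
```

Applied to the AL columns this gives 40, 13, 73, and 2 for the four context lengths. Applied to the tok/s columns it gives 28, 2, 53, and -8.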

Implement per-layer K/V truncation in the draft graph for SWA layers (Qwen3.6 pattern: layers 0..n-2 use sliding window, last layer full).

When --draft-swa=<window> is set and ctx_len > window, SWA layers:
- Window target_feat to last <window> positions via ggml_view_3d
- Adjust positions_k to start from the window offset
- Use eff_total_k = window + q_len for K/V reshape

This follows the same K/V truncation approach the target model uses for fa_window, avoiding the broken attn_mask approach from PR Luce-Org#94.

Also raises the effective DRAFT_CTX_MAX when SWA is enabled so that ctx_len can exceed the window size and SWA truncation can activate.

Usage: --draft-swa=2048 or DFLASH27B_DRAFT_SWA=2048

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cubic-dev-ai

No issues found across 4 files

davide221 · 2026-05-08T17:19:03Z

Getting somewhat mixed results on an RTX 3090.

cubic-dev-ai Bot reviewed May 8, 2026


howard0su marked this pull request as draft May 8, 2026 23:20

howard0su wants to merge 1 commit into Luce-Org:main from howard0su:swa
