# Gemma4 support: pFlash + DFlash + chunked prefill, daemon mode, server routing (#131)
**Open** — dusterbloom wants to merge 41 commits into `Luce-Org:main` from `dusterbloom:feature/gemma4-support` (base: `main`).

## Commits (41)
All 41 commits are authored by dusterbloom:

- `4fe62b5` feat: add Gemma4 target + draft model support (26B-A4B MoE & 31B dense)
- `85a1196` test: add Gemma4 TDD smoke tests (all GREEN)
- `978ca01` fix: use correct capture_layer_ids from DFlash draft config.json
- `3335ee2` feat: implement draft KV cache for Gemma4 DFlash speculative decoding
- `7ce68ac` perf: chunked batched prefill for Gemma4 target (12-16x speedup)
- `1ef975f` refactor: update tokenize_prompt.py for Gemma4 with CSV output mode
- `133017d` feat: add Q8_0 quantization script for Gemma4 DFlash draft model
- `1386690` fix: correct Gemma4 DFlash decode — BOS, EOS, and SWA mask
- `33b6e9d` feat: add GGUF draft loader for Gemma4 DFlash + parameterize quantize…
- `d2a2c04` feat: implement Gemma4 pFlash prefill — layer-by-layer block-sparse a…
- `9588c97` fix: add pFlash CLI flags, --tokens-file, and prevent draft KV overflow
- `c15f93a` refactor: remove standalone ggml_turbo_wht calls — rotation now fused…
- `f2c36bc` feat: SWA-aware KV cache allocation with ring-buffer for 64K+ context
- `f2261bd` fix: draft KV ring-buffer wrap instead of crash on overflow
- `333f4e0` perf: hybrid pFlash prefill — batched SWA groups + GRAPH_CHUNK=32K
- `488190e` feat: wire pFlash into Gemma4 chunked prefill via ggml_flash_attn_sparse
- `1017dac` feat: gate pFlash dispatch on supported KV types + buffer-NULL guards
- `5b6ba1b` fix: SWA mask coordinate frame — chunks 2+ were silently corrupted
- `9097311` feat: add daemon mode to test_gemma4_dflash + server.py routing
- `8fa5cd0` chore: point submodule to dusterbloom fork on feature/tq3-kv-cache
- `5fb516d` fix: address 11 P2 review violations + draft KV rolling window
- `3b4a4cb` Merge remote-tracking branch 'origin/main' into feature/gemma4-support
- `8ff5c77` chore: bump submodule for S-buffer probe instrumentation
- `19def9c` fix(gemma4): disable SWA ring opt + add 256-align snap for multi-chunk
- `d68e7c4` feat(gemma4): non-monotonic SWA ring restores VRAM savings
- `ce4da35` feat(gemma4): narrow asymmetric KV (TQ3 → Q8 on captured full-attn)
- `2cb6ec6` chore: bump submodule for TQ3 → f16 dequant + MMA fast path
- `cf76b73` fix(test): auto-prefer Q8 GGUF drafter over BF16 safetensors
- `7eea84b` feat(test): expose --draft-max and --ignore-eos for DFlash dTree tuning
- `1115064` feat(mtp): Phase 2 — load_gemma4_mtp_assistant() loader + 7-assertion…
- `d4659ca` feat(mtp): Phase 3a — build_mtp_step_graph() + 6-assertion shape test
- `05e36e4` feat(mtp): Phase 3b — wire --draft-method {none,dflash,mtp}, byte-ide…
- `138de4d` fix(mtp): h_prev capture site, assistant rope_freqs, KQ scale = 1.0
- `30b2b50` fix(mtp): GQA block-broadcast + KQ mask + SWA-aware KV wrap
- `c56879c` fix(mtp): preserve TQ3_0 into FA + 256-pad K view + shared mask acros…
- `7b62c07` fix(gemma4): allocate+fill SWA mask for n_tokens==1 decode + bump lla…
- `f1f811e` fix(mtp): always provide FA mask for head_dim>=512 (any K type)
- `323e0f4` docs(bench): gemma4 context-scaling plan + prompt corpus + reproducib…
- `b441587` docs(gemma4): debugging journey blog — three fixes, prompt-distributi…
- `e65eefb` docs(gemma4): amend journey blog with corrected pflash + dense ladder
- `bf8653e` docs(bench): scientific harness — 24-cell dense×MoE × code×creative ×…
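Commits `f2c36bc` and `f2261bd` replace a hard crash on draft-KV overflow with a ring buffer: positions past the cache capacity wrap around and overwrite the oldest slots, keeping a rolling window live. A rough illustration of that position-to-slot mapping (class and method names here are hypothetical, not from this PR); with cap=2096 over a 49,904-token prefill it is consistent with the "skipped 47808 early tokens" line in the benchmark logs further down:

```python
# Hypothetical sketch of a ring-buffer draft KV slot mapper: instead of
# crashing when the cache overflows its capacity, positions wrap and
# overwrite the oldest slots (a rolling window).
class DraftKVRing:
    def __init__(self, capacity: int):
        self.capacity = capacity

    def slot(self, pos: int) -> int:
        # Logical position -> physical slot; wraps instead of overflowing.
        return pos % self.capacity

    def live_window(self, pos: int) -> range:
        # Positions still resident once we have written up to `pos`.
        start = max(0, pos + 1 - self.capacity)
        return range(start, pos + 1)

ring = DraftKVRing(capacity=2096)
print(ring.slot(2095))  # last slot before the wrap: 2095
print(ring.slot(2096))  # wraps back to slot 0
```

With a 49,904-token prefill (final position 49903) this window keeps the last 2096 positions and skips the first 47,808 — matching the `[draft] KV prefill done` log line below.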
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
```diff
@@ -1,7 +1,7 @@
 [submodule "dflash/deps/llama.cpp"]
 	path = dflash/deps/llama.cpp
-	url = https://github.com/Luce-Org/llama.cpp-dflash-ggml.git
-	branch = luce-dflash
+	url = https://github.com/dusterbloom/llama-cpp-turboquant-cuda.git
+	branch = feature/tq3-kv-cache
 [submodule "dflash/deps/Block-Sparse-Attention"]
 	path = dflash/deps/Block-Sparse-Attention
 	url = https://github.com/mit-han-lab/Block-Sparse-Attention.git
```
(One large diff is not rendered.)
New file (+57 lines):

# Matrix v2 at 64k — all fixes in. 2026-05-09T23:48:10+02:00

=== V1_none starting at 23:48:10 ===
V1_none rc=0
=== V2_mtp starting at 23:50:30 ===
V2_mtp rc=0
=== V3_dflash_dm8 starting at 23:52:54 ===
V3_dflash_dm8 rc=0

## Per-cell stats

### V1_none
```
[cache] narrow asymmetric: forced Q8_0 on 2 captured full-attn layer(s) (remaining 8 full-attn keep TQ3)
[cache] kv types: SWA=tq3_0, full=tq3_0
[prefill] 49904 tokens in 85278.6 ms (585.2 tok/s) [chunked+pflash, chunk_size=1024] (last sampled token: 100)
[stats] generated=256 decode_ms=37108.4 tok/s=6.90 first_tok_ms=145.88
[stats] prefill=49904 tokens context_used=50160/65536
[mem] VRAM used=21.25 GB total=24.00 GB
```

### V2_mtp
```
[cache] narrow asymmetric: forced Q8_0 on 2 captured full-attn layer(s) (remaining 8 full-attn keep TQ3)
[cache] kv types: SWA=tq3_0, full=tq3_0
[prefill] 49904 tokens in 85189.9 ms (585.8 tok/s) [chunked+pflash, chunk_size=1024] (last sampled token: 100)
[mtp] steps=256 accepted=5 accept_rate=0.02
[stats] generated=256 decode_ms=40432.7 tok/s=6.33 first_tok_ms=164.94
[stats] prefill=49904 tokens context_used=50160/65536
[mem] VRAM used=21.70 GB total=24.00 GB
```

### V3_dflash_dm8
```
[cache] narrow asymmetric: forced Q8_0 on 2 captured full-attn layer(s) (remaining 8 full-attn keep TQ3)
[cache] kv types: SWA=tq3_0, full=tq3_0
[draft] KV cache allocated: 2096 slots
[prefill] 49904 tokens in 85184.4 ms (585.8 tok/s) [chunked+pflash, chunk_size=1024] (last sampled token: 100)
[draft] KV prefill done: 2096 positions materialized (skipped 47808 early tokens, cap=2096)
[stats] generated=256 decode_ms=27753.8 tok/s=9.22 first_tok_ms=257.06
[stats] prefill=49904 tokens context_used=50160/65536
[spec] draft_steps=112 total_accepted=256 avg_accept=2.29
[mem] VRAM used=23.59 GB total=24.00 GB
```

## Decoded text comparison (first 80 generated tokens)

### V1_none
first_80_decoded: 'swe relentless<unused0>os<bos><unused94><pad><unk><unused6>ock<bos><blockquote><unused0>8<unused6>ublic<unused63>thought<unused94>thought\n<unused95>### Summary of Themes and Characters\n\nThis text consists of several fragmented scenes (likely from a play or a series of dramatic sketches) focusing on the political instability of Rome and the personal conflicts of its leaders.\n\n#### **Major Themes**\n\n* **Pride vs. Humility:** The'

### V2_mtp
first_80_decoded: 'swe absorber<unused3>os<unused2><unused94><unused94>thought\n<unused95>### Summary of Themes and Characters\n\nThe provided text consists of several fragmented scenes (likely from a composite or modified version of Shakespearean-style plays, including elements of *Coriolanus* and *Richard III*). The narrative focuses on the intersection of military glory, political instability, and the volatility of public favor.\n\n#### Major Themes\n\n'

### V3_dflash_dm8
first_80_decoded: 'swe Bras<mask>os<unused2><unused94><unused94>thought\n<unused95>### Summary of Themes and Characters\n\nThe provided text is a fragmented collection of scenes (likely from a composite or modified version of Shakespearean-style plays, blending elements of *Coriolanus* and *Richard III*). It depicts a world of political instability, violent ambition, and the volatile relationship between the ruling elite and the common people.'

DONE
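The `[spec]` and `[stats]` summary lines above follow directly from two ratios: average acceptance is total accepted tokens over draft steps, and decode throughput is generated tokens over decode wall time. A small check against the V1/V3 cells (numbers taken from the logs above):

```python
# Reproduce the summary arithmetic from the Matrix v2 cells above.
def avg_accept(total_accepted: int, draft_steps: int) -> float:
    # Accepted target tokens per speculative draft step.
    return round(total_accepted / draft_steps, 2)

def decode_tok_s(generated: int, decode_ms: float) -> float:
    # Decode-phase throughput in tokens per second.
    return round(generated / (decode_ms / 1000.0), 2)

print(avg_accept(256, 112))        # 2.29, as in V3's [spec] line
print(decode_tok_s(256, 27753.8))  # 9.22 tok/s for V3_dflash_dm8
print(decode_tok_s(256, 37108.4))  # 6.9 tok/s for V1_none
```

The ~2.29 tokens accepted per draft step is what lifts V3's decode rate to 9.22 tok/s against V1's 6.90 baseline, at the cost of the extra draft KV VRAM.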
New file (+71 lines):

# 64k drafter A/B with TQ3 + pFlash (dense 31B) — 2026-05-09T23:05:51+02:00
Prompt: long_50k.txt (~50k tokens), ctx=65536, n_predict=256

=== T1_none ===
T1_none rc=0
=== T2_mtp ===
T2_mtp rc=0
=== T3_dflash ===
T3_dflash rc=143

## Per-cell stats

### T1_none
```
[cache] narrow asymmetric: forced Q8_0 on 2 captured full-attn layer(s) (remaining 8 full-attn keep TQ3)
[cache] kv types: SWA=tq3_0, full=tq3_0
[prefill] 49904 tokens in 87859.2 ms (568.0 tok/s) [chunked+pflash, chunk_size=1024] (last sampled token: 100)
[stats] generated=256 decode_ms=37952.4 tok/s=6.75 first_tok_ms=150.63
[stats] prefill=49904 tokens context_used=50160/65536
[mem] VRAM used=21.40 GB total=24.00 GB
```

### T2_mtp
```
[cache] narrow asymmetric: forced Q8_0 on 2 captured full-attn layer(s) (remaining 8 full-attn keep TQ3)
[cache] kv types: SWA=tq3_0, full=tq3_0
[prefill] 49904 tokens in 87919.8 ms (567.6 tok/s) [chunked+pflash, chunk_size=1024] (last sampled token: 100)
[mtp-step 8] accept_rate=0.00
532 81179 108 818 3847 1816 10594 529 [mtp-step 16] accept_rate=0.00
3131 89144 18583 568 19609 699 496 22907 [mtp-step 24] accept_rate=0.00
653 12269 3567 529 36951 508 236772 3061 [mtp-step 32] accept_rate=0.00
10772 236764 2440 4820 529 808 236780 6886 [mtp-step 40] accept_rate=0.03
40707 605 236829 532 808 40421 8488 236829 [mtp-step 48] accept_rate=0.02
769 669 22323 21132 580 506 18074 529 [mtp-step 56] accept_rate=0.02
7820 27877 236764 5255 32202 236764 532 506 [mtp-step 64] accept_rate=0.05
43866 529 1237 4664 236761 108 2595 18787 [mtp-step 72] accept_rate=0.04
137944 108 236829 139 1018 203460 532 19839 [mtp-step 80] accept_rate=0.04
4499 53121 669 6082 12160 84022 2101 506 [mtp-step 88] accept_rate=0.03
16625 1534 33641 532 125860 236761 102301 605 [mtp-step 96] accept_rate=0.03
2481 81341 568 236780 6886 40707 605 236768 [mtp-step 104] accept_rate=0.03
563 496 24240 1933 31451 236764 840 914 [mtp-step 112] accept_rate=0.04
125688 573 506 3364 1331 532 914 45208 [mtp-step 120] accept_rate=0.03
531 623 1674 2737 236775 1091 2080 531 [mtp-step 128] accept_rate=0.03
914 124466 236761 4923 21077 3590 1515 496 [mtp-step 136] accept_rate=0.03
623 45513 236775 528 506 6114 529 914 [mtp-step 144] accept_rate=0.03
22816 532 496 179267 531 506 11838 236761 [mtp-step 152] accept_rate=0.03
107 236829 139 1018 818 6285 26633 529 [mtp-step 160] accept_rate=0.03
506 623 13666 4637 1083 1018 669 1816 [mtp-step 168] accept_rate=0.02
46235 506 214696 4135 529 506 3364 1331 [mtp-step 176] accept_rate=0.02
```

### T3_dflash
```
[cache] narrow asymmetric: forced Q8_0 on 2 captured full-attn layer(s) (remaining 8 full-attn keep TQ3)
[cache] kv types: SWA=tq3_0, full=tq3_0
```

## First 80 generated tokens (decoded)

### T1_none
raw extracted (first 80): [49904, 87859, 2, 568, 0, 100, 0, 3, 13, 134, 2, 895, 5, 308, 13, 206, 376, 45518, 100, 45518, 107, 101, 10354, 25252, 529, 137944, 532, 81179, 108, 818, 3847, 1816, 10594, 529, 3131, 89144, 18583, 568, 19609, 699, 496, 22907, 653, 12269, 3567, 529, 36951, 508, 236772, 3061, 10772, 236764, 2440, 4820, 529, 808, 236780, 6886, 40707, 605, 236829, 532, 808, 40421, 8488, 236829, 769, 669, 22323, 21132, 580, 506, 18074, 529, 7820, 27877, 236764, 5255, 32202, 236764]
decoded (first 80): 'sweולם<bos> (<pad><unused94><pad><unk><unused7>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<bos> Y[multimodal]F<unused7><code>�thought<unused94>thought\n<unused95>### Summary of Themes and Characters\n\nThe provided text consists of several fragmented scenes (likely from a composite or modified version of Shakespearean-style plays, including elements of *Coriolanus* and *Richard III*). The narrative focuses on the intersection of military glory, political instability,'

### T2_mtp
raw extracted (first 80): [49904, 87919, 8, 567, 6, 100, 100, 45518, 107, 101, 10354, 25252, 529, 137944, 532, 81179, 108, 818, 3847, 1816, 10594, 529, 3131, 89144, 18583, 568, 19609, 699, 496, 22907, 653, 12269, 3567, 529, 36951, 508, 236772, 3061, 10772, 236764, 2440, 4820, 529, 808, 236780, 6886, 40707, 605, 236829, 532, 808, 40421, 8488, 236829, 769, 669, 22323, 21132, 580, 506, 18074, 529, 7820, 27877, 236764, 5255, 32202, 236764, 532, 506, 43866, 529, 1237, 4664, 236761, 108, 2595, 18787, 137944, 108]
decoded (first 80): 'swe Tahun<unused2>ation<unused0><unused94><unused94>thought\n<unused95>### Summary of Themes and Characters\n\nThe provided text consists of several fragmented scenes (likely from a composite or modified version of Shakespearean-style plays, including elements of *Coriolanus* and *Richard III*). The narrative focuses on the intersection of military glory, political instability, and the volatility of public favor.\n\n#### Major Themes\n\n'

### T3_dflash: no [prefill] marker

DONE
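The T2_mtp cell above interleaves raw drafted-token dumps with `[mtp-step N] accept_rate=…` telemetry markers, which is exactly the pollution a reviewer flags in the P2 comment on token extraction. A minimal sketch of pulling only the accept-rate series out of such a log (format assumed from the excerpt above):

```python
# Extract only the accept_rate telemetry from an mtp log, ignoring the
# interleaved raw token-id dumps. Log format assumed from the T2_mtp
# excerpt above: "[mtp-step N] accept_rate=X".
import re

def accept_rates(log: str) -> list[float]:
    return [float(m)
            for m in re.findall(r"\[mtp-step \d+\] accept_rate=([0-9.]+)", log)]

sample = "[mtp-step 8] accept_rate=0.00\n532 81179 108\n[mtp-step 16] accept_rate=0.03"
print(accept_rates(sample))  # [0.0, 0.03]
```

A harness that separates telemetry from generated tokens this way would avoid the misleading "first 80 generated tokens" extraction the review comment describes.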
New file (+6 lines):

# Matrix v3 with SWA mask fix — 2026-05-09T22:17:43+02:00
=== N1_none_q8_tq3 (K=q8_0 V=tq3_0 draft=none) ===
N1_none_q8_tq3 rc=0
=== N2_none_q8_q8 (K=q8_0 V=q8_0 draft=none) ===
N2_none_q8_q8 rc=0
=== N3_mtp_q8_tq3 (K=q8_0 V=tq3_0 draft=mtp) ===

Review comment (P2): the third matrix entry is missing its rc/result line, leaving the summary incomplete and ambiguous.
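This matrix carries the SWA mask fix from commit `5b6ba1b`, whose message pinpoints a coordinate-frame bug: in chunked prefill, chunk c's queries start at absolute position c·chunk_size, so a mask built in chunk-local coordinates silently corrupts every chunk after the first. A toy illustration of the correct construction (function names hypothetical, not the PR's actual mask code):

```python
# Hypothetical illustration of the SWA-mask coordinate-frame pitfall:
# sliding-window visibility must be computed in ABSOLUTE positions,
# so each chunk's queries need the chunk_idx * chunk_size offset.
def swa_allowed(q_abs: int, k_abs: int, window: int) -> bool:
    # Causal sliding window: key visible iff not in the future and
    # within `window` positions of the query.
    return k_abs <= q_abs and q_abs - k_abs < window

def chunk_mask(chunk_idx: int, chunk_size: int, n_kv: int, window: int):
    rows = []
    for q_local in range(chunk_size):
        q_abs = chunk_idx * chunk_size + q_local  # the crucial offset
        rows.append([swa_allowed(q_abs, k, window) for k in range(n_kv)])
    return rows

# Tiny example: chunk 1 of size 4, 8 keys total, window 3.
# Its first query sits at absolute position 4 and sees keys 2..4 only;
# with q_abs mistakenly taken as 0 it would see only key 0.
m = chunk_mask(1, 4, 8, 3)
```

Dropping the offset reproduces the "chunks 2+ silently corrupted" symptom: the mask is well-formed, just aimed at the wrong window.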
New file (+6 lines):

```
tokenizer: google/gemma-3-27b-it
chat_template: yes (EvalPlus canonical instruction + opening code-fence)
source: HumanEval/2 (truncate_number)
token_count: 139
first_20: [105, 2364, 107, 9366, 2847, 496, 1265, 236772, 66436, 17856, 8948, 600, 64744, 506, 2269, 2608, 528, 496, 127532, 3393]
last_5: [236787, 107, 2717, 6719, 107]
```
New file (+1 line) — CSV token ids for the HumanEval/2 prompt:

```
105,2364,107,9366,2847,496,1265,236772,66436,17856,8948,600,64744,506,2269,2608,528,496,127532,3393,3355,236787,107,2717,109,2063,102267,236779,5640,236769,5640,236787,6803,236768,3921,6803,236787,107,140,12234,17770,496,4414,18224,1523,1548,236764,625,740,577,81153,1131,107,140,624,11995,912,568,65020,11995,7100,1082,2238,1548,236768,532,70208,107,140,236769,989,1749,912,2462,7100,1082,236743,236770,769,108,140,13293,506,20632,912,529,506,1548,236761,107,140,22539,102267,236779,5640,236769,236800,236761,236810,236768,107,140,236771,236761,236810,107,140,12234,108,2717,106,107,105,4368,107,43760,563,496,17856,8948,607,496,1265,236772,66436,1292,600,64744,506,2608,532,16349,7041,7713,236787,107,2717,6719,107
```
New file (+11 lines):

```
file: long_2k.txt
tool: HuggingFace transformers AutoTokenizer, model=google/gemma-3-27b-it (local cache)
tokenizer_vocab_size: 262144
gguf_vocab_size: 262144 (verified via gguf.GGUFReader)
chat_template_applied: yes
bos_prepended_in_csv: no (driver prepends BOS=2 automatically)
token_count: 2611
first_20_ids: [105, 2364, 107, 85305, 691, 6534, 531, 974, 1401, 20718, 529, 8116, 684, 1116, 12198, 580, 506, 4856, 236764, 532]
last_5_ids: [106, 107, 105, 4368, 107]
source_text:
  Alice in Wonderland Chapter I 'Down the Rabbit-Hole' in full. Source: Project Gutenberg https://www.gutenberg.org/cache/epub/11/pg11.txt (public domain). 2611 tokens, within [2048, 3072] target range.
```
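Per the metadata above, the token CSVs carry comma-separated ids with no BOS, and the driver prepends BOS=2 itself. A hedged sketch of how a consumer might load such a file (the function name is hypothetical; the PR's actual driver code is not shown here):

```python
# Load a one-line token-id CSV as emitted by tokenize_prompt.py's CSV
# mode, prepending BOS=2 the way the driver is documented to do.
import csv
import io

def load_tokens(csv_text: str, bos_id: int = 2) -> list[int]:
    row = next(csv.reader(io.StringIO(csv_text)))
    return [bos_id] + [int(tok) for tok in row]

print(load_tokens("105,2364,107"))  # [2, 105, 2364, 107]
```

Keeping BOS out of the CSV and adding it in the driver means the same file works whether or not a particular harness wants the BOS token counted.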
Review comment (P2): benchmark token extraction in the benchmark summary is polluted by telemetry/debug numbers, making the reported "first 80 generated tokens" unreliable and misleading.