Gemma4 support: pFlash + DFlash + chunked prefill, daemon mode, server routing #131
dusterbloom wants to merge 35 commits into Luce-Org:main
Conversation
Full implementation of the Gemma4 architecture for lucebox-hub DFlash.

Target model (GGUF loader + forward-pass graph builder):
- Per-layer head_count_kv array (8 for SWA, 2 for full-attention)
- Dual head_dim: 256 (SWA) / 512 (full-attention), with correct cache sizing
- V=K sharing on full-attention layers (attention_k_eq_v)
- MoE FFN: 128 experts, top-8 routing with a shared expert and softmax gating
- Sliding-window attention pattern read from a BOOL GGUF array
- Proportional RoPE (p-RoPE) with per-layer freq_factors
- Embedding scaled by sqrt(hidden_size), per the HF reference
- CUDA FA 256-alignment for head_dim >= 512 (FATTN_KQ_STRIDE)
- TurboQuant TQ3_0 KV cache with 256-byte alignment padding
- Logit softcapping: 30 * tanh(logits / 30)

Draft model (safetensors loader + forward pass):
- 5-layer transformer with SwiGLU FFN
- FC projection: 6 * target_hidden -> draft_hidden
- Tied LM head reusing the target tok_embd
- Block-diffusion speculative-decoding architecture
5 smoke tests validating the Gemma4 implementation:
- smoke_load_gemma4_target: GGUF metadata, per-layer head_kv, SWA pattern
- smoke_gemma4_target_forward: full 26B-A4B forward pass, logits in [-30, 30]
- smoke_load_gemma4_draft: safetensors loading, fc/layer shape validation
- smoke_gemma4_draft_forward: draft forward with injected target tok_embd
- test_gemma4_kv_tq3: TQ3 cache 256-alignment, shared layer donors

Plus a test_gemma4_dflash driver for combined target+draft benchmarking.
The evenly-spaced formula produced wrong IDs for both Gemma4 variants.
Use the actual values from the z-lab DFlash draft model config.json:
- 26B-A4B (30 layers): {1, 6, 11, 17, 22, 27}
- 31B (60 layers): {1, 12, 23, 35, 46, 57}
Fall back to evenly-spaced for unknown layer counts.
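The lookup-with-fallback can be sketched as follows. The table values come from the commit text; the evenly-spaced fallback shown is one plausible formula and the in-tree version may differ:

```python
# Known layer counts -> target layer IDs from the z-lab DFlash draft config.json.
KNOWN_TARGET_LAYER_IDS = {
    30: [1, 6, 11, 17, 22, 27],   # 26B-A4B
    60: [1, 12, 23, 35, 46, 57],  # 31B
}

def target_layer_ids(n_layers: int, n_draft: int = 6) -> list:
    if n_layers in KNOWN_TARGET_LAYER_IDS:
        return KNOWN_TARGET_LAYER_IDS[n_layers]
    # Evenly-spaced fallback for unknown layer counts (illustrative formula).
    step = n_layers / n_draft
    return [int(step * (i + 0.5)) for i in range(n_draft)]
```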
The draft model was stateless (no KV cache), giving 0% speculative acceptance. Add prefix-direct KV materialization: target features are projected through FC → hidden_norm → per-layer K/V and stored in a dedicated draft KV cache. The draft forward now attends to this cache, matching the SGLang/vLLM DFlash architecture.

Gemma4-26B-A4B with draft: avg 10.67 tokens accepted per step, ~250 tok/s decode on an RTX 3090 (vs ~67 tok/s baseline).
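The FC → norm → per-layer K/V pipeline can be sketched in pure Python with toy dimensions (the real path is ggml/CUDA tensors; the RMS-style normalization here is only a stand-in for hidden_norm):

```python
import math
import random

random.seed(0)
N_CAPTURE, TARGET_HIDDEN, DRAFT_HIDDEN, N_LAYERS = 6, 8, 4, 5  # toy sizes

def matmul(x, w):  # x: [n, d_in], w: [d_in, d_out]
    return [[sum(xi[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for xi in x]

def rand_mat(r, c):
    return [[random.gauss(0, 1) for _ in range(c)] for _ in range(r)]

def materialize_draft_kv(feats):
    # feats: [n_prefix, N_CAPTURE * TARGET_HIDDEN] captured target features
    w_fc = rand_mat(N_CAPTURE * TARGET_HIDDEN, DRAFT_HIDDEN)
    h = matmul(feats, w_fc)                              # FC projection
    h = [[v / math.sqrt(sum(u * u for u in row) / len(row) + 1e-6)
          for v in row] for row in h]                    # hidden_norm stand-in
    kv = []
    for _ in range(N_LAYERS):                            # per-layer K/V
        w_k = rand_mat(DRAFT_HIDDEN, DRAFT_HIDDEN)
        w_v = rand_mat(DRAFT_HIDDEN, DRAFT_HIDDEN)
        kv.append((matmul(h, w_k), matmul(h, w_v)))      # -> draft KV cache
    return kv
```

The point is the shape flow: 6 captured layers of target features collapse into one draft-width hidden per prefix token, which then seeds every draft layer's cache.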
Replace single-token autoregressive prefill with chunked batched forward. Each chunk processes up to swa_window tokens in a single GPU dispatch, cutting prefill from ~66 tok/s to ~830-1060 tok/s on RTX 3090. Add swa_mask to GemmaGraphInputs so SWA attention layers use a sliding-window mask during batched prefill while full-attention layers keep the standard causal mask.
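Two toy helpers sketch the chunk split and the sliding-window admit rule described above (illustrative only; the real masks are f32 tensors filled for the CUDA kernels):

```python
def prefill_chunks(n_prompt: int, swa_window: int):
    """Split the prompt into (kv_start, n_tokens) chunks of <= swa_window."""
    chunks, pos = [], 0
    while pos < n_prompt:
        n = min(swa_window, n_prompt - pos)
        chunks.append((pos, n))  # one batched GPU dispatch per chunk
        pos += n
    return chunks

def swa_admit(q_pos: int, k_pos: int, swa_window: int) -> bool:
    # SWA layers: causal AND within the last swa_window positions.
    # Full-attention layers keep only the causal half of this test.
    return k_pos <= q_pos and q_pos - k_pos < swa_window
```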
Add --csv flag for direct use with test_gemma4_dflash --tokens. Default model changed to google/gemma-4-26b-a4b-it. Add --verbose flag, local_files_only caching, and --add-bos option.
Converts the BF16 safetensors draft weights to Q8_0 GGUF format. Projection weights are quantized to Q8_0 (~50% of the original size); norms are kept in F32. Includes Gemma4-specific GGUF metadata (sliding_window, logit_softcap, target_layer_ids). A matching C++ GGUF loader is required at inference time.
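The "~50% size" figure follows from the Q8_0 block layout: 32 weights per block, stored as one f16 scale (2 bytes) plus 32 int8 values, versus 2 bytes per weight in BF16:

```python
def q8_0_bytes(n_weights: int) -> int:
    # Q8_0 block: 2-byte f16 scale + 32 int8 quants per 32 weights.
    assert n_weights % 32 == 0
    return (n_weights // 32) * (2 + 32)

def bf16_bytes(n_weights: int) -> int:
    return n_weights * 2
```

34/64 ≈ 53% of the BF16 footprint, which rounds to the commit's "~50%".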
Three bugs prevented coherent speculative-decoding output:

1. Missing BOS token: Gemma4 requires BOS (token 2) at position 0. Auto-prepend it from the GGUF bos_token_id when not already present.
2. Missing EOT fallback: many Gemma4 GGUFs omit eot_token_id, so eos_chat_id stayed -1 and <end_of_turn> (107) was never caught. Default to 107 when the key is absent.
3. Uninitialized SWA mask in speculative verify: when n_tokens > 1, build_gemma4_step allocates swa_mask but only attn_mask was filled. SWA layers read garbage memory, corrupting all hidden states and collapsing output to token 0 (padding) from step 2 onward.

Verified: DFlash now produces output identical to the AR baseline and stops at EOS. Gemma4-31B Q4_K_M + TQ3_0 KV = 80.82 tok/s (2.37x over the 34.14 tok/s AR baseline) on an RTX 3090.
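Fixes (1) and (2) reduce to a few lines; a sketch with hypothetical helper names (the token IDs are the ones from the commit text):

```python
BOS_ID_DEFAULT = 2     # Gemma4 BOS per the GGUF metadata
EOT_ID_FALLBACK = 107  # <end_of_turn> when eot_token_id is absent

def prepare_prompt(tokens, bos_id=BOS_ID_DEFAULT):
    # Auto-prepend BOS at position 0 when not already present.
    if not tokens or tokens[0] != bos_id:
        return [bos_id] + tokens
    return tokens

def resolve_eot(eot_from_gguf):
    # Many Gemma4 GGUFs omit eot_token_id; default to <end_of_turn> (107).
    return eot_from_gguf if eot_from_gguf is not None else EOT_ID_FALLBACK
```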
… script

load_gemma4_draft_gguf() reads Q8_0-quantized draft weights from GGUF, auto-detected by the .gguf extension on the --draft path. The Q8_0 drafter matches BF16 acceptance (AL=6.74) while loading 44% faster and using 380 MB less VRAM. quantize_gemma4_draft_q8.py now reads config.json for model dimensions instead of hardcoding the 26B constants, supporting both the 26B-A4B and 31B drafters.
…ttention

Layer-by-layer prefill using FlashPrefill block-sparse WMMA attention for full-attention layers and ggml FA for SWA layers. Includes a gallocr pre-reserve to eliminate graph-allocator overhead and fused [B+SWA] graphs to reduce hidden_buf round-trips. Benchmarks at 6K tokens (26B-A4B): 4073 tok/s (+12% over chunked prefill). Real gains are expected at 64K+, where attention density drops below 10%.
Add --pflash, --pflash-alpha, and --tokens-file flags to test harness. --tokens-file reads comma-separated IDs from a file, bypassing ARG_MAX limits for prompts >16K tokens. Fix draft KV cache overflow crash when prompt exceeds draft sliding window (2096 slots). Clamp prefill to trailing window, adjust ring-buffer offset, and add defensive assert in build_draft_kv_prefill_graph().
… in FA

The FWHT rotation for the TQ3_0 KV cache is now handled inside the Flash Attention CUDA kernel via a warp-cooperative shuffle. Remove the separate ggml_turbo_wht graph ops from build_swa_attn_block() and build_full_attn_block().
SWA layers only need swa_window slots, not the full context. At 64K with Gemma4 (50 SWA, 10 full-attn layers), this saves 81.8% of KV VRAM. Ring-buffer read/write positions use modular arithmetic so SWA cache views never exceed tensor boundaries at long contexts. Verified: 31B Dense at 64K uses 22.06 GB (target-only), 24.00 GB (full stack with Q8_0 draft + TQ3_0 KV + DFlash decode at 29.26 tok/s).
After prefill fills all 2096 draft KV slots, the first decode step would crash with "draft KV overflow". Now wraps draft_kv_pos with modulo arithmetic, treating the draft cache as a ring buffer.
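The overflow fix is just modular slot indexing; the 2096-slot capacity comes from the draft sliding window mentioned above:

```python
DKV_CAP = 2096  # draft KV cache capacity (slots)

def ring_slots(start: int, n_tokens: int, cap: int = DKV_CAP):
    """Cache slots touched when writing n_tokens starting at logical pos start."""
    return [(start + i) % cap for i in range(n_tokens)]
```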
Decouple Graph A/B chunk size (32K) from SWA window (1K-2K). Batch consecutive SWA layers into single ggml graphs to reduce graph build overhead. SWA_CHUNK now tracks actual cache allocation. Full-attn layers keep the existing Graph A → pFlash → Graph B path. pFlash integration into single-graph-per-chunk architecture is next.
Replaces the layer-by-layer gemma4_pflash_prefill() with a single-graph-per-chunk path using the new GGML_OP_FLASH_ATTN_SPARSE op for full-attention layers. SWA layers continue to use ggml_flash_attn_ext.

Perf (MoE 26B-A4B at 64K, RTX 3090, Q8_0 KV):
- chunked baseline: 1867 tok/s prefill, 100.6 tok/s decode, 10.67/16 accept
- + --pflash: 3374 tok/s prefill (1.81x), 101.8 tok/s decode

Changes:
- Adapter (pflash_ggml_adapter.cpp/h) registers the pFlash CUDA kernel with the ggml op. Maps alpha >= 1.0 to fully-dense mode.
- build_full_attn_block() conditionally uses ggml_flash_attn_sparse when use_pflash is set.
- attn_mask is skipped (in graph + driver) when use_pflash=true, since the sparse op applies block-level causal masking internally.
- gemma4_pflash_prefill.cpp removed (replaced by the chunked path).
- test/test_flash_attn_sparse.cpp: TDD coverage for the ggml op (dense vs sparse @ alpha=1.0 within BF16 precision; alpha<1.0 liveness).

Ported upstream fixes:
- TQ3_0 mask stride (PR Luce-Org#128): bump g_kq_stride_pad to 256 when KV is selected via the DFLASH27B_KV_K/V env vars. Prevents NaN at chunk sizes 256/512/1024/2048 with TQ3_0 KV.
- last_token_logits_only (PR Luce-Org#108): skip the lm_head matmul over all but the last token during prefill chunks. Saves ~1 GB of output tensor and ~1000x lm_head compute per chunk on Gemma4-31B (vocab=262144).
Three correctness fixes after benchmarking exposed silent corruption when --pflash was combined with quantized KV:

1. Graph-level type check in build_full_attn_block: dispatch to ggml_flash_attn_sparse only when K/V are F16/Q8_0/Q4_0. TQ3_0 falls back to ggml_flash_attn_ext because TQ3's WHT rotation requires special handling not yet in the sparse path.
2. Always allocate attn_mask in test_gemma4_dflash (previously skipped when use_pflash=true). When some full-attn layers fall back to dense FA (unsupported KV types), the mask is required.
3. Guard ggml_backend_tensor_set on attn_mask/swa_mask buffer existence: when all full-attn layers use sparse FA, the mask tensor is unreferenced by any compute op, so gallocr leaves its buffer NULL. ggml_set_output is added as a hint but doesn't force allocation; skip the write when the buffer is NULL. swa_mask gets the same defensive check.

Measured on Gemma-4-31B Q4_K_M, RTX 3090, Q8_0 KV:
- 4K: 1348 -> 1483 tok/s prefill (+10%), output matches baseline
- 8K: 1441 -> 1546 tok/s prefill (+7.3%), block-sparse approximation

The earlier MoE 64K "1.81x speedup" claim was measured on the broken sparse path (reading Q8 bytes as F16); that data point is invalid. The current numbers are from verified-correct execution. The TQ3_0 + chunked path is broken independently of pflash (produces token 0); it needs separate debugging.
The host-built SWA causal mask was filled in absolute KV coordinates (mask[q][abs_k] = 0 for valid keys), but the FA CUDA kernel reads it indexed by view position (k_view = 0..effective_win_len-1, where slot 0 is the cache offset where the K view starts). For every prefill chunk with kv_start > 0, the K view starts at ring_win_start in the cache (computed in build_swa_attn_block as kv_start - swa_window, aligned to the ring buffer). The mask cell [q][k_view=0] was written assuming absolute slot 0, which is far before the window's lo bound, so it stayed -inf. The kernel then saw every K-view position as -inf for q rows touching that chunk.

Symptoms:
- Q8/F16 KV: degraded but plausible-looking output (NaNs absorbed by saturating arithmetic; argmax landed on some non-zero index)
- TQ3_0 KV: clean NaN propagation through the WHT-rotated FA path; argmax over NaN-containing logits returns 0 (because `if (x[i] > best)` is false for NaN). This is why "TQ3 produces token 0" was the visible failure mode.

Fix:
- Add a SwaView struct + compute_swa_view() helper in internal.h / gemma4_target_graph.cpp encapsulating the (abs_win_start, effective_win_len, ring_win_start) math
- build_swa_attn_block calls the helper instead of inlining the math
- build_swa_causal_mask in the test driver takes (abs_win_start, win_len, n_tokens, kv_start, swa_window); it writes mask[q][k_view] for k_view in [0, win_len), using abs_win_start + k_view to check the absolute causal window
- swa_mask tensor sized [align_up(effective_win_len, g_kq_stride_pad), q_pad] instead of [align_up(kv_len, g_kq_stride_pad), q_pad]
- Both the prefill chunk loop and the spec-decode verify loop call the helper to get matching geometry

Measured impact (Gemma-4-31B Q4_K_M, RTX 3090):
- 8K Q8 baseline, last sampled token: 236770 (broken) -> 236799 (correct)
- 8K Q8 + pflash: 1284 -> 1497 tok/s (+16.6%)

The bug entered with chunked prefill (commit 7ce68ac); the SWA ring buffer (commit f2c36bc) made the offset non-monotonic in kv_start.
The reference Qwen3.5 driver (test/test_dflash.cpp:547-565) already had this correct via `out_mask[q*kv_pad + (k - win_start)]`. TQ3_0 still produces token 0 after this fix; that is a separate TQ3-specific bug.
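A pure-Python sketch of the corrected mask construction (indexed by view position, per the fix description) plus the NaN/argmax interaction that made TQ3 surface as "token 0". Names follow the commit text; the real code fills f32 buffers for the CUDA kernel:

```python
NEG_INF = float("-inf")

def build_swa_causal_mask(abs_win_start, win_len, n_tokens, kv_start, swa_window):
    mask = [[NEG_INF] * win_len for _ in range(n_tokens)]
    for q in range(n_tokens):
        q_abs = kv_start + q
        for k_view in range(win_len):
            k_abs = abs_win_start + k_view  # view slot -> absolute position
            if k_abs <= q_abs and q_abs - k_abs < swa_window:
                mask[q][k_view] = 0.0
    return mask

def naive_argmax(xs):
    # Mirrors the C pattern `if (x[i] > best)`: comparisons with NaN are
    # always false, so NaN-filled logits return index 0.
    best, best_i = NEG_INF, 0
    for i, x in enumerate(xs):
        if x > best:
            best, best_i = x, i
    return best_i
```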
Wires the Gemma4 binary into scripts/server.py so the OpenAI-compatible HTTP server can serve Gemma-4-31B and Gemma-4-26B-A4B (with the pFlash + DFlash + chunked-prefill stack built this session).

## test/test_gemma4_dflash.cpp

Added a daemon mode that mirrors the IPC protocol used by test_dflash (the Qwen3.5 binary):
- New flags: --daemon, --stream-fd=N, --max-ctx=N (alias for --ctx-size)
- No-op flags accepted for cmdline compatibility with server.py: --fast-rollback, --ddtree, --ddtree-budget=B, --ddtree-temp=F, --ddtree-no-chain-seed
- After model load, prints "[daemon] ready" to stdout and enters a stdin loop reading line-based commands
- Supported command: <prompt_bin_path> <n_gen> [samp=t,p,k,r[,seed]]
- prompt_bin_path is a binary file of int32 LE token IDs
- Each generated token is written as int32 LE to stream_fd; a -1 sentinel marks end of generation
- Unsupported commands (RESTORE, SNAPSHOT, compress, park, ...) are acknowledged with the -1 sentinel for now (out of scope for v1)

## scripts/server.py

- _read_gguf_architecture() reads general.architecture from a GGUF
- main() detects "gemma4" and switches DEFAULT_BIN to test_gemma4_dflash
- For Gemma4 the draft argument stays a directory (matching the binary's CLI); for Qwen3 it stays a file as before
- The daemon command is built differently per arch: Gemma4 uses --model / --draft named flags and accepts --pflash; Qwen3 keeps the existing positional form
- New top-level --pflash flag passes through to the Gemma4 daemon

Smoke-tested locally with the 26B-A4B model + a 4096-token prompt, n_gen=16: the daemon prints "[daemon] ready", consumes the binary prompt file, runs chunked prefill, decodes 16 tokens streamed as int32 LE on fd=3, and emits the -1 sentinel. The tokens are valid Gemma4 vocab IDs.
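The client side of the token stream protocol described above is trivial to sketch: read int32 little-endian values until the -1 end-of-generation sentinel:

```python
import io
import struct

def read_token_stream(stream):
    """Consume int32 LE tokens from a binary stream until the -1 sentinel."""
    tokens = []
    while True:
        raw = stream.read(4)
        if len(raw) < 4:
            break          # stream closed without a sentinel
        (tok,) = struct.unpack("<i", raw)
        if tok == -1:
            break          # end-of-generation sentinel
        tokens.append(tok)
    return tokens
```

In server.py this would run against the pipe behind stream_fd; here a BytesIO stands in for it.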
The parent's submodule pointer references commits that live only on github.com/dusterbloom/llama-cpp-turboquant-cuda (our pflash sparse-FA work). Update .gitmodules so cloners fetch from that fork instead of the upstream Luce-Org/llama.cpp-dflash-ggml repo (which doesn't have these commits). Maintainer can rewrite this URL post-merge if the commits get mirrored to a Luce-Org repo.
11 issues found across 20 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/errors.cpp">
<violation number="1" location="dflash/src/errors.cpp:30">
P2: Returns a pointer into shared mutable error storage after unlocking, so concurrent `set_last_error()` calls can invalidate the returned `const char *`.</violation>
</file>
<file name="dflash/scripts/quantize_gemma4_draft_q8.py">
<violation number="1" location="dflash/scripts/quantize_gemma4_draft_q8.py:227">
P2: Missing validation for empty `target_layer_ids` can crash quantization with modulo-by-zero when computing `TARGET_HIDDEN`.</violation>
</file>
<file name="dflash/scripts/server.py">
<violation number="1" location="dflash/scripts/server.py:69">
P2: Swallowing GGUF read errors here makes Gemma4 detection fail open, so the server silently takes the non-Gemma4 daemon path and uses the wrong argv shape instead of failing explicitly.</violation>
<violation number="2" location="dflash/scripts/server.py:893">
P2: Gemma4 draft path validation accepts non-directory paths by falling back to the parent directory, masking typos and using the wrong draft directory.</violation>
</file>
<file name="dflash/src/gemma4_target_loader.cpp">
<violation number="1" location="dflash/src/gemma4_target_loader.cpp:604">
P2: Failure paths after allocating `out.buf` return without cleaning up partial `GemmaTargetWeights`, so load errors can leak backend memory unless every caller manually frees on failure.</violation>
<violation number="2" location="dflash/src/gemma4_target_loader.cpp:675">
P2: Missing validation that `tok_embd_sz` is divisible by `n_vocab` before deriving `row_bytes` can corrupt embedding row strides for malformed GGUFs.</violation>
</file>
<file name="dflash/CMakeLists.txt">
<violation number="1" location="dflash/CMakeLists.txt:157">
P2: pFlash is gated by the first CUDA arch entry instead of the true minimum SM, which can wrongly enable sm80-only sources for unsorted mixed-arch builds.</violation>
</file>
<file name="dflash/test/test_flash_attn_sparse.cpp">
<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:107">
P2: The dense-vs-sparse correctness check is too permissive and can mask bad outputs, including non-finite values.</violation>
</file>
<file name="dflash/src/gemma4_dflash_graph.cpp">
<violation number="1" location="dflash/src/gemma4_dflash_graph.cpp:184">
P2: Missing bounds validation for kv_start + n_tokens before KV-cache writes in build_gemma4_draft_graph().</violation>
</file>
<file name="dflash/test/test_gemma4_dflash.cpp">
<violation number="1" location="dflash/test/test_gemma4_dflash.cpp:906">
P2: Daemon requests with the default seed 0 never reseed the shared RNG, so sampling becomes order-dependent across requests.</violation>
<violation number="2" location="dflash/test/test_gemma4_dflash.cpp:1591">
P2: Resetting `draft_kv_pos` to 0 on cache overflow discards the draft context instead of preserving a valid context length, so speculative decoding runs with an empty draft KV cache once capacity is reached.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
@dusterbloom can you fix the merge conflict? Great contribution!!
Bundled defensive fixes from code review:
1. errors.cpp: thread-local snapshot of last_error before c_str() return —
prevents concurrent set_last_error() from invalidating the returned
pointer across threads.
2. server.py:69: log GGUF read failures to stderr instead of silently
returning ""; prevents Gemma4 detection from failing open and using
the wrong daemon argv shape.
3. server.py:893: explicit branches for is_dir / is_file / not-found
on --draft path; no more silent fallback to parent directory that
masks user typos.
4. quantize_gemma4_draft_q8.py: confirmed existing N_TARGET_LAYERS == 0
guard at line 215 prevents the modulo-by-zero (no edit required).
5. gemma4_target_loader.cpp: a cleanup_out lambda frees out.buf and
   resets state on every failure path after the buffer allocation —
   prevents a backend memory leak on load errors.
6. gemma4_target_loader.cpp: validate tok_embd_sz % n_vocab == 0
before computing row_bytes — fails fast on malformed GGUFs instead
of corrupting embedding strides.
7. CMakeLists.txt: replace list(GET _dflash27b_archs 0 ...) with an
explicit min loop over all configured CUDA arches — pFlash now
correctly disables when ANY arch in the list is below sm_80.
8. test_flash_attn_sparse.cpp: add explicit non-finite (NaN/inf) check
in the dense-vs-sparse comparison; printf reports nonfinite=YES/no
and the return value requires both finite values and max_diff < 1.0.
9. gemma4_dflash_graph.cpp: GGML_ABORT on out-of-bounds kv_start +
n_tokens at the top of build_gemma4_draft_graph — catch at graph
build time instead of corrupted-memory crash later.
10. test_gemma4_dflash.cpp daemon: always reseed the RNG per request
(random_device when seed=0); prevents order-dependent sampling
across concurrent daemon requests.
11. test_gemma4_dflash.cpp draft KV overflow: replace the hard reset
    cache.draft_kv_pos = 0 with a sliding-window re-prefill from the
    last `keep = dkv_cap - q_len` accepted tokens. The hard reset
    discarded ALL draft context once the ring filled, collapsing DFlash
    speculative acceptance from 10.67/16 (32K) to 1.23/16 (64K) —
    matching the long-context regression mode described for EAGLE-style
    drafters in LongSpec (arXiv:2502.17421).
Also includes the WIP TQ3 rotation infrastructure (submodule pointer
bump). Self-test DFLASH_TQ3_VERIFY=1 confirms the rotation is
mathematically reversible (max_diff=0.000000 on roundtrip). TQ3 chunked
output still wrong; the bug is downstream of rotation.
# Conflicts:
#	dflash/deps/llama.cpp
#	dflash/scripts/server.py
Two interlocking bugs were silently corrupting Gemma4 multi-chunk prefill, producing all-zero decoded tokens (and an artificially high spec-accept rate, because target and drafter both predicted token 0 deterministically).

1. The SWA ring optimization (swa_ctx_alloc = swa_window + headroom) saves VRAM at long contexts but ring-wraps during multi-chunk prefill. The K view is constrained to a single contiguous ring slice [ring_win_start, ring_size), which on wrap covers only the pre-wrap portion. Post-wrap tokens (the latest writes) are silently omitted — queries at positions spanning the wrap can't attend to themselves or to recent context. Pragmatic fix: swa_ctx_alloc = max_ctx_alloc unconditionally, so SWA layers behave like full-attn during prefill. We lose the VRAM optimization but restore correctness. Future work: implement double-view SWA reads (concat pre-wrap + post-wrap views) so the memory savings can return without a correctness regression.

2. The SWA ring-wrap also produced a non-256-aligned win_len_padded clamp for TQ3_0 (which requires FATTN_KQ_STRIDE=256), causing a SIGSEGV. Snap ring_win_start down to the nearest 256-multiple so the K-view length stays aligned. The mask already excludes the extra padded tokens. This is now redundant given (1) but is kept as a safety net.

Also adds an env-gated [CACHE-WRITE-PROBE] in the test driver (DFLASH_TQ3_PROBE_CACHE_WRITE=1) for future debugging.

The submodule bump pulls in:
- fix(ggml-cuda): honor view_offs in cpy data pointer
- perf(ggml-cuda): skip cudaMemGetInfo on chunked-FA hot path

Verified end-to-end on RTX 3090:
- Dense 31B + Q8 + draft @ 2.5K: real tokens (was: all zeros)
- Dense 31B + TQ3 + draft @ 2.5K: real tokens (was: SIGSEGV)
- MoE 26B + TQ3 + draft @ 16K: real tokens, 1969 tok/s prefill
- Dense 31B + TQ3 + draft @ 4K: real tokens, 480 tok/s prefill
Replace the disable-fix (swa_ctx_alloc = max_ctx_alloc) with a properly-
sized ring + non-monotonic mask formula. Restores 70-95% SWA cache
VRAM savings at long contexts while keeping multi-chunk correctness.
Architecture:
- Ring sized to hold the last R = 2 * swa_window keys (= 2 chunks worth).
Always contains the relevant key window for any chunk, but in non-
monotonic order after wrap (newest tokens land in pre-wrap slots).
- K view is ALWAYS the full ring (ring_win_start = 0, len = ring_size).
The kernel reads the full ring; correctness comes from the mask.
- build_swa_causal_mask uses an abs_pos formula:
latest_slot = (kv_end - 1) % ring_size
offset_back = (latest_slot - k_view + R) % R
abs_k = (kv_end - 1) - offset_back
This handles any wrap pattern correctly.
- K/V WRITE path splits on wrap: when kv_start % R + n_tokens > R,
issue two ggml_cpy ops (pre-wrap [write_pos, R) + post-wrap [0, post_n)).
- compute_swa_view returns full-ring geometry; no truncation, no
alignment-snap, no contiguous-segment assertion.
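The abs_pos formula above can be sanity-checked directly: a key at absolute position p is written to ring slot p % R, and the formula must recover p for every position still live in the ring (the last R positions):

```python
def abs_pos_of_slot(k_view: int, kv_end: int, R: int) -> int:
    # The mask formula from the commit text, verbatim.
    latest_slot = (kv_end - 1) % R
    offset_back = (latest_slot - k_view + R) % R
    return (kv_end - 1) - offset_back

def formula_holds(kv_end: int, R: int) -> bool:
    # Every live absolute position p must round-trip through its ring slot.
    lo = max(0, kv_end - R)
    return all(abs_pos_of_slot(p % R, kv_end, R) == p for p in range(lo, kv_end))
```

This holds for any wrap pattern, including the non-monotonic case where the newest tokens occupy pre-wrap slots.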
Verified on RTX 3090, ~15 min run including TQ3 trifecta:
T1 single-chunk @ 900 (Q8 + draft): sampled=236774, real tokens
T2 2-chunk @ 2.5K (Q8 + draft): decoded 514, 4755, 822, 2864...
T3 ring-wrapping @ 8K (Q8 + draft): 1340 tok/s, real tokens
T4 MoE 16K + TQ3 + draft (the one): 2489 tok/s, swa=2048, saved 72.9%
VRAM at 64K Gemma4-31B: previously 5.5 GB SWA cache (disable-fix),
now ~0.18 GB (50 SWA layers * 2048 * 1792B = 30x reduction).
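Reproducing the SWA-cache arithmetic from the line above (50 SWA layers, a 2048-slot ring, 1792 bytes per slot per layer):

```python
SWA_LAYERS, RING_SLOTS, BYTES_PER_SLOT = 50, 2048, 1792
ring_bytes = SWA_LAYERS * RING_SLOTS * BYTES_PER_SLOT
ring_gib = ring_bytes / 2**30   # ~0.17 GiB (~0.18 GB decimal)
reduction = 5.5e9 / ring_bytes  # ~30x vs the 5.5 GB disable-fix cache
```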
Submodule bump pulls in the [TQ3-DEQ] printf re-gate.
Adds per-layer KV type machinery + a narrow override that forces Q8_0
on the small subset of full-attn layers whose hidden states are captured
for the DFlash drafter (target_feat ring). Mirrors vLLM's
kv-cache-dtype-skip-layers pattern.
Why: upstream FA dispatch (deps/llama.cpp/.../fattn.cu:441) routes
TQ3_0 + Q->ne[0]>256 to slow CHUNKED kernel. On Dense Gemma4-31B
(full-attn head_dim=512), this is a perf trap. Forcing the drafter's
captured layers to Q8 unblocks the pflash sparse fast path for the
slice the draft consumes.
Gate: kv_type==TQ3 && head_dim>256 && draft wired (capture_layer_ids
non-empty). SWA layers always exempt (don't hit the trap).
Empirical impact (RTX 3090, Dense 31B Q4_K_M + TQ3 + draft + pflash @ 4K):
- Dense override fires on 2 of 10 full-attn layers (capture IDs 12, 46)
- Prefill 48 -> 50 tok/s (marginal; 8 remaining full-attn still slow)
- MoE override fires on 2 of 4 captured (3 keep TQ3); no regression
(1464 tok/s under GPU contention vs 2489 dedicated)
- Q8 control unchanged (gate requires TQ3)
Recommendation for production: Dense 31B + draft -> use Q8_0 KV
(505 tok/s prefill in our testing) until an upstream MMA-F16 TQ3
dequant kernel for head_dim=512 lands. TQ3 KV remains optimal for
MoE 26B-A4B (2489 tok/s @ 16K).
Per-layer machinery (kv_k_type_per_layer, kv_v_type_per_layer) is kept
infrastructure for future asymmetric experiments.
Submodule commit 580246202 adds an opt-in (DFLASH_TQ3_MMA=1) route for TQ3_0 KV through the MMA-F16 tensor-core path:
- New k_tq3_0_dequant_f16_full bulk-dequant kernel
- Intercept in ggml_cuda_flash_attn_ext_mma_f16 with pool-allocated F16 K/V temp buffers
- tq3_needs_chunked guard lifted when the env var is set

Target prefill (Dense 31B + TQ3 + pflash, no draft): 420 -> 610 tok/s.

Note: with --draft enabled, Dense+TQ3 still hits the 9x penalty bug (separate from FA dispatch). The MMA fix is a building block toward closing that gap.
When --draft is a directory containing both draft-q8_0.gguf (1.6 GB) and model.safetensors (3 GB BF16), prefer the GGUF. The BF16 safetensors draft pushed Dense+TQ3 over the 24 GB VRAM ceiling on a 3090, which fragmented the allocator and triggered host-side cudaStreamSynchronize stalls (per nsys: 67% of total CUDA time, max sync 1.5 s) — collapsing target prefill from 800+ tok/s to 41 tok/s. The fix detects this case, logs a warning so the user knows what happened, and loads the GGUF.

Empirical impact (RTX 3090, draft path = directory):
- Dense 31B + TQ3 + draft + pflash @ 4K: 41 -> 797-852 tok/s (~20×)
- MoE 26B + TQ3 + draft + pflash @ 16K: 2489 -> 3089 tok/s (+24%)
- VRAM (MoE 16K): 24.0 GB -> 19.3 GB

This makes 852 tok/s the new ceiling for our Dense-31B + TQ3 + spec-decode trifecta on a single RTX 3090, beating the prior best-known by ~6× (stock llama.cpp/ollama hangs at 3-4K — see ollama#15350).

Bonus: an explicit `--draft .../draft-q8_0.gguf` already worked; this just removes the foot-gun for users passing the directory.
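A Python sketch of the resolution order described above (the file names come from the commit text; the real check lives in the C++ loader, so this function is illustrative):

```python
from pathlib import Path

def resolve_draft(draft_path: Path) -> Path:
    if draft_path.is_file():
        return draft_path                    # explicit file path already worked
    gguf = draft_path / "draft-q8_0.gguf"
    safetensors = draft_path / "model.safetensors"
    if gguf.exists():
        if safetensors.exists():
            # Prefer the 1.6 GB Q8_0 GGUF over the 3 GB BF16 safetensors
            # and warn, so the user knows which weights were loaded.
            print(f"warning: both draft formats in {draft_path}; "
                  "preferring the Q8_0 GGUF to stay inside VRAM")
        return gguf
    if safetensors.exists():
        return safetensors
    raise FileNotFoundError(f"no draft weights in {draft_path}")
```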
Add --draft-max <N> to runtime-cap the verify batch. The GGUF's
architectural block_size=16 stays validated at load; the new flag just
consumes only the first N draft tokens per cycle. Add --ignore-eos to
measure pure decode speed past natural EOS.
Empirical sweep on chat-style 4K real prompt at temp=0:
MoE 26B-A4B + TQ3 + DFlash + pflash @ 4K
dm=4 85.10 t/s AL=2.88/4 <- baseline 52 t/s, +63%
dm=8 50.28 t/s AL=2.08/8
dm=16 44.12 t/s AL=2.31/16 <- prior shipped default
Dense 31B + TQ3 + DFlash + pflash @ 4K (--ignore-eos run)
dm=4 36.78 t/s AL=3.51/4
dm=8 42.07 t/s AL=5.95/8 <- baseline 22 t/s, +87%
dm=16 25.74 t/s AL=3.16/16
block_size=16 was a CEILING, not an optimum. Chat workloads have AL=2-3
(MoE) / AL=3-6 (Dense), so dm=4-8 amortizes the per-step draft cost (5
layers x ~5 ms autoregressive) correctly while dm=16 over-batched and
lost decode throughput.
Per-model optimum differs (MoE: dm=4, Dense: dm=8). Ship as runtime knob;
loader's block_size validation stays unchanged.
… test
Add MtpDrafterWeights + MtpLayerWeights structs to internal.h. Implement
load_gemma4_mtp_assistant() in gemma4_target_loader.cpp to ingest the
AtomicChat-published gemma-4-31B-it-assistant GGUF (Q4_K_M, 49 tensors,
337 MB).
Loader contract (all 7 assertions PASS on the 31B GGUF):
n_embd_backbone == 5376 (target hidden)
requires_target_arch == "gemma4"
4 transformer blocks
attention_k_eq_v == true
pre_projection [2*backbone, n_embd] = [10752, 1024]
post_projection [n_embd, backbone] = [1024, 5376]
per-layer donor target index in [0, 60) — resolved by SWA-pattern match,
NOT a hardcoded "last SWA + last full" pair (mirrors atomicbot
gemma4-assistant.cpp:12-27)
Two surprises vs the plan that change Phase 3:
* 31B assistant uses CENTROID LM head (n_centroids=2048,
use_ordered_embeddings=true) — every AtomicChat 31B quant inherits
this from google/gemma-4-31B-it-assistant. v1 cannot skip centroids.
* MTP working dim n_embd=1024 differs from backbone 5376; bridged by
pre/post projection. Added n_embd field to MtpDrafterWeights and
reads from gemma4_assistant.embedding_length GGUF metadata.
SWA layout on 31B: layers {0,1,2}=SWA, layer 3=full → donors {59,59,59,58}.
Phase 0 spike with atomicbot's built llama-server is NO-GO: their fork
crashes in mmq.cuh:4241 (mmq_x_best=0) on first decode regardless of KV
type, and test-speculative-mtp shows sync vs async draft tokens diverge.
We use their SOURCE as contract reference, not their BUILD as oracle.
The 337 MB Q4_K_M GGUF parses cleanly and serves as our gold input.
Build adds test_mtp_loader as a conditional CMake target. RED-GREEN
locked: same test file that previously failed to compile now exits 0.
Add gemma4_mtp_graph.cpp (503 lines): single-step MTP graph that maps
(last_token, h_prev, pos) -> (logits, h_post, in-graph argmax). Cross-
attention reads target K/V from per-MTP-layer donor (resolved at load).
KV mask shared across gamma steps per MTP.md (all step positions
> attn_pos -> causal/SWA admit uniformly).
Mirror atomicbot/gemma4-assistant.cpp lines 28-130 for the per-step
build, lines 130-220 for the centroid LM head. Use atomicbot only as
contract reference — their llama-server build is broken (mmq.cuh:4241
crash on first decode regardless of KV type).
Add MtpStepGraph struct + build/free decls to internal.h. Add
token_embd.weight optional load to MtpDrafterWeights (will be null on
Q4_K_M, present on F16 — graph picks centroid path when null).
Test (test_mtp_graph_shapes.cpp, 298 lines): builds graph from real
GGUF + stub target, asserts 6 output tensor shapes. PASS on all 6:
out_logits [n_vocab=262144, 1] f32
out_h_post [n_embd_backbone=5376, 1] f32
out_argmax [1] i32
in_tok [1] i32
in_h_prev [n_embd_backbone, 1] f32
in_pos [1] i32
Phase 2 (test_mtp_loader) regression: 7/7 still PASS.
Two surprises caught during build:
* Dense 31B MTP has variable head_dim per layer type — SWA layers 0-2
use head_dim_q=256, full-attn layer 3 uses head_dim_q=512. The stale
GEMMA4_31B_HEAD_DIM=128 in gemma4.h is wrong but unused on this
path; the new graph derives head_dim from attn_q_norm->ne[0].
* token_embd.weight absent in Q4_K_M GGUF — fine for centroid path
(Dense 31B uses centroids + token_ordering for output, target's
tok_embd for input); a non-centroid drafter would need the F16 tier.
Phase 3b (spec-loop wiring at test_gemma4_dflash.cpp + h_prev capture
at gemma4_target_graph.cpp:1006) deferred to a follow-up commit.
…ntical gate
Phase 3b infrastructure + Phase 3a graph fixes for cross-attention
shape compatibility. End-to-end:
./test_gemma4_dflash --model <31B.gguf> --kv-k tq3_0 --kv-v tq3_0
--pflash --max-ctx 8192 --tokens-file <4K.csv>
--n-predict 32 --temp 0 --seed 0
--mtp <31B-assistant.Q4_K_M.gguf>
--draft-method mtp
runs to exit 0 and produces a token stream byte-identical to
--draft-method none on the same seed/temp. Regression-free DFlash path
preserved (have_draft path unchanged when --mtp not set).
Files touched:
test_gemma4_dflash.cpp +290 CLI (--mtp, --draft-method, DraftMethod
enum), DraftMethod::Auto resolver, MTP
weights/graph init alongside DFlash,
mtp_h_prev allocator/buffer in driver,
per-step graph rebuild + ggml_gallocr
alloc, draft accept/fallback loop, free
on cleanup.
gemma4_target_graph.cpp +23 h_prev capture at the existing capture-
layers tap (line ~1006), gated on
cache.mtp_h_prev_enabled and the resolved
last full-attn layer index.
internal.h +35 MtpStepGraph struct + build/free decls;
mtp_h_prev / mtp_last_full_layer fields
on GemmaTargetCache; DraftMethod enum.
gemma4_target_loader.cpp +18 Optional token_embd.weight load into
MtpDrafterWeights.tok_embd (null on
Q4_K_M GGUF since centroid head bypasses
it).
gemma4_mtp_graph.cpp +196/-64
Cross-attention rewrite: Q/K head_dim
reconciled (was 256 vs 128 mismatch
that crashed ggml_can_mul_mat). Replaced
ggml_flash_attn_ext with manual attn —
permute K, ggml_cast quantized→F16/F32,
ggml_repeat for GQA, mul_mat → scale →
soft_max → mul_mat. The fused FA kernel
selector (fattn.cu:652) had no path for
the MTP layer's specific (head_dim ×
n_head × n_kv) combo on either TQ3 OR
F16 KV. Manual attention is general and
works for any shape.
Known gap (deferred to Phase 4):
--draft-method mtp on degenerate-loop prompt shows accept_rate=0.00.
Byte-identical gate is met (verifier falls back to target's argmax on
rejection), but MTP itself is predicting wrong tokens. Need a real
long-form prompt to measure AL properly + diagnose. Possible causes:
h_prev capture point off, RoPE freqs mismatched, centroid head
scatter wrong, or KV mask handling on the cross-attn path.
VRAM budget concern: 24.00/24.00 GB on Dense 31B + TQ3 + MTP at 4K.
Per-step graph rebuild also burns time — Phase 4 will need allocator
reuse for any chance of perf, but correctness comes first.
Three focused fixes for Gemma4 MTP draft prediction quality.

(1) Move mtp_h_prev capture from inside the per-layer loop (gemma4_target_graph.cpp:1047) to AFTER the final RMSNorm (line 1075). h_prev must be the post-output-norm hidden — the same vector fed to lm_head — per vLLM PR #41745:569-621 + llama.cpp PR #22738. Capturing inside the layer loop fed the draft head pre-norm hiddens it was not trained on.

(2) Wire the assistant's own top-level rope_freqs.weight (shape [256] f32) into MtpDrafterWeights and prefer it for the full-attn MTP layer's RoPE rotation. Fall back to target.layers[donor_il].rope_freqs only when the assistant did not ship one (legacy GGUFs). vLLM PR #41745:422-436 documents that the MTP draft must build its own RoPE from its own rope_parameters[layer_type], not reuse the target's runtime freqs (which can be quantized or rotated by FWHT in our stack).

(3) KQ scale mismatch in cross-attention: change from target.attn_scale (1/sqrt(head_dim)) to the assistant's f_attention_scale = 1.0. Confirmed against atomicbot gemma4-assistant.cpp:139-140 / llama-model.cpp:1651 via Codex audit. Smoking-gun cause of greedy divergence on every step — the wrong scale produced a different softmax distribution. After this fix, the MTP draft emits independent predictions (e.g. tokens 236772, 1852, 92450, ...) instead of trivially defaulting to the target's argmax (which had been masking the bug as "byte-identical" while accept_rate stayed 0).

Status:
- Phase 3 byte-identical gate still met (target-only and --draft-method mtp produce identical token streams when MTP rejects every draft).
- accept_rate still 0% on degenerate test prompts — MTP now makes real (but still wrong) predictions. Remaining suspects per Codex audit are GQA head-grouping (item 2), KQ mask handling (item 3), and KV view length (item 4). Real-prompt evaluation deferred to a fresh Phase 4 run.
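For fix (1), the vector the draft head was trained on is the RMSNorm of the last layer's residual stream. A generic RMSNorm sketch (eps value and weight handling are assumptions, not the exact Gemma4 code):

```cpp
#include <cmath>
#include <vector>

// h_prev must be this post-norm vector (the same input lm_head sees),
// not the raw pre-norm residual captured inside the layer loop.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& w,
                            float eps = 1e-6f) {
    float ms = 0.0f;
    for (float v : x) ms += v * v;
    ms /= (float)x.size();
    const float inv = 1.0f / std::sqrt(ms + eps);
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) out[i] = x[i] * inv * w[i];
    return out;
}
```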
Three correctness fixes in cross-attention per Codex audit:

(1) GQA head broadcast (lines ~340-415): replace direct ggml_repeat (which tiles by modulo: 0,1,...,Hkv-1,0,1,... — interleaved) with a ggml_view_4d + ggml_cont + ggml_reshape_3d block-broadcast pattern that produces a 0,0,...,1,1,... block layout, matching standard GQA semantics. Each KV head is now correctly shared by n_head_fa/n_head_kv consecutive Q heads.

(2) KQ mask (line ~455): replace ggml_soft_max(KQ) with ggml_soft_max_ext(KQ, KQ_mask, 1.0f, 0.0f) using an all-zero F32 mask. Atomicbot constructs a mask in llama-graph.cpp:2511-2515; passing a zero-bias mask matches the "all positions admitted" semantic for cross-attn while keeping the ext softmax kernel happy.

(3) SWA-aware KV view (lines ~301-355): replace the bare min(attn_pos, cache_k->ne[1]) clamp with proper ring-buffer wrap handling. SWA layers now (a) clamp to swa_window-1 admitted positions, (b) compute the ring start slot via modulo, (c) detect wrap-around, and (d) build the K/V view via ggml_concat of two slices. Quantized cache (TQ3) goes through a TQ3→F16→F32 two-step cast since cpy.cu doesn't support TQ3→F32 directly and concat needs F32. Full-attn donors keep the simple [0, attn_pos) view.

Plus per-step diagnostic prints in the test driver (draft vs target token).

Status:
- All three crashes fixed; build clean; runtime no longer aborts.
- accept_rate STILL 0% on the test prompt — MTP now emits independent varying predictions (e.g. 62542, 8404, 546) that consistently diverge from the target's varying predictions (236762, 514, 92450).
- Real semantic divergence remains; not a wiring crash. Likely remaining: V permute order, pre_projection input format, or a per-block residual sequence detail. Deferred to a focused next session where we can compare h_inner values against a known-good reference.

Phase 3 byte-identical gate still met (target-only and --draft-method mtp produce identical output streams when MTP rejects every draft).
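The head-layout difference in fix (1) reduces to an index-mapping question: which KV head serves Q head q? A tiny sketch contrasting the two layouts (illustration only, not the graph code):

```cpp
// Block broadcast (standard GQA): consecutive groups of Q heads share one
// KV head -> 0,0,...,1,1,...
int kv_head_block(int q_head, int n_head, int n_head_kv) {
    return q_head / (n_head / n_head_kv);
}

// Modulo tiling (what plain ggml_repeat produced here): interleaved
// assignment -> 0,1,...,Hkv-1,0,1,...
int kv_head_modulo(int q_head, int n_head, int n_head_kv) {
    (void)n_head;
    return q_head % n_head_kv;
}
```

With n_head = 8 and n_head_kv = 2, block broadcast assigns KV head 0 to Q heads 0-3 and KV head 1 to Q heads 4-7, while modulo tiling alternates 0,1,0,1,... — every other Q head attends against the wrong KV head.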
…s layers

Cross-attention with TQ3_0 KV cache produced accept_rate=0 because three separate issues compounded:

1. K/V views were cast from TQ3_0 to F16/F32 before ggml_flash_attn_ext. The CUDA FA kernels apply forward FWHT to Q (and inverse FWHT to the output) only when they observe K->type == GGML_TYPE_TQ3_0 (fattn-chunked.cu:228,394; fattn-vec.cuh:168). Casting stripped the type tag, FA picked a non-WHT kernel, and Q (real domain) dotted with K (FWHT domain, just unpacked into F16) produced meaningless scores. Removed the cast; Kfa/Vfa now reach FA with native TQ3_0.

2. TQ3_0 K is iterated in 128-element block strides; an unaligned ne[1] reads past the valid window into stale cache cells. Previously we only padded for head_dim>=512; SWA layers (head_dim=256) skipped padding and silently corrupted attention. Extended needs_kv_pad to fire for any TQ3_0 cache, mirroring gemma4_target_graph.cpp's need_256_pad policy.

3. Each layer created its own FA mask input tensor but only the last one was exposed via out.fa_mask. After fix #2 all four layers needed masks; the unfilled mask buffers contained uninitialised CUDA memory (cudaMalloc is not zeroed), causing NaN logits on subsequent steps. Hoisted a single shared mask out of the per-layer loop. The builder now asserts that all need-mask layers want the same (width, kv_seq_len) and fails loudly if a future long-context build wants per-layer masks (SWA cap < full attn_pos), instead of silently doing the wrong thing.

Trajectory:
- pre-fix: accept_rate = 0.00 (varying garbage tokens)
- fix #1 only: accept_rate = 0.00 (drafts pinned to a single token)
- fix #1+#2: step 1 OK, step 2+ NaN
- fix #1+#2+#3: accept_rate = 0.22 (Q4_K_M target + Q8_0 assistant, TQ3_0 KV, 131-token prompt, 64 generation steps)

Adjacent infrastructure:
- create_gemma4_cache(): extra_q8_layers param to force Q8_0 on specific MTP donor layers when needed.
- get_mtp_swa_pattern(): lightweight helper reading the MTP SWA layout from GGUF without loading tensors.
- MTP loader: load centroids/token_ordering whenever n_centroids>0 (the graph builder decides whether to use them).
- Test caller: fills out.fa_mask before each compute; dropped the per-step diagnostic prints that are no longer needed.

Known follow-ups (not blocking):
- Long-context multi-mask: SWA cap < full attn_pos trips the assert.
- SWA-wrap branch concat-forces F32 on TQ3_0, losing the WHT path.
- Accept rate 0.22 is in expected range; the remaining gap to spike's reference numbers may come from quantization, RoPE source, or attention scale.
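The padding policy from fix #2 above is simple round-up arithmetic: the KV sequence length must land on a TQ3_0 block-stride boundary so K iteration never reads past the valid window. A sketch (the 128-element stride comes from the commit notes; the helper name is hypothetical):

```cpp
#include <cstdint>

// Round the KV sequence length up to a multiple of the TQ3_0 block stride.
// Cells between the valid length and the padded length must be masked out,
// otherwise the kernel reads stale cache contents.
int64_t pad_kv_len(int64_t n_kv, int64_t block_stride = 128) {
    return ((n_kv + block_stride - 1) / block_stride) * block_stride;
}
```

For the 131-token prompt in the trajectory above, this pads the view to 256 slots; positions 131-255 carry -inf mask bias so they contribute nothing to the softmax.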
2 issues found across 8 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/test/test_mtp_loader.cpp">
<violation number="1" location="dflash/test/test_mtp_loader.cpp:111">
P2: Donor-layer check is too weak: it only bounds-checks `donor_target_layer` instead of verifying the expected target-layer mapping.</violation>
</file>
<file name="dflash/src/gemma4_mtp_graph.cpp">
<violation number="1" location="dflash/src/gemma4_mtp_graph.cpp:619">
P2: Centroid-head shape/index invariants are assumed but never validated, so mismatched vocab sizes, non-divisible `n_vocab/n_centroids`, or out-of-range `top_k` can crash or silently corrupt logits.</violation>
</file>
// each MTP layer's donor must be the LAST target layer matching its own
// SWA/full type. This must be filled by the loader, not hard-coded.
for (size_t il = 0; il < mtp.layers.size(); ++il) {
    if (mtp.layers[il].donor_target_layer < 0 ||
P2: Donor-layer check is too weak: it only bounds-checks donor_target_layer instead of verifying the expected target-layer mapping.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/test_mtp_loader.cpp, line 111:
<comment>Donor-layer check is too weak: it only bounds-checks `donor_target_layer` instead of verifying the expected target-layer mapping.</comment>
<file context>
@@ -0,0 +1,123 @@
+ // each MTP layer's donor must be the LAST target layer matching its own
+ // SWA/full type. This must be filled by the loader, not hard-coded.
+ for (size_t il = 0; il < mtp.layers.size(); ++il) {
+ if (mtp.layers[il].donor_target_layer < 0 ||
+ mtp.layers[il].donor_target_layer >= 60) {
+ std::fprintf(stderr, " layer %zu donor_target_layer=%d out of [0,60)\n",
</file context>
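A stricter check along the lines the review suggests would derive the expected donor from the target's SWA pattern and compare, rather than only bounds-checking. A sketch (the helper name and signature are hypothetical, not the test's actual API):

```cpp
#include <vector>

// Expected donor for an MTP layer: the LAST target layer whose SWA/full
// type matches the MTP layer's own type. Returns -1 if no layer matches.
int expected_donor(const std::vector<bool>& target_is_swa, bool mtp_is_swa) {
    for (int il = (int)target_is_swa.size() - 1; il >= 0; --il) {
        if (target_is_swa[il] == mtp_is_swa) return il;
    }
    return -1;
}
```

The test could then assert `mtp.layers[il].donor_target_layer == expected_donor(...)` instead of only checking the `[0, 60)` range.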
const int64_t n_c = (int64_t)w.n_centroids;
const int64_t top_k = (int64_t)w.centroid_top_k;
// vsc: tokens per centroid slot
const int64_t vsc = (int64_t)n_vocab / n_c;
P2: Centroid-head shape/index invariants are assumed but never validated, so mismatched vocab sizes, non-divisible n_vocab/n_centroids, or out-of-range top_k can crash or silently corrupt logits.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/gemma4_mtp_graph.cpp, line 619:
<comment>Centroid-head shape/index invariants are assumed but never validated, so mismatched vocab sizes, non-divisible `n_vocab/n_centroids`, or out-of-range `top_k` can crash or silently corrupt logits.</comment>
<file context>
@@ -0,0 +1,744 @@
+ const int64_t n_c = (int64_t)w.n_centroids;
+ const int64_t top_k = (int64_t)w.centroid_top_k;
+ // vsc: tokens per centroid slot
+ const int64_t vsc = (int64_t)n_vocab / n_c;
+
+ // centroid_logits = mul_mat(centroids, h_inner) → [n_centroids, 1]
</file context>
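The invariants the review wants validated before the centroid head runs can be collected into one guard. A sketch of the checks implied by the snippet (the function is hypothetical; the real fix would live in gemma4_mtp_graph.cpp):

```cpp
#include <cstdint>

// Guard the centroid-head assumptions: vsc = n_vocab / n_centroids must
// divide evenly (otherwise the token -> slot mapping drops or aliases
// vocab entries), and top_k must select a valid number of centroids.
bool centroid_head_valid(int64_t n_vocab, int64_t n_centroids, int64_t top_k) {
    return n_centroids > 0
        && n_vocab % n_centroids == 0   // vsc must be exact
        && top_k > 0
        && top_k <= n_centroids;        // cannot select more centroids than exist
}
```

Failing loudly on this guard turns a silent logit corruption into an immediate load-time error.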
Summary
Brings the Gemma4 family (31B Dense, 26B-A4B MoE) to production parity with Qwen3.5: chunked batched prefill, DFlash speculative decoding, pFlash block-sparse attention via a new ggml op, and a daemon mode wired into `scripts/server.py` so the OpenAI-compatible HTTP server can serve Gemma4 with `--pflash`.

Benchmarks (RTX 3090, single GPU, Q4_K_M weights, Q8_0 KV)
All numbers post-correctness-fix (SWA mask coordinate frame + Q8/Q4 dequant in sparse FA). Output verified valid Gemma4 vocab tokens.
Gemma-4-31B Dense
Gemma-4-26B-A4B MoE — pFlash big wins at long context
* Decode at 64K shows low DFlash acceptance (draft model diverges from target at very long context); spec decoding is mostly serial. Prefill speedup is unaffected. Investigating draft model context handling separately.
VRAM stays under 21 GB at 64K on the 24 GB RTX 3090.
Speedup scaling
pFlash savings scale with KV-len because attention block selection skips blocks proportional to context length. The progression (4.5% → 16.6% → 53.7% → 101.7%) matches the design.
Highlights
New
- New `GGML_OP_FLASH_ATTN_SPARSE` ggml op (submodule)
- `ggml-cuda/fattn-sparse.cu` with S↔H transpose for ggml ↔ pFlash layout conversion
- Falls back to `ggml_flash_attn_ext` when no kernel registered
- `ggml_get_to_fp16_cuda` dequant before the sparse path
- Commits: 5be140d feat(ggml): add GGML_OP_FLASH_ATTN_SPARSE ..., 866688b feat(ggml-cuda): dequantize K/V to F16 in sparse FA path

pFlash + DFlash + chunked prefill on Gemma4
- `ggml_flash_attn_sparse` wired into `build_full_attn_block()` (full-attention layers only; SWA layers stay on dense FA)
- `pflash_ggml_adapter` with `pflash_supports = {F16, Q8_0, Q4_0}`; TQ3_0 falls back to dense
- `last_token_logits_only` ported from upstream PR #108 (perf: Replace Q8_0 format for KV with Q4_0 + Rotation, fix window_filled for long context) — saves ~1GB output tensor and ~1000x lm_head compute per non-last prefill chunk

Major correctness fix: SWA mask coordinate frame
The host-built SWA causal mask was in absolute KV coordinates, but the FA kernel reads it indexed by view position. For every prefill chunk where `kv_start > 0`, the mask was misaligned by `ring_win_start` columns → the kernel saw all `-inf` for SWA layers → softmax NaN. Output changed from token 236770 (broken) to 236799 (correct) after this fix.

The fix introduces a shared `compute_swa_view()` helper used by both the graph builder and the test driver so the K view + mask stay in lockstep.

Daemon mode + server routing
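The coordinate-frame fix amounts to building the mask in view coordinates: column j of the mask corresponds to absolute KV position ring_win_start + j. A minimal sketch of that construction (illustration only; the real logic lives in the shared helper):

```cpp
#include <cmath>
#include <vector>

// Build a 1-row SWA causal mask indexed by VIEW position, not absolute KV
// position. A key at absolute position p is visible to the query at q_pos
// iff it is not in the future and falls inside the sliding window.
std::vector<float> build_swa_mask(int q_pos, int kv_view_len,
                                  int ring_win_start, int swa_window) {
    std::vector<float> mask(kv_view_len);
    for (int j = 0; j < kv_view_len; ++j) {
        const int p = ring_win_start + j;  // view column -> absolute KV position
        const bool visible = (p <= q_pos) && (p > q_pos - swa_window);
        mask[j] = visible ? 0.0f : -INFINITY;
    }
    return mask;
}
```

Indexing by absolute position instead (i.e. writing bias at column p rather than p - ring_win_start) reproduces the bug: every valid column of the view reads a stale or -inf entry once kv_start > 0.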
- `test_gemma4_dflash --daemon` mirrors the IPC protocol of `test_dflash`: line-based stdin commands, int32 LE token stream on `--stream-fd=N`, `-1` sentinel
- `scripts/server.py` detects the GGUF architecture; routes to `test_gemma4_dflash` for `gemma4`, keeps `test_dflash` for Qwen3.5
- `--pflash` server flag passes through to the daemon
- `-1` sentinel streamed on fd=3 with 26B-A4B + a 4096-token prompt

Run the server
Known limitations
- `ggml_turbo_wht` calls (mirroring Qwen3.5) didn't resolve it on Gemma4; needs hardware-level GPU debugging. Q8_0 / Q4_0 / F16 KV all work correctly.

Test plan
- `test_flash_attn_sparse` (TDD: dense vs sparse @ alpha=1.0 within BF16 tolerance; alpha<1.0 liveness) — passes
- `test_gemma4_dflash --pflash` at 4K/8K Q8 on 31B Dense — output matches baseline at 4K, sparse approx at 8K
- `-1` sentinel streamed on fd=3

Submodule
Submodule pointer is bumped on `dusterbloom/llama-cpp-turboquant-cuda` branch `feature/tq3-kv-cache`. The `.gitmodules` URL was updated to that fork because the upstream `Luce-Org/llama.cpp-dflash-ggml.git` doesn't have these commits. The maintainer can rewrite the URL post-merge if the commits get mirrored to a Luce-Org repo.