
Gemma4 support: pFlash + DFlash + chunked prefill, daemon mode, server routing #131

Open
dusterbloom wants to merge 35 commits into Luce-Org:main from dusterbloom:feature/gemma4-support

Conversation

Contributor

@dusterbloom dusterbloom commented May 8, 2026

Summary

Brings the Gemma4 family (31B Dense, 26B-A4B MoE) to production parity with Qwen3.5: chunked batched prefill, DFlash speculative decoding, pFlash block-sparse attention via a new ggml op, and a daemon mode wired into scripts/server.py so the OpenAI-compatible HTTP server can serve Gemma4 with --pflash.

Benchmarks (RTX 3090, single GPU, Q4_K_M weights, Q8_0 KV)

All numbers post-correctness-fix (SWA mask coordinate frame + Q8/Q4 dequant in sparse FA). Output verified valid Gemma4 vocab tokens.

Gemma-4-31B Dense

| Context | Prefill baseline | Prefill +pflash | Speedup | Decode +pflash | Accept |
|---------|------------------|-----------------|---------|----------------|--------|
| 4K | 1438 tok/s | 1502 tok/s | +4.5% | 149 tok/s | 10.67/16 |
| 8K | 1284 tok/s | 1497 tok/s | +16.6% | 100 tok/s | 10.67/16 |

Gemma-4-26B-A4B MoE — pFlash big wins at long context

| Context | Prefill baseline | Prefill +pflash | Speedup | Decode +pflash | Accept |
|---------|------------------|-----------------|---------|----------------|--------|
| 8K | 3400 tok/s | (TBD) | (TBD) | 117 tok/s | 8.0/16 |
| 32K | 2530 tok/s | 3888 tok/s | +53.7% | 133 tok/s | 10.67/16 |
| 64K | 1997 tok/s | 4028 tok/s | +101.7% | 13 tok/s* | 1.23/16* |

* Decode at 64K shows low DFlash acceptance (draft model diverges from target at very long context); spec decoding is mostly serial. Prefill speedup is unaffected. Investigating draft model context handling separately.

VRAM stays under 21 GB at 64K on the 24 GB RTX 3090.

Speedup scaling

pFlash savings scale with KV length because attention block selection skips a share of blocks that grows with context length. The progression (4.5% → 16.6% → 53.7% → 101.7%) matches the design.
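
To make the scaling intuition concrete, here is a purely illustrative C++ estimate (not the actual pFlash selection logic, which is driven by --pflash-alpha): assuming each query block keeps a fixed budget of KV blocks, the surviving share of block-pairs shrinks as context grows, so the relative saving grows with KV length. The block size and budget below are hypothetical.

#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
    const int block  = 128;   // tokens per attention block (assumed)
    const int budget = 16;    // KV blocks kept per query block (assumed)
    for (int ctx : {4096, 8192, 32768, 65536}) {
        const long long n_blocks = ctx / block;
        // causal dense attends to ~half the KV blocks per query block on average
        const double dense  = n_blocks / 2.0;
        const double sparse = std::min<double>(budget, dense);
        std::printf("ctx=%6d  kept %5.1f%% of block-pairs\n",
                    ctx, 100.0 * sparse / dense);
    }
    return 0;
}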

Highlights

New GGML_OP_FLASH_ATTN_SPARSE ggml op (submodule)

  • CUDA dispatch in ggml-cuda/fattn-sparse.cu with S↔H transpose for ggml ↔ pFlash layout conversion
  • BF16 fast path; falls back to dense ggml_flash_attn_ext when no kernel registered
  • Q8_0 / Q4_0 K/V supported via ggml_get_to_fp16_cuda dequant before the sparse path
  • Submodule commits: 5be140d feat(ggml): add GGML_OP_FLASH_ATTN_SPARSE ..., 866688b feat(ggml-cuda): dequantize K/V to F16 in sparse FA path

pFlash + DFlash + chunked prefill on Gemma4

  • ggml_flash_attn_sparse wired into build_full_attn_block() (full-attention layers only; SWA layers stay on dense FA)
  • pFlash CUDA kernel registered via new pflash_ggml_adapter
  • Type-aware dispatch: pflash_supports = {F16, Q8_0, Q4_0}; TQ3_0 falls back to dense (see the sketch after this list)
  • last_token_logits_only ported from upstream PR #108 (perf: Replace Q8_0 format for KV with Q4_0 + Rotation, fix window_filled for long context); saves ~1GB output tensor and ~1000x lm_head compute per non-last prefill chunk
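
A minimal sketch of the type-aware dispatch described above, roughly as it could sit inside build_full_attn_block(). The ggml_flash_attn_sparse signature (mask dropped, extra alpha argument) is an assumption based on this PR's description; ggml_flash_attn_ext uses the mainline ggml signature.

// Sketch only, not the actual graph-builder code.
#include "ggml.h"

static bool pflash_supports(ggml_type t) {
    return t == GGML_TYPE_F16 || t == GGML_TYPE_Q8_0 || t == GGML_TYPE_Q4_0;
}

static ggml_tensor * full_attn(ggml_context * ctx,
                               ggml_tensor * q, ggml_tensor * k, ggml_tensor * v,
                               ggml_tensor * mask, float scale,
                               bool use_pflash, float alpha) {
    if (use_pflash && pflash_supports(k->type) && pflash_supports(v->type)) {
        // block-sparse path: block-level causal masking is applied inside the op
        return ggml_flash_attn_sparse(ctx, q, k, v, /*mask=*/NULL, scale, alpha);
    }
    // dense fallback (e.g. TQ3_0 KV, or no sparse kernel registered)
    return ggml_flash_attn_ext(ctx, q, k, v, mask, scale,
                               /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
}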

Major correctness fix: SWA mask coordinate frame

The host-built SWA causal mask was in absolute KV coordinates but the FA kernel reads it indexed by view position. For every prefill chunk where kv_start > 0, the mask was misaligned by ring_win_start columns → kernel saw all -inf for SWA layers → softmax NaN.

  • Q8/F16 KV: degraded but plausible-looking output (NaN absorbed by saturating arithmetic). Visible: 8K Q8 baseline output token changed from 236770 (broken) to 236799 (correct) after this fix.
  • TQ3_0 KV: clean NaN propagation → argmax returns 0.

Fix introduces a shared compute_swa_view() helper used by both the graph builder and the test driver so K view + mask stay in lockstep.
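
A condensed sketch of the corrected mask fill, in K-view coordinates (variable names are illustrative; the real code derives abs_win_start / win_len from compute_swa_view(), and the exact window inclusivity convention here is an assumption):

// Column k_view maps to absolute key position abs_win_start + k_view; a key is
// admitted iff it is causal and inside the query's sliding window.
#include <cmath>
#include <cstdint>
#include <vector>

static void build_swa_mask_view(std::vector<float> & mask,   // n_tokens * kv_pad floats
                                int64_t kv_pad, int64_t win_len, int64_t abs_win_start,
                                int64_t kv_start, int64_t n_tokens, int64_t swa_window) {
    for (int64_t q = 0; q < n_tokens; ++q) {
        const int64_t q_abs = kv_start + q;                 // absolute query position
        for (int64_t k_view = 0; k_view < kv_pad; ++k_view) {
            const int64_t k_abs = abs_win_start + k_view;   // absolute key position
            const bool admit = k_view < win_len &&
                               k_abs <= q_abs &&            // causal
                               k_abs > q_abs - swa_window;  // inside the sliding window
            mask[q * kv_pad + k_view] = admit ? 0.0f : -INFINITY;
        }
    }
}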

Daemon mode + server routing

  • test_gemma4_dflash --daemon mirrors the IPC protocol of test_dflash: line-based stdin commands, int32 LE token stream on --stream-fd=N, -1 sentinel
  • scripts/server.py detects GGUF architecture; routes to test_gemma4_dflash for gemma4, keeps test_dflash for Qwen3.5
  • New --pflash server flag passes through to the daemon
  • Smoke test: 16 valid Gemma4 vocab tokens + -1 sentinel streamed on fd=3 with 26B-A4B + 4096-token prompt

Run the server

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --draft  /path/to/gemma4-26b-a4b-dflash/ \
  --pflash \
  --port 8000

Known limitations

  • TQ3_0 + chunked prefill produces token 0 (independent of pflash). Both a Claude debug-agent and Codex independently identified that the chunked FA kernel doesn't un-rotate FWHT'd K/V. Restoring graph-level ggml_turbo_wht calls (mirroring Qwen3.5) didn't resolve it on Gemma4. Needs hardware-level GPU debugging. Q8_0 / Q4_0 / F16 KV all work correctly.
  • DFlash spec acceptance drops at 64K (avg 1.23/16 vs 10.67/16 at 32K). Prefill speedup unaffected; decode tok/s is essentially serial at 64K. Draft model context handling needs separate investigation.

Test plan

  • test_flash_attn_sparse (TDD: dense vs sparse @ alpha=1.0 within BF16 tolerance; alpha<1.0 liveness) — passes
  • test_gemma4_dflash --pflash at 4K/8K Q8 on 31B Dense — output matches baseline at 4K, sparse approx at 8K
  • Gemma-4-26B-A4B MoE chunked baseline + pflash at 32K (53.7% speedup) and 64K (2.02x speedup)
  • Daemon mode smoke test — 16 valid Gemma4 vocab tokens + -1 sentinel streamed on fd=3
  • Server end-to-end OpenAI API request against deployed Gemma4 (manual)

Submodule

Submodule pointer is bumped on dusterbloom/llama-cpp-turboquant-cuda feature/tq3-kv-cache. The .gitmodules URL was updated to that fork because the upstream Luce-Org/llama.cpp-dflash-ggml.git doesn't have these commits. Maintainer can rewrite the URL post-merge if the commits get mirrored to a Luce-Org repo.

dusterbloom added 20 commits May 7, 2026 10:38
Full implementation of Gemma4 architecture for lucebox-hub DFlash:

Target model (GGUF loader + forward pass graph builder):
- Per-layer head_count_kv array (8 for SWA, 2 for full-attention)
- Dual head_dim: 256 (SWA) / 512 (full-attention) with correct cache sizing
- V=K sharing on full-attention layers (attention_k_eq_v)
- MoE FFN: 128 experts, top-8 routing with shared expert + softmax gating
- Sliding window attention pattern from BOOL GGUF array
- Proportional RoPE (p-RoPE) with per-layer freq_factors
- Embedding scaled by sqrt(hidden_size) per HF reference
- CUDA FA 256-alignment for head_dim>=512 (FATTN_KQ_STRIDE)
- TurboQuant TQ3_0 KV cache with 256-byte alignment padding
- Logit softcapping: 30 * tanh(logits / 30) (see the sketch below)

Draft model (safetensors loader + forward pass):
- 5-layer transformer with SwiGLU FFN
- FC projection: 6 * target_hidden -> draft_hidden
- Tied LM head using target tok_embd
- Block-diffusion speculative decoding architecture
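
The logit softcapping above is a simple pointwise transform; a hedged ggml sketch of it (graph-builder fragment, helper name hypothetical):

// logits = cap * tanh(logits / cap), with cap = 30 for Gemma4.
#include "ggml.h"

static ggml_tensor * softcap_logits(ggml_context * ctx, ggml_tensor * logits, float cap) {
    ggml_tensor * cur = ggml_scale(ctx, logits, 1.0f / cap);
    cur = ggml_tanh(ctx, cur);
    return ggml_scale(ctx, cur, cap);
}
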
5 smoke tests validating the Gemma4 implementation:
- smoke_load_gemma4_target: GGUF metadata, per-layer head_kv, SWA pattern
- smoke_gemma4_target_forward: full 26B-A4B forward pass, logits in [-30,30]
- smoke_load_gemma4_draft: safetensors loading, fc/layer shape validation
- smoke_gemma4_draft_forward: draft forward with injected target tok_embd
- test_gemma4_kv_tq3: TQ3 cache 256-alignment, shared layer donors

Plus test_gemma4_dflash driver for combined target+draft benchmarking.
The evenly-spaced formula produced wrong IDs for both Gemma4 variants.
Use the actual values from the z-lab DFlash draft model config.json:
- 26B-A4B (30 layers): {1, 6, 11, 17, 22, 27}
- 31B (60 layers): {1, 12, 23, 35, 46, 57}
Fall back to evenly-spaced for unknown layer counts.
The draft model was stateless (no KV cache), giving 0% speculative
acceptance.  Add prefix-direct KV materialization: target features are
projected through FC → hidden_norm → per-layer K/V, stored in a
dedicated draft KV cache.  The draft forward now attends to this
cache, matching the SGLang/vLLM DFlash architecture.

Gemma4-26B-A4B with draft: avg 10.67 tokens accepted per step,
~250 tok/s decode on RTX 3090 (vs ~67 tok/s baseline).
Replace single-token autoregressive prefill with chunked batched forward.
Each chunk processes up to swa_window tokens in a single GPU dispatch,
cutting prefill from ~66 tok/s to ~830-1060 tok/s on RTX 3090.

Add swa_mask to GemmaGraphInputs so SWA attention layers use a
sliding-window mask during batched prefill while full-attention layers
keep the standard causal mask.
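
Schematically, the chunk loop looks like the sketch below; forward_chunk stands in for the per-chunk graph build + compute and is purely illustrative, not the actual driver API.

// Batched chunked prefill: each chunk of up to chunk_size prompt tokens is
// processed in one forward dispatch; kv_start tracks how much of the prompt
// is already in the KV cache.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

using ChunkForward = std::function<void(const int32_t * tokens, int64_t n_tokens,
                                        int64_t kv_start)>;

static void chunked_prefill(const std::vector<int32_t> & prompt,
                            int64_t chunk_size,  // e.g. swa_window
                            const ChunkForward & forward_chunk) {
    int64_t kv_start = 0;
    while (kv_start < (int64_t) prompt.size()) {
        const int64_t n = std::min<int64_t>(chunk_size, (int64_t) prompt.size() - kv_start);
        forward_chunk(prompt.data() + kv_start, n, kv_start);
        kv_start += n;
    }
}
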
Add --csv flag for direct use with test_gemma4_dflash --tokens.
Default model changed to google/gemma-4-26b-a4b-it. Add --verbose
flag, local_files_only caching, and --add-bos option.
Converts BF16 safetensors draft weights to Q8_0 GGUF format.
Projection weights quantized to Q8_0 (~50% size), norms kept F32.
Includes Gemma4-specific GGUF metadata (sliding_window, logit_softcap,
target_layer_ids). Requires a C++ GGUF loader to be used at inference.
Three bugs prevented coherent speculative decoding output:

1. Missing BOS token: Gemma4 requires BOS (token 2) at position 0.
   Auto-prepend from GGUF bos_token_id when not already present (see the
   sketch below).

2. Missing EOT fallback: many Gemma4 GGUFs omit eot_token_id, so
   eos_chat_id stayed -1 and <end_of_turn> (107) was never caught.
   Default to 107 when the key is absent.

3. Uninitialized SWA mask in speculative verify: when n_tokens > 1,
   build_gemma4_step allocates swa_mask but only attn_mask was filled.
   SWA layers used garbage memory, corrupting all hidden states and
   collapsing output to token 0 (padding) from step 2 onward.

Verified: DFlash now produces identical output to AR baseline and
stops at EOS. Gemma4-31B Q4_K_M + TQ3_0 KV = 80.82 tok/s (2.37x
over AR 34.14 tok/s) on RTX 3090.
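
A hedged sketch of fixes 1 and 2 (token IDs as stated above; the helper name and signature are illustrative):

// Prepend BOS when missing, and default the end-of-turn id when the GGUF omits it.
#include <cstdint>
#include <vector>

static void prepare_gemma4_tokens(std::vector<int32_t> & tokens,
                                  int32_t bos_id,    // from GGUF bos_token_id, e.g. 2
                                  int32_t & eot_id)  // -1 if the GGUF key is absent
{
    if (tokens.empty() || tokens.front() != bos_id) {
        tokens.insert(tokens.begin(), bos_id);       // fix 1: auto-prepend BOS
    }
    if (eot_id < 0) {
        eot_id = 107;                                // fix 2: <end_of_turn> fallback
    }
}
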
… script

load_gemma4_draft_gguf() reads Q8_0-quantized draft weights from GGUF,
auto-detected by .gguf extension on --draft path. Q8_0 drafter matches
BF16 acceptance (AL=6.74) while loading 44% faster and using 380MB less VRAM.

quantize_gemma4_draft_q8.py now reads config.json for model dimensions
instead of hardcoding 26B constants, supporting both 26B-A4B and 31B drafters.
…ttention

Layer-by-layer prefill using FlashPrefill block-sparse WMMA attention for
full-attention layers and ggml FA for SWA layers. Includes gallocr
pre-reserve to eliminate graph allocator overhead and fused [B+SWA] graphs
to reduce hidden_buf round-trips.

Benchmarks at 6K tokens (26B-A4B): 4073 tok/s (+12% over chunked prefill).
Real gains expected at 64K+ where attention density drops below 10%.
Add --pflash, --pflash-alpha, and --tokens-file flags to test harness.
--tokens-file reads comma-separated IDs from a file, bypassing ARG_MAX
limits for prompts >16K tokens.

Fix draft KV cache overflow crash when prompt exceeds draft sliding window
(2096 slots). Clamp prefill to trailing window, adjust ring-buffer offset,
and add defensive assert in build_draft_kv_prefill_graph().
… in FA

FWHT rotation for TQ3_0 KV cache is now handled inside the Flash Attention
CUDA kernel via warp-cooperative shuffle. Remove the separate ggml_turbo_wht
graph ops from build_swa_attn_block() and build_full_attn_block().
SWA layers only need swa_window slots, not the full context. At 64K with
Gemma4 (50 SWA, 10 full-attn layers), this saves 81.8% of KV VRAM.

Ring-buffer read/write positions use modular arithmetic so SWA cache views
never exceed tensor boundaries at long contexts.

Verified: 31B Dense at 64K uses 22.06 GB (target-only), 24.00 GB (full stack
with Q8_0 draft + TQ3_0 KV + DFlash decode at 29.26 tok/s).
After prefill fills all 2096 draft KV slots, the first decode step would
crash with "draft KV overflow". Now wraps draft_kv_pos with modulo
arithmetic, treating the draft cache as a ring buffer.
Decouple Graph A/B chunk size (32K) from SWA window (1K-2K). Batch
consecutive SWA layers into single ggml graphs to reduce graph build
overhead. SWA_CHUNK now tracks actual cache allocation.

Full-attn layers keep the existing Graph A → pFlash → Graph B path.
pFlash integration into single-graph-per-chunk architecture is next.
Replaces the layer-by-layer gemma4_pflash_prefill() with a single-graph-
per-chunk path using the new GGML_OP_FLASH_ATTN_SPARSE op for full-
attention layers. SWA layers continue to use ggml_flash_attn_ext.

Perf (MoE 26B-A4B at 64K, RTX 3090, Q8_0 KV):
  chunked baseline:  1867 tok/s prefill, 100.6 tok/s decode, 10.67/16 accept
  + --pflash:        3374 tok/s prefill (1.81x), 101.8 tok/s decode

Changes:
- Adapter (pflash_ggml_adapter.cpp/h) registers the pFlash CUDA kernel
  with the ggml op. Maps alpha>=1.0 to fully-dense mode.
- build_full_attn_block() conditionally uses ggml_flash_attn_sparse
  when use_pflash is set.
- attn_mask is skipped (in graph + driver) when use_pflash=true since
  the sparse op applies block-level causal internally.
- gemma4_pflash_prefill.cpp removed (replaced by chunked path).
- test/test_flash_attn_sparse.cpp: TDD coverage for the ggml op
  (dense vs sparse @ alpha=1.0 within BF16 precision; alpha<1.0 liveness).

Ported upstream fixes:
- TQ3_0 mask stride (PR Luce-Org#128): bump g_kq_stride_pad to 256 when KV is
  selected via DFLASH27B_KV_K/V env vars. Prevents NaN at chunk sizes
  256/512/1024/2048 with TQ3_0 KV.
- last_token_logits_only (PR Luce-Org#108): skip lm_head matmul over all but
  last token during prefill chunks. Saves ~1GB output tensor and
  ~1000x lm_head compute per chunk on Gemma4-31B (vocab=262144).
Three correctness fixes after benchmarking exposed silent corruption
when --pflash was combined with quantized KV:

1. Graph-level type check in build_full_attn_block: dispatch to
   ggml_flash_attn_sparse only when K/V are F16/Q8_0/Q4_0. TQ3_0 falls
   back to ggml_flash_attn_ext because TQ3's WHT rotation requires
   special handling not yet in the sparse path.

2. Always allocate attn_mask in test_gemma4_dflash (previously skipped
   when use_pflash=true). When some full-attn layers fall back to dense
   FA (non-supported KV types), the mask is required.

3. Guard ggml_backend_tensor_set on attn_mask/swa_mask buffer existence:
   when all full-attn layers use sparse FA, the mask tensor is
   unreferenced by any compute op so gallocr leaves its buffer NULL.
   ggml_set_output is added as a hint but doesn't force allocation;
   skip the write when buffer is NULL. swa_mask gets the same defensive
   check.

Measured on Gemma-4-31B Q4_K_M, RTX 3090, Q8_0 KV:
  4K: 1348 -> 1483 tok/s prefill (+10%), output matches baseline
  8K: 1441 -> 1546 tok/s prefill (+7.3%), block-sparse approximation

Earlier MoE 64K "1.81x speedup" claim was on the broken sparse path
(reading Q8 bytes as F16); that data point is invalid. The current
numbers are on verified-correct execution.

TQ3_0 + chunked path is broken independently of pflash (produces token
0); needs separate debug.
The host-built SWA causal mask was filled in absolute KV coordinates
(mask[q][abs_k] = 0 for valid keys) but the FA CUDA kernel reads it
indexed by view position (k_view = 0..effective_win_len-1, where slot 0
= the cache offset where the K view starts).

For every prefill chunk where kv_start > 0, the K view starts at
ring_win_start in the cache (computed in build_swa_attn_block as
kv_start - swa_window aligned to the ring buffer). The mask cell
[q][k_view=0] was written assuming absolute slot 0, which is far before
the window's lo bound, so it stayed -inf. The kernel then saw every
K-view position as -inf for q rows touching that chunk.

Symptoms:
- Q8/F16 KV: degraded but plausible-looking output (NaNs absorbed by
  saturating arithmetic; argmax landed on some non-zero index)
- TQ3_0 KV: clean NaN propagation through WHT-rotated FA path; argmax
  over NaN-containing logits returns 0 (because `if (x[i] > best)` is
  false for NaN). This is why "TQ3 produces token 0" was the visible
  failure mode.

Fix:
- Add SwaView struct + compute_swa_view() helper in internal.h /
  gemma4_target_graph.cpp encapsulating the
  (abs_win_start, effective_win_len, ring_win_start) math
- build_swa_attn_block calls the helper instead of inlining
- build_swa_causal_mask in test driver takes (abs_win_start, win_len,
  n_tokens, kv_start, swa_window); writes mask[q][k_view] for k_view
  in [0, win_len), using abs_win_start + k_view to check the absolute
  causal window
- swa_mask tensor sized [align_up(effective_win_len, g_kq_stride_pad),
  q_pad] instead of [align_up(kv_len, g_kq_stride_pad), q_pad]
- Both prefill chunk loop and spec-decode verify loop call the helper
  to get matching geometry

Measured impact (Gemma-4-31B Q4_K_M, RTX 3090):
  8K Q8 baseline last sampled token: 236770 (broken) -> 236799 (correct)
  8K Q8 +pflash:                     1284 -> 1497 tok/s (+16.6%)

Bug entered with chunked prefill (commit 7ce68ac); SWA ring-buffer
(commit f2c36bc) made the offset non-monotonic in kv_start.

The reference Qwen3.5 driver (test/test_dflash.cpp:547-565) already had
this correct via `out_mask[q*kv_pad + (k - win_start)]`.

TQ3_0 still produces token 0 after this fix; that is a separate
TQ3-specific bug.
Wires the Gemma4 binary into scripts/server.py so the OpenAI-compatible
HTTP server can serve Gemma-4-31B and Gemma-4-26B-A4B (with the pFlash
+ DFlash + chunked prefill stack we built this session).

## test/test_gemma4_dflash.cpp

Added a daemon mode that mirrors the IPC protocol used by test_dflash
(Qwen3.5 binary):

- New flags: --daemon, --stream-fd=N, --max-ctx=N (alias for --ctx-size)
- No-op flags accepted for cmdline compatibility with server.py:
  --fast-rollback, --ddtree, --ddtree-budget=B, --ddtree-temp=F,
  --ddtree-no-chain-seed
- After model load, prints "[daemon] ready" to stdout and enters a
  stdin loop reading line-based commands
- Supported command: <prompt_bin_path> <n_gen> [samp=t,p,k,r[,seed]]
- prompt_bin_path is a binary file of int32 LE token IDs
- Each generated token is written as int32 LE to stream_fd; -1 sentinel
  marks end of generation (see the sketch after this list)
- Unsupported commands (RESTORE, SNAPSHOT, compress, park, ...) are
  acknowledged with -1 sentinel for now (out of scope for v1)
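
A minimal sketch of the generation-side half of that protocol (illustrative only, not the daemon's actual command parser; error handling omitted, and a little-endian host is assumed, as on the x86_64 boxes benchmarked here):

// Write each generated token as little-endian int32 to stream_fd, then a -1
// sentinel to mark end of generation, mirroring the test_dflash IPC protocol.
#include <cstdint>
#include <unistd.h>
#include <vector>

static void stream_tokens(int stream_fd, const std::vector<int32_t> & tokens) {
    for (int32_t tok : tokens) {
        (void) write(stream_fd, &tok, sizeof(tok));
    }
    const int32_t sentinel = -1;
    (void) write(stream_fd, &sentinel, sizeof(sentinel));
}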

## scripts/server.py

- _read_gguf_architecture() reads general.architecture from a GGUF
- main() detects "gemma4" and switches DEFAULT_BIN to test_gemma4_dflash
- For Gemma4 the draft argument stays as a directory (matching the
  binary's CLI); for Qwen3 it stays a file as before
- Daemon command is built differently per arch: Gemma4 uses --model /
  --draft named flags and accepts --pflash, Qwen3 keeps the existing
  positional form
- New top-level --pflash flag passes through to the Gemma4 daemon

Smoke-tested locally with the 26B-A4B model + 4096-token prompt, n_gen=16:
daemon prints "[daemon] ready", consumes the binary prompt file, runs
chunked prefill, decodes 16 tokens streamed as int32 LE on fd=3, and
emits the -1 sentinel. Tokens are valid Gemma4 vocab IDs.
The parent's submodule pointer references commits that live only on
github.com/dusterbloom/llama-cpp-turboquant-cuda (our pflash sparse-FA
work). Update .gitmodules so cloners fetch from that fork instead of
the upstream Luce-Org/llama.cpp-dflash-ggml repo (which doesn't have
these commits).

Maintainer can rewrite this URL post-merge if the commits get
mirrored to a Luce-Org repo.

@cubic-dev-ai cubic-dev-ai Bot left a comment


11 issues found across 20 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/errors.cpp">

<violation number="1" location="dflash/src/errors.cpp:30">
P2: Returns a pointer into shared mutable error storage after unlocking, so concurrent `set_last_error()` calls can invalidate the returned `const char *`.</violation>
</file>

<file name="dflash/scripts/quantize_gemma4_draft_q8.py">

<violation number="1" location="dflash/scripts/quantize_gemma4_draft_q8.py:227">
P2: Missing validation for empty `target_layer_ids` can crash quantization with modulo-by-zero when computing `TARGET_HIDDEN`.</violation>
</file>

<file name="dflash/scripts/server.py">

<violation number="1" location="dflash/scripts/server.py:69">
P2: Swallowing GGUF read errors here makes Gemma4 detection fail open, so the server silently takes the non-Gemma4 daemon path and uses the wrong argv shape instead of failing explicitly.</violation>

<violation number="2" location="dflash/scripts/server.py:893">
P2: Gemma4 draft path validation accepts non-directory paths by falling back to the parent directory, masking typos and using the wrong draft directory.</violation>
</file>

<file name="dflash/src/gemma4_target_loader.cpp">

<violation number="1" location="dflash/src/gemma4_target_loader.cpp:604">
P2: Failure paths after allocating `out.buf` return without cleaning up partial `GemmaTargetWeights`, so load errors can leak backend memory unless every caller manually frees on failure.</violation>

<violation number="2" location="dflash/src/gemma4_target_loader.cpp:675">
P2: Missing validation that `tok_embd_sz` is divisible by `n_vocab` before deriving `row_bytes` can corrupt embedding row strides for malformed GGUFs.</violation>
</file>

<file name="dflash/CMakeLists.txt">

<violation number="1" location="dflash/CMakeLists.txt:157">
P2: pFlash is gated by the first CUDA arch entry instead of the true minimum SM, which can wrongly enable sm80-only sources for unsorted mixed-arch builds.</violation>
</file>

<file name="dflash/test/test_flash_attn_sparse.cpp">

<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:107">
P2: The dense-vs-sparse correctness check is too permissive and can mask bad outputs, including non-finite values.</violation>
</file>

<file name="dflash/src/gemma4_dflash_graph.cpp">

<violation number="1" location="dflash/src/gemma4_dflash_graph.cpp:184">
P2: Missing bounds validation for kv_start + n_tokens before KV-cache writes in build_gemma4_draft_graph().</violation>
</file>

<file name="dflash/test/test_gemma4_dflash.cpp">

<violation number="1" location="dflash/test/test_gemma4_dflash.cpp:906">
P2: Daemon requests with the default seed 0 never reseed the shared RNG, so sampling becomes order-dependent across requests.</violation>

<violation number="2" location="dflash/test/test_gemma4_dflash.cpp:1591">
P2: Resetting `draft_kv_pos` to 0 on cache overflow discards the draft context instead of preserving a valid context length, so speculative decoding runs with an empty draft KV cache once capacity is reached.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@davide221
Contributor

@dusterbloom can you fix the merge conflict? Great contribution!!

Bundled defensive fixes from code review:

1. errors.cpp: thread-local snapshot of last_error before c_str() return —
   prevents concurrent set_last_error() from invalidating the returned
   pointer across threads.

2. server.py:69: log GGUF read failures to stderr instead of silently
   returning ""; prevents Gemma4 detection from failing open and using
   the wrong daemon argv shape.

3. server.py:893: explicit branches for is_dir / is_file / not-found
   on --draft path; no more silent fallback to parent directory that
   masks user typos.

4. quantize_gemma4_draft_q8.py: confirmed existing N_TARGET_LAYERS == 0
   guard at line 215 prevents the modulo-by-zero (no edit required).

5. gemma4_target_loader.cpp: cleanup_out lambda frees out.buf and
   resets state on every failure path after the buffer allocation —
   prevents backend memory leak on load errors.

6. gemma4_target_loader.cpp: validate tok_embd_sz % n_vocab == 0
   before computing row_bytes — fails fast on malformed GGUFs instead
   of corrupting embedding strides.

7. CMakeLists.txt: replace list(GET _dflash27b_archs 0 ...) with an
   explicit min loop over all configured CUDA arches — pFlash now
   correctly disables when ANY arch in the list is below sm_80.

8. test_flash_attn_sparse.cpp: add explicit non-finite (NaN/inf) check
   in the dense-vs-sparse comparison; printf reports nonfinite=YES/no
   and the return value requires both finite values and max_diff < 1.0.

9. gemma4_dflash_graph.cpp: GGML_ABORT on out-of-bounds kv_start +
   n_tokens at the top of build_gemma4_draft_graph — catch at graph
   build time instead of corrupted-memory crash later.

10. test_gemma4_dflash.cpp daemon: always reseed the RNG per request
    (random_device when seed=0); prevents order-dependent sampling
    across concurrent daemon requests.

11. test_gemma4_dflash.cpp draft KV overflow: replace the hard reset
    cache.draft_kv_pos = 0 with a sliding-window re-prefill from the
    last `keep = dkv_cap - q_len` accepted tokens. This was discarding
    ALL draft context once the ring filled, causing DFlash speculative
    acceptance to collapse from 10.67/16 (32K) to 1.23/16 (64K) — matching
    the LongSpec arXiv:2502.17421 long-context regression mode for
    EAGLE-style drafters.

Also includes the WIP TQ3 rotation infrastructure (submodule pointer
bump). Self-test DFLASH_TQ3_VERIFY=1 confirms the rotation is
mathematically reversible (max_diff=0.000000 on roundtrip). TQ3 chunked
output still wrong; the bug is downstream of rotation.
# Conflicts:
#	dflash/deps/llama.cpp
#	dflash/scripts/server.py
Two interlocking bugs were silently corrupting Gemma4 multi-chunk prefill,
producing all-zero decoded tokens (artificially high spec accept rate
because target and drafter both predict token 0 deterministically).

1. SWA ring optimization (swa_ctx_alloc = swa_window + headroom) saves
   VRAM at long contexts but ring-wraps during multi-chunk prefill. The
   K view is constrained to a single contiguous ring slice [ring_win_start,
   ring_size), which on wrap covers only the pre-wrap portion. Post-wrap
   tokens (the latest writes) are silently omitted — queries at positions
   spanning the wrap can't attend to themselves or recent context.

   Pragmatic fix: swa_ctx_alloc = max_ctx_alloc unconditionally. SWA
   layers behave like full-attn during prefill. We lose the VRAM
   optimization but restore correctness. Future work: implement
   double-view SWA reads (concat pre-wrap + post-wrap views) so the
   memory savings can come back without correctness regression.

2. SWA ring-wrap also produced a non-256-aligned win_len_padded clamp
   for TQ3_0 (which requires FATTN_KQ_STRIDE=256), causing SIGSEGV.
   Snap ring_win_start down to the nearest 256-multiple so the K view
   length stays aligned. The mask already excludes the extra padded
   tokens. Now redundant given (1) but kept as a safety net.

Also adds an env-gated [CACHE-WRITE-PROBE] in the test driver
(DFLASH_TQ3_PROBE_CACHE_WRITE=1) for future debugging.

Submodule bump pulls in:
- fix(ggml-cuda): honor view_offs in cpy data pointer
- perf(ggml-cuda): skip cudaMemGetInfo on chunked-FA hot path

Verified end-to-end on RTX 3090:
  Dense 31B + Q8 + draft @ 2.5K  = real tokens (was: all zeros)
  Dense 31B + TQ3 + draft @ 2.5K = real tokens (was: SIGSEGV)
  MoE 26B + TQ3 + draft @ 16K    = real tokens, 1969 tok/s prefill
  Dense 31B + TQ3 + draft @ 4K   = real tokens, 480 tok/s prefill
Replace the disable-fix (swa_ctx_alloc = max_ctx_alloc) with a properly-
sized ring + non-monotonic mask formula. Restores 70-95% SWA cache
VRAM savings at long contexts while keeping multi-chunk correctness.

Architecture:
  - Ring sized to hold the last R = 2 * swa_window keys (= 2 chunks worth).
    Always contains the relevant key window for any chunk, but in non-
    monotonic order after wrap (newest tokens land in pre-wrap slots).
  - K view is ALWAYS the full ring (ring_win_start = 0, len = ring_size).
    The kernel reads the full ring; correctness comes from the mask.
  - build_swa_causal_mask uses an abs_pos formula:
      latest_slot = (kv_end - 1) % ring_size
      offset_back = (latest_slot - k_view + R) % R
      abs_k       = (kv_end - 1) - offset_back
    This handles any wrap pattern correctly (see the sketch after this list).
  - K/V WRITE path splits on wrap: when kv_start % R + n_tokens > R,
    issue two ggml_cpy ops (pre-wrap [write_pos, R) + post-wrap [0, post_n)).
  - compute_swa_view returns full-ring geometry; no truncation, no
    alignment-snap, no contiguous-segment assertion.
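
Worked as code, the abs_pos mapping above looks roughly like the sketch below (helper name and argument set are illustrative; R == ring_size):

// Map ring slot k_view back to its absolute key position, then apply the
// causal / sliding-window admit test. Slots never written (abs_k < 0) are masked.
#include <cmath>
#include <cstdint>

static float swa_mask_cell(int64_t q_abs,       // absolute query position
                           int64_t k_view,      // ring slot index, 0..R-1
                           int64_t kv_end,      // number of tokens written so far
                           int64_t R,           // ring size = 2 * swa_window
                           int64_t swa_window) {
    const int64_t latest_slot = (kv_end - 1) % R;
    const int64_t offset_back = (latest_slot - k_view + R) % R;
    const int64_t abs_k       = (kv_end - 1) - offset_back;
    const bool admit = abs_k >= 0 &&                 // slot has been written
                       abs_k <= q_abs &&             // causal
                       abs_k > q_abs - swa_window;   // inside the SWA window
    return admit ? 0.0f : -INFINITY;
}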

Verified on RTX 3090, ~15 min run including TQ3 trifecta:
  T1 single-chunk @ 900 (Q8 + draft):   sampled=236774, real tokens
  T2 2-chunk @ 2.5K (Q8 + draft):       decoded 514, 4755, 822, 2864...
  T3 ring-wrapping @ 8K (Q8 + draft):   1340 tok/s, real tokens
  T4 MoE 16K + TQ3 + draft (the one):   2489 tok/s, swa=2048, saved 72.9%

VRAM at 64K Gemma4-31B: previously 5.5 GB SWA cache (disable-fix),
now ~0.18 GB (50 SWA layers * 2048 * 1792B = 30x reduction).

Submodule bump pulls in the [TQ3-DEQ] printf re-gate.
Adds per-layer KV type machinery + a narrow override that forces Q8_0
on the small subset of full-attn layers whose hidden states are captured
for the DFlash drafter (target_feat ring). Mirrors vLLM's
kv-cache-dtype-skip-layers pattern.

Why: upstream FA dispatch (deps/llama.cpp/.../fattn.cu:441) routes
TQ3_0 + Q->ne[0]>256 to slow CHUNKED kernel. On Dense Gemma4-31B
(full-attn head_dim=512), this is a perf trap. Forcing the drafter's
captured layers to Q8 unblocks the pflash sparse fast path for the
slice the draft consumes.

Gate: kv_type==TQ3 && head_dim>256 && draft wired (capture_layer_ids
non-empty). SWA layers always exempt (don't hit the trap).
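
A sketch of the per-layer KV type choice implied by that gate (helper name and parameters are illustrative, not the actual create_gemma4_cache interface):

// Force Q8_0 KV on a layer only when the global KV type is TQ3_0, the layer is
// full-attention with head_dim > 256, and its hidden state is captured for the
// DFlash drafter; everything else keeps the requested type.
#include <set>

enum KvType { KV_TQ3_0, KV_Q8_0, KV_Q4_0, KV_F16 };

static KvType pick_kv_type(KvType requested, bool is_swa_layer, int head_dim,
                           int layer, const std::set<int> & capture_layer_ids) {
    const bool override_q8 = requested == KV_TQ3_0 &&
                             !is_swa_layer &&
                             head_dim > 256 &&
                             capture_layer_ids.count(layer) > 0;
    return override_q8 ? KV_Q8_0 : requested;
}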

Empirical impact (RTX 3090, Dense 31B Q4_K_M + TQ3 + draft + pflash @ 4K):
  - Dense override fires on 2 of 10 full-attn layers (capture IDs 12, 46)
  - Prefill 48 -> 50 tok/s (marginal; 8 remaining full-attn still slow)
  - MoE override fires on 2 of 4 captured (3 keep TQ3); no regression
    (1464 tok/s under GPU contention vs 2489 dedicated)
  - Q8 control unchanged (gate requires TQ3)

Recommendation for production: Dense 31B + draft -> use Q8_0 KV
(505 tok/s prefill in our testing) until an upstream MMA-F16 TQ3
dequant kernel for head_dim=512 lands. TQ3 KV remains optimal for
MoE 26B-A4B (2489 tok/s @ 16K).

Per-layer machinery (kv_k_type_per_layer, kv_v_type_per_layer) is kept
infrastructure for future asymmetric experiments.
Submodule commit 580246202 adds an opt-in (DFLASH_TQ3_MMA=1) route
for TQ3_0 KV through the MMA-F16 tensor-core path:
- New k_tq3_0_dequant_f16_full bulk-dequant kernel
- Intercept in ggml_cuda_flash_attn_ext_mma_f16 with pool-allocated
  f16 K/V temp buffers
- tq3_needs_chunked guard lifted when env var set

Target prefill (Dense 31B + TQ3 + pflash, no draft): 420 -> 610 tok/s.

Note: with --draft enabled, Dense+TQ3 still hits the 9x penalty bug
(separate from FA dispatch). MMA fix is a building block toward closing
the gap.
When --draft is a directory containing both draft-q8_0.gguf (1.6 GB)
and model.safetensors (3 GB BF16), prefer the GGUF. The BF16 safetensors
draft pushed Dense+TQ3 over the 24 GB VRAM ceiling on a 3090, which
fragmented the allocator and triggered host-side cudaStreamSynchronize
stalls (per nsys: 67% of total CUDA time, max sync 1.5s) — collapsing
target prefill from 800+ tok/s to 41 tok/s.

The fix detects this case, logs a warning so the user knows what
happened, and loads the GGUF.

Empirical impact (RTX 3090, draft path = directory):
  Dense 31B + TQ3 + draft + pflash @ 4K:   41 -> 797-852 tok/s  (~20×)
  MoE 26B + TQ3 + draft + pflash @ 16K:    2489 -> 3089 tok/s   (+24%)
  VRAM (MoE 16K):                          24.0 GB -> 19.3 GB

This makes 852 tok/s the new ceiling for our Dense-31B + TQ3 + spec-decode
trifecta on a single RTX 3090, beating the prior best-known by ~6×
(stock llama.cpp/ollama hangs at 3-4K — see ollama#15350).

Bonus: explicit `--draft .../draft-q8_0.gguf` already worked; this
just removes the foot-gun for users passing the directory.
Add --draft-max <N> to runtime-cap the verify batch. The GGUF's
architectural block_size=16 stays validated at load; the new flag just
consumes only the first N draft tokens per cycle. Add --ignore-eos to
measure pure decode speed past natural EOS.

Empirical sweep on chat-style 4K real prompt at temp=0:

  MoE 26B-A4B + TQ3 + DFlash + pflash @ 4K
    dm=4   85.10 t/s  AL=2.88/4   <- baseline 52 t/s, +63%
    dm=8   50.28 t/s  AL=2.08/8
    dm=16  44.12 t/s  AL=2.31/16  <- prior shipped default

  Dense 31B + TQ3 + DFlash + pflash @ 4K (--ignore-eos run)
    dm=4   36.78 t/s  AL=3.51/4
    dm=8   42.07 t/s  AL=5.95/8   <- baseline 22 t/s, +87%
    dm=16  25.74 t/s  AL=3.16/16

block_size=16 was a CEILING, not an optimum. Chat workloads have AL=2-3
(MoE) / AL=3-6 (Dense), so dm=4-8 amortizes the per-step draft cost (5
layers x ~5 ms autoregressive) correctly while dm=16 over-batched and
lost decode throughput.

Per-model optimum differs (MoE: dm=4, Dense: dm=8). Ship as runtime knob;
loader's block_size validation stays unchanged.
… test

Add MtpDrafterWeights + MtpLayerWeights structs to internal.h. Implement
load_gemma4_mtp_assistant() in gemma4_target_loader.cpp to ingest the
AtomicChat-published gemma-4-31B-it-assistant GGUF (Q4_K_M, 49 tensors,
337 MB).

Loader contract (all 7 assertions PASS on the 31B GGUF):
  n_embd_backbone == 5376 (target hidden)
  requires_target_arch == "gemma4"
  4 transformer blocks
  attention_k_eq_v == true
  pre_projection [2*backbone, n_embd] = [10752, 1024]
  post_projection [n_embd, backbone] = [1024, 5376]
  per-layer donor target index in [0, 60) — resolved by SWA-pattern match,
    NOT a hardcoded "last SWA + last full" pair (mirrors atomicbot
    gemma4-assistant.cpp:12-27)

Two surprises vs the plan that change Phase 3:
  * 31B assistant uses CENTROID LM head (n_centroids=2048,
    use_ordered_embeddings=true) — every AtomicChat 31B quant inherits
    this from google/gemma-4-31B-it-assistant. v1 cannot skip centroids.
  * MTP working dim n_embd=1024 differs from backbone 5376; bridged by
    pre/post projection. Added n_embd field to MtpDrafterWeights and
    reads from gemma4_assistant.embedding_length GGUF metadata.

SWA layout on 31B: layers {0,1,2}=SWA, layer 3=full → donors {59,59,59,58}.

Phase 0 spike with atomicbot's built llama-server is NO-GO: their fork
crashes in mmq.cuh:4241 (mmq_x_best=0) on first decode regardless of KV
type, and test-speculative-mtp shows sync vs async draft tokens diverge.
We use their SOURCE as contract reference, not their BUILD as oracle.
The 337 MB Q4_K_M GGUF parses cleanly and serves as our gold input.

Build adds test_mtp_loader as a conditional CMake target. RED-GREEN
locked: same test file that previously failed to compile now exits 0.
Add gemma4_mtp_graph.cpp (503 lines): single-step MTP graph that maps
(last_token, h_prev, pos) -> (logits, h_post, in-graph argmax). Cross-
attention reads target K/V from per-MTP-layer donor (resolved at load).
KV mask shared across gamma steps per MTP.md (all step positions
> attn_pos -> causal/SWA admit uniformly).

Mirror atomicbot/gemma4-assistant.cpp lines 28-130 for the per-step
build, lines 130-220 for the centroid LM head. Use atomicbot only as
contract reference — their llama-server build is broken (mmq.cuh:4241
crash on first decode regardless of KV type).

Add MtpStepGraph struct + build/free decls to internal.h. Add
token_embd.weight optional load to MtpDrafterWeights (will be null on
Q4_K_M, present on F16 — graph picks centroid path when null).

Test (test_mtp_graph_shapes.cpp, 298 lines): builds graph from real
GGUF + stub target, asserts 6 output tensor shapes. PASS on all 6:
  out_logits  [n_vocab=262144, 1] f32
  out_h_post  [n_embd_backbone=5376, 1] f32
  out_argmax  [1] i32
  in_tok      [1] i32
  in_h_prev   [n_embd_backbone, 1] f32
  in_pos      [1] i32

Phase 2 (test_mtp_loader) regression: 7/7 still PASS.

Two surprises caught during build:
  * Dense 31B MTP has variable head_dim per layer type — SWA layers 0-2
    use head_dim_q=256, full-attn layer 3 uses head_dim_q=512. The stale
    GEMMA4_31B_HEAD_DIM=128 in gemma4.h is wrong but unused on this
    path; the new graph derives head_dim from attn_q_norm->ne[0].
  * token_embd.weight absent in Q4_K_M GGUF — fine for centroid path
    (Dense 31B uses centroids + token_ordering for output, target's
    tok_embd for input); a non-centroid drafter would need the F16 tier.

Phase 3b (spec-loop wiring at test_gemma4_dflash.cpp + h_prev capture
at gemma4_target_graph.cpp:1006) deferred to a follow-up commit.
…ntical gate

Phase 3b infrastructure + Phase 3a graph fixes for cross-attention
shape compatibility. End-to-end:

  ./test_gemma4_dflash --model <31B.gguf> --kv-k tq3_0 --kv-v tq3_0
                       --pflash --max-ctx 8192 --tokens-file <4K.csv>
                       --n-predict 32 --temp 0 --seed 0
                       --mtp <31B-assistant.Q4_K_M.gguf>
                       --draft-method mtp

runs to exit 0 and produces a token stream byte-identical to
--draft-method none on the same seed/temp. Regression-free DFlash path
preserved (have_draft path unchanged when --mtp not set).

Files touched:
  test_gemma4_dflash.cpp +290  CLI (--mtp, --draft-method, DraftMethod
                               enum), DraftMethod::Auto resolver, MTP
                               weights/graph init alongside DFlash,
                               mtp_h_prev allocator/buffer in driver,
                               per-step graph rebuild + ggml_gallocr
                               alloc, draft accept/fallback loop, free
                               on cleanup.
  gemma4_target_graph.cpp +23  h_prev capture at the existing capture-
                               layers tap (line ~1006), gated on
                               cache.mtp_h_prev_enabled and the resolved
                               last full-attn layer index.
  internal.h +35               MtpStepGraph struct + build/free decls;
                               mtp_h_prev / mtp_last_full_layer fields
                               on GemmaTargetCache; DraftMethod enum.
  gemma4_target_loader.cpp +18 Optional token_embd.weight load into
                               MtpDrafterWeights.tok_embd (null on
                               Q4_K_M GGUF since centroid head bypasses
                               it).
  gemma4_mtp_graph.cpp +196/-64
                               Cross-attention rewrite: Q/K head_dim
                               reconciled (was 256 vs 128 mismatch
                               that crashed ggml_can_mul_mat). Replaced
                               ggml_flash_attn_ext with manual attn —
                               permute K, ggml_cast quantized→F16/F32,
                               ggml_repeat for GQA, mul_mat → scale →
                               soft_max → mul_mat. The fused FA kernel
                               selector (fattn.cu:652) had no path for
                               the MTP layer's specific (head_dim ×
                               n_head × n_kv) combo on either TQ3 OR
                               F16 KV. Manual attention is general and
                               works for any shape.

Known gap (deferred to Phase 4):
  --draft-method mtp on degenerate-loop prompt shows accept_rate=0.00.
  Byte-identical gate is met (verifier falls back to target's argmax on
  rejection), but MTP itself is predicting wrong tokens. Need a real
  long-form prompt to measure AL properly + diagnose. Possible causes:
  h_prev capture point off, RoPE freqs mismatched, centroid head
  scatter wrong, or KV mask handling on the cross-attn path.

VRAM budget concern: 24.00/24.00 GB on Dense 31B + TQ3 + MTP at 4K.
Per-step graph rebuild also burns time — Phase 4 will need allocator
reuse for any chance of perf, but correctness comes first.
Three focused fixes for Gemma4 MTP draft prediction quality.

(1) Move mtp_h_prev capture from inside the per-layer loop
(gemma4_target_graph.cpp:1047) to AFTER the final RMSNorm
(line 1075). h_prev must be the post-output-norm hidden — the same
vector fed to lm_head — per vLLM PR #41745:569-621 + llama.cpp
PR #22738. Capturing inside the layer loop fed the draft head
pre-norm hiddens it was not trained on.

(2) Wire assistant's own top-level rope_freqs.weight (shape [256] f32)
into MtpDrafterWeights and prefer it for the full-attn MTP layer's
RoPE rotation. Falls back to target.layers[donor_il].rope_freqs only
when the assistant did not ship one (legacy GGUFs). vLLM PR
#41745:422-436 documents that MTP draft must build its own RoPE from
its own rope_parameters[layer_type], not reuse the target's runtime
freqs (which can be quantized or rotated by FWHT in our stack).

(3) KQ scale mismatch in cross-attention: change from
target.attn_scale (1/sqrt(head_dim)) to assistant's
f_attention_scale = 1.0. Confirmed against atomicbot
gemma4-assistant.cpp:139-140 / llama-model.cpp:1651 via Codex audit.
Smoking-gun cause of greedy divergence on every step — wrong scale
produced a different softmax distribution. After this fix, MTP draft
emits independent predictions (e.g. tokens 236772, 1852, 92450, ...)
instead of trivially defaulting to target's argmax (which had been
masking the bug as "byte-identical" while accept_rate stayed 0).

Status:
- Phase 3 byte-identical gate still met (target-only and --draft-method
  mtp produce identical token streams when MTP rejects every draft).
- accept_rate still 0% on degenerate test prompts — MTP now makes real
  (but still wrong) predictions. Remaining suspects per Codex audit
  are GQA head-grouping (item 2), KQ mask handling (item 3), and KV
  view length (item 4). Real-prompt evaluation deferred to a fresh
  Phase 4 run.
Three correctness fixes in cross-attention per Codex audit:

(1) GQA head broadcast (lines ~340-415): replace direct ggml_repeat
(which tiles by modulo: 0,1,...,Hkv-1,0,1,... — interleaved) with a
ggml_view_4d + ggml_cont + ggml_reshape_3d block-broadcast pattern
that produces 0,0,...,1,1,... block layout, matching standard GQA
semantics. Each KV head is now correctly shared by n_head_fa/n_head_kv
consecutive Q heads (see the sketch below).

(2) KQ mask (line ~455): replace ggml_soft_max(KQ) with
ggml_soft_max_ext(KQ, KQ_mask, 1.0f, 0.0f) using an all-zero F32 mask.
Atomicbot constructs a mask in llama-graph.cpp:2511-2515; passing a
zero-bias mask matches the "all positions admitted" semantic for
cross-attn while keeping the ext softmax kernel happy.

(3) SWA-aware KV view (lines ~301-355): replace the bare
min(attn_pos, cache_k->ne[1]) clamp with proper ring-buffer wrap
handling. SWA layers now (a) clamp to swa_window-1 admitted positions,
(b) compute ring start slot via modulo, (c) detect wrap-around, and
(d) build the K/V view via ggml_concat of two slices. Quantized cache
(TQ3) goes through a TQ3→F16→F32 two-step cast since cpy.cu doesn't
support TQ3→F32 directly and concat needs F32. Full-attn donors keep
the simple [0, attn_pos) view.

Plus per-step diagnostic prints in test driver (draft vs target token).
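
A hedged ggml sketch of fixes (1) and (2): block-broadcasting KV heads for GQA, and masked softmax via ggml_soft_max_ext with a zero mask. Shapes, the shape-template trick via ggml_new_tensor_4d, and the exact op sequence are reconstructed from the description above, not copied from the real code; it assumes K has already been cast to F16/F32 for the manual-attention path.

// GQA block-broadcast: [head_dim, n_kv, n_head_kv] -> [head_dim, n_kv, n_head],
// so Q heads 0..group-1 read KV head 0, heads group..2*group-1 read KV head 1,
// and so on (block layout, not the interleaved layout a direct 3D ggml_repeat gives).
#include "ggml.h"

static ggml_tensor * gqa_block_broadcast(ggml_context * ctx, ggml_tensor * k,
                                         int64_t n_head, int64_t n_head_kv) {
    const int64_t group = n_head / n_head_kv;
    // reinterpret [head_dim, n_kv, n_head_kv] as [head_dim, n_kv, 1, n_head_kv]
    ggml_tensor * k4 = ggml_view_4d(ctx, k,
                                    k->ne[0], k->ne[1], 1, n_head_kv,
                                    k->nb[1], k->nb[2], k->nb[2], 0);
    // tile the singleton dim `group` times: each KV head is duplicated in place
    ggml_tensor * shape = ggml_new_tensor_4d(ctx, k->type,
                                             k->ne[0], k->ne[1], group, n_head_kv);
    ggml_tensor * k_rep = ggml_repeat(ctx, k4, shape);
    // collapse back to 3D block layout [head_dim, n_kv, n_head]
    return ggml_reshape_3d(ctx, ggml_cont(ctx, k_rep), k->ne[0], k->ne[1], n_head);
}

// Fix (2): masked softmax over raw attention scores with an all-zero F32 mask
// (every position admitted), scale folded into the call.
static ggml_tensor * masked_softmax(ggml_context * ctx, ggml_tensor * kq,
                                    ggml_tensor * zero_mask, float scale) {
    return ggml_soft_max_ext(ctx, kq, zero_mask, scale, /*max_bias=*/0.0f);
}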

Status:
- All three crashes fixed; build clean; runtime no longer aborts.
- accept_rate STILL 0% on test prompt — MTP now emits independent
  varying predictions (e.g. 62542, 8404, 546) that consistently
  diverge from target's varying predictions (236762, 514, 92450).
- Real semantic divergence remains; not a wiring crash. Likely
  remaining: V permute order, pre_projection input format, or per-
  block residual sequence detail. Deferred to a focused next session
  where we can compare h_inner values against a known-good reference.

Phase 3 byte-identical gate still met (target-only and --draft-method
mtp produce identical output streams when MTP rejects every draft).
…s layers

Cross-attention with TQ3_0 KV cache produced accept_rate=0 because
three separate issues compounded:

1. K/V views were cast from TQ3_0 to F16/F32 before ggml_flash_attn_ext.
   The CUDA FA kernels apply forward FWHT to Q (and inverse FWHT to
   the output) only when they observe K->type == GGML_TYPE_TQ3_0
   (fattn-chunked.cu:228,394; fattn-vec.cuh:168). Casting stripped the
   type tag, FA picked a non-WHT kernel, and Q (real domain) dotted
   with K (FWHT domain, just unpacked into F16) produced meaningless
   scores. Removed the cast; Kfa/Vfa now reach FA with native TQ3_0.

2. TQ3_0 K is iterated in 128-element block strides; an unaligned
   ne[1] reads past the valid window into stale cache cells.
   Previously we only padded for head_dim>=512; SWA layers
   (head_dim=256) skipped padding and silently corrupted attention.
   Extended needs_kv_pad to fire for any TQ3_0 cache, mirroring
   gemma4_target_graph.cpp's need_256_pad policy.

3. Each layer created its own FA mask input tensor but only the last
   one was exposed via out.fa_mask. After fix #2 all four layers
   needed masks; the unfilled mask buffers contained uninitialised
   CUDA memory (cudaMalloc is not zeroed), causing NaN logits on
   subsequent steps. Hoisted a single shared mask out of the
   per-layer loop. The builder now asserts that all need-mask layers
   want the same (width, kv_seq_len) and fails loudly if a future
   long-context build wants per-layer masks (SWA cap < full
   attn_pos), instead of silently doing the wrong thing.

Trajectory:
  pre-fix:      accept_rate = 0.00 (varying garbage tokens)
  fix #1 only:  accept_rate = 0.00 (drafts pinned to a single token)
  fix #1+#2:    step 1 OK, step 2+ NaN
  fix #1+#2+#3: accept_rate = 0.22 (Q4_K_M target + Q8_0 assistant,
                TQ3_0 KV, 131-token prompt, 64 generation steps)

Adjacent infrastructure:
- create_gemma4_cache(): extra_q8_layers param to force Q8_0 on
  specific MTP donor layers when needed.
- get_mtp_swa_pattern(): lightweight helper reading MTP SWA layout
  from GGUF without loading tensors.
- MTP loader: load centroids/token_ordering whenever n_centroids>0
  (graph builder decides whether to use them).
- Test caller: fills out.fa_mask before each compute; dropped the
  per-step diagnostic prints that are no longer needed.

Known follow-ups (not blocking):
- Long-context multi-mask: SWA cap < full attn_pos trips the assert.
- SWA-wrap branch concat-forces F32 on TQ3_0, losing the WHT path.
- Accept rate 0.22 is in expected range; remaining gap to spike's
  reference numbers may come from quantization, RoPE source, or
  attention scale.

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 8 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/test/test_mtp_loader.cpp">

<violation number="1" location="dflash/test/test_mtp_loader.cpp:111">
P2: Donor-layer check is too weak: it only bounds-checks `donor_target_layer` instead of verifying the expected target-layer mapping.</violation>
</file>

<file name="dflash/src/gemma4_mtp_graph.cpp">

<violation number="1" location="dflash/src/gemma4_mtp_graph.cpp:619">
P2: Centroid-head shape/index invariants are assumed but never validated, so mismatched vocab sizes, non-divisible `n_vocab/n_centroids`, or out-of-range `top_k` can crash or silently corrupt logits.</violation>
</file>


// each MTP layer's donor must be the LAST target layer matching its own
// SWA/full type. This must be filled by the loader, not hard-coded.
for (size_t il = 0; il < mtp.layers.size(); ++il) {
if (mtp.layers[il].donor_target_layer < 0 ||

P2: Donor-layer check is too weak: it only bounds-checks donor_target_layer instead of verifying the expected target-layer mapping.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/test_mtp_loader.cpp, line 111:

<comment>Donor-layer check is too weak: it only bounds-checks `donor_target_layer` instead of verifying the expected target-layer mapping.</comment>

<file context>
@@ -0,0 +1,123 @@
+    // each MTP layer's donor must be the LAST target layer matching its own
+    // SWA/full type. This must be filled by the loader, not hard-coded.
+    for (size_t il = 0; il < mtp.layers.size(); ++il) {
+        if (mtp.layers[il].donor_target_layer < 0 ||
+            mtp.layers[il].donor_target_layer >= 60) {
+            std::fprintf(stderr, "  layer %zu donor_target_layer=%d out of [0,60)\n",
</file context>


const int64_t n_c = (int64_t)w.n_centroids;
const int64_t top_k = (int64_t)w.centroid_top_k;
// vsc: tokens per centroid slot
const int64_t vsc = (int64_t)n_vocab / n_c;

P2: Centroid-head shape/index invariants are assumed but never validated, so mismatched vocab sizes, non-divisible n_vocab/n_centroids, or out-of-range top_k can crash or silently corrupt logits.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/gemma4_mtp_graph.cpp, line 619:

<comment>Centroid-head shape/index invariants are assumed but never validated, so mismatched vocab sizes, non-divisible `n_vocab/n_centroids`, or out-of-range `top_k` can crash or silently corrupt logits.</comment>

<file context>
@@ -0,0 +1,744 @@
+        const int64_t n_c   = (int64_t)w.n_centroids;
+        const int64_t top_k = (int64_t)w.centroid_top_k;
+        // vsc: tokens per centroid slot
+        const int64_t vsc   = (int64_t)n_vocab / n_c;
+
+        // centroid_logits = mul_mat(centroids, h_inner) → [n_centroids, 1]
</file context>

