
dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step (issue #55)#62

Open
dusterbloom wants to merge 3 commits into Luce-Org:main from dusterbloom:fix/issue-55-stable-kv-pad

Conversation

@dusterbloom
Contributor

Fixes #55.

Root cause

Every spec-decode iteration calls build_target_step_tree (target verify, ~3127 ggml graph nodes) at dflash/test/test_dflash.cpp:1703 and build_draft_step (draft forward, ~186 nodes) at dflash/test/test_dflash.cpp:1556 on the same StepGraph sg, sharing one ggml_gallocr. ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes, so every call sees a mismatch left over from the previous call's opposite topology — forcing ggml_gallocr_reserve to re-walk the entire graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer (GPU driver cost).

Reporter on Windows/RTX 4090/CUDA 13 sees `ggml_gallocr_needs_realloc: graph has different number of nodes` log spam every step (the message is `#ifndef NDEBUG`-gated, so Linux Release builds are silent but pay the same cost). Decode tok/s drops from 90 @ 16k context to 55 @ 32k context.

Fix

Split the shared StepGraph sg into target_sg and draft_sg, each with its own ggml_gallocr. Target verify settles into the 3127-node topology, draft into 186-node, neither bounces.

 dflash/test/test_dflash.cpp | 42 ++++++++++++++++++++++++++++--------------

The diff is minimized via a StepGraph & sg = target_sg; alias so the existing prefill/target-verify call sites are unchanged; only the draft block (~10 references) swaps sg.X for draft_sg.X. Daemon-mode reset and the two migrate-cache sites destroy both StepGraphs.

Verification

Patched ggml_gallocr_alloc_graph to unconditionally fprintf to stderr at each "needs_realloc returns true" site (removing the #ifndef NDEBUG gate). Ran test_dflash on a tokenized HE prompt + n_gen=256 + --ddtree-budget=22 + --max-ctx=2048 + --fast-rollback. Same prompt, same flags, before vs after this commit:

| Metric | Before | After |
| --- | --- | --- |
| needs_realloc events over 26 steps | 56 | 3 (initial only) |
| cudaFree+cudaMalloc events during decode | 14 | 0 |

Reasons breakdown before fix:

  • 26: n_nodes 186 → 3127
  • 25: n_nodes 3127 → 186 (the alternation)
  • 3: per-tensor size grew (monotonic kv_pad growth)
  • 2: initial reserves

After fix: just the 3 initial reserves (one per gallocr, each fired exactly once at first use).

Bench (RTX 3090 / Linux / CUDA 12.6)

bench_he.py --n-gen 128 --ddtree-budget 22, 3-run mean:

  • main: 86.72 tok/s (runs: 85.38, 88.05, 86.72)
  • this fix: 84.99 tok/s (runs: 84.98, 81.71, 88.30)

Within bench-noise. This is consistent with the hypothesis that on Linux/CUDA 12.6 the per-step cudaFree+cudaMalloc cost is small (driver fast-paths the alloc), so eliminating it doesn't show up as decode tok/s. The reporter's Windows/CUDA 13 stack has a slower stream-allocator where the saved cost should translate into measurable tok/s recovery — needs verification on their box.

Test plan

  • Build clean (Release).
  • bench_he.py parity (within noise).
  • Bit-exact correctness preserved across the fix (same step count, same accept length, same final committed token).
  • Instrumentation confirms the realloc-per-step pattern is gone.
  • Reporter (@gtrak) verifies on Windows / RTX 4090 / CUDA 13 — does this recover decode tok/s at 16k+ context?

What this does NOT fix

  • Residual needs_realloc events from monotonic kv_pad growth at long context — these are rare boundary crossings (every ~256 tokens), not per-step. Codex review flagged these as low-severity; not chasing unless the reporter still sees churn.
  • Codex review also flagged the StepGraph & sg = target_sg; alias as a low-severity readability footgun — a follow-up s/sg/target_sg/g would clarify. Held to keep this PR minimal.

…-decode step

Issue Luce-Org#55: every spec-decode iteration calls build_target_step_tree
(target verify, ~3127 graph nodes) and build_draft_step (draft forward,
~186 graph nodes) on the SAME StepGraph, sharing one ggml_gallocr.
ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes,
so every call sees a mismatch left over from the previous call's
opposite topology, forcing ggml_gallocr_reserve to re-walk the entire
graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer
(GPU driver cost).  Reporter on Windows/RTX 4090 sees the
"graph has different number of nodes" debug log fire every step and
decode tok/s halving from 90 @ 16k context to 55 @ 32k context.

Fix: introduce target_sg and draft_sg, each with its own ggml_gallocr.
Target verify settles into the 3127-node graph topology, draft into
the 186-node topology, and neither bounces.  Existing prefill /
target-verify call sites keep their `sg` references via a
StepGraph & sg = target_sg alias; only the draft block (~10 calls)
swaps `sg.X` for `draft_sg.X`.  Daemon-mode reset and migrate-cache
sites destroy both StepGraphs.

Verified with one-line instrumentation patch on ggml_gallocr_alloc_graph
(unconditionally fprintf to stderr at each "needs_realloc returns
true" site, removing the #ifndef NDEBUG gate the upstream messages
are silenced by in Release builds).  HE prompt 00 + ddtree-budget=22 +
n_gen=256 over 26 spec-decode steps:

  Before: 56 needs_realloc events (alternating n_nodes 186 ↔ 3127),
          14 cudaFree+cudaMalloc events.
  After:  3 needs_realloc events (initial only: 0 -> 3127, 0 -> 3079,
          0 -> 186), 0 cudaFree+cudaMalloc events during decode.

bench_he.py (RTX 3090, --n-gen 128, --ddtree-budget 22, 3-run mean):
  main:     86.72 tok/s
  this fix: 84.99 tok/s
Within bench-noise on Linux/CUDA 12.6 because cudaMalloc is cheap on
this stack — the saved per-step cost is small.  The reporter's stack
(Windows/CUDA 13/RTX 4090) has a slower stream-allocator where the
saved cost should translate into measurable tok/s recovery; that
needs verification on the reporter's box.
@dusterbloom marked this pull request as draft April 30, 2026 12:56
@dusterbloom marked this pull request as ready for review April 30, 2026 13:15

@cubic-dev-ai (bot) left a comment


No issues found across 2 files

@dusterbloom force-pushed the fix/issue-55-stable-kv-pad branch from 05cb709 to 0ce6832 on May 1, 2026 16:29
@davide221
Contributor

@gtrak can you verify if this is solved?


Successfully merging this pull request may close these issues.

Slow decode on RTX 4090 and windows

2 participants