
dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step (issue #55)#62

Open
dusterbloom wants to merge 3 commits into Luce-Org:main from dusterbloom:fix/issue-55-stable-kv-pad

Conversation

@dusterbloom
Contributor

Fixes #55.

Root cause

Every spec-decode iteration calls build_target_step_tree (target verify, ~3127 ggml graph nodes) at dflash/test/test_dflash.cpp:1703 and build_draft_step (draft forward, ~186 nodes) at dflash/test/test_dflash.cpp:1556 on the same StepGraph sg, sharing one ggml_gallocr. ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes, so every call sees a mismatch left over from the previous call's opposite topology — forcing ggml_gallocr_reserve to re-walk the entire graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer (GPU driver cost).

Reporter on Windows/RTX 4090/CUDA 13 sees `ggml_gallocr_needs_realloc: graph has different number of nodes` log spam every step (the message is `#ifndef NDEBUG`-gated, so Linux Release builds are silent but pay the same cost). Decode tok/s drops from 90 @ 16k context to 55 @ 32k context.

Fix

Split the shared StepGraph sg into target_sg and draft_sg, each with its own ggml_gallocr. Target verify settles into the 3127-node topology, draft into 186-node, neither bounces.

 dflash/test/test_dflash.cpp | 42 ++++++++++++++++++++++++++++--------------

The diff is minimized via a StepGraph & sg = target_sg; alias so the existing prefill/target-verify call sites are unchanged; only the draft block (~10 references) swaps sg.X for draft_sg.X. Daemon-mode reset and the two migrate-cache sites destroy both StepGraphs.

Verification

Patched ggml_gallocr_alloc_graph to unconditionally fprintf to stderr at each "needs_realloc returns true" site (removing the #ifndef NDEBUG gate). Ran test_dflash on a tokenized HE prompt + n_gen=256 + --ddtree-budget=22 + --max-ctx=2048 + --fast-rollback. Same prompt, same flags, before vs after this commit:

| Metric | Before | After |
| --- | --- | --- |
| needs_realloc events over 26 steps | 56 | 3 (initial only) |
| cudaFree+cudaMalloc events during decode | 14 | 0 |

Reasons breakdown before fix:

  • 26: n_nodes 186 → 3127
  • 25: n_nodes 3127 → 186 (the alternation)
  • 3: per-tensor size grew (monotonic kv_pad growth)
  • 2: initial reserves

After fix: just the 3 initial reserves (one per gallocr, each fired exactly once at first use).

Bench (RTX 3090 / Linux / CUDA 12.6)

bench_he.py --n-gen 128 --ddtree-budget 22, 3-run mean:

  • main: 86.72 tok/s (runs: 85.38, 88.05, 86.72)
  • this fix: 84.99 tok/s (runs: 84.98, 81.71, 88.30)

Within bench-noise. This is consistent with the hypothesis that on Linux/CUDA 12.6 the per-step cudaFree+cudaMalloc cost is small (driver fast-paths the alloc), so eliminating it doesn't show up as decode tok/s. The reporter's Windows/CUDA 13 stack has a slower stream-allocator where the saved cost should translate into measurable tok/s recovery — needs verification on their box.

Test plan

  • Build clean (Release).
  • bench_he.py parity (within noise).
  • Bit-exact correctness preserved across the fix (same step count, same accept length, same final committed token).
  • Instrumentation confirms the realloc-per-step pattern is gone.
  • Reporter (@gtrak) verifies on Windows / RTX 4090 / CUDA 13 — does this recover decode tok/s at 16k+ context?

What this does NOT fix

  • Residual needs_realloc events from monotonic kv_pad growth at long context — these are rare boundary crossings (every ~256 tokens), not per-step. Codex review flagged these as low-severity; not chasing unless the reporter still sees churn.
  • Codex review also flagged the StepGraph & sg = target_sg; alias as a low-severity readability footgun — a follow-up s/sg/target_sg/g would clarify. Held to keep this PR minimal.

…-decode step

Issue Luce-Org#55: every spec-decode iteration calls build_target_step_tree
(target verify, ~3127 graph nodes) and build_draft_step (draft forward,
~186 graph nodes) on the SAME StepGraph, sharing one ggml_gallocr.
ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes,
so every call sees a mismatch left over from the previous call's
opposite topology, forcing ggml_gallocr_reserve to re-walk the entire
graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer
(GPU driver cost).  Reporter on Windows/RTX 4090 sees the
"graph has different number of nodes" debug log fire every step and
decode tok/s halving from 90 @ 16k context to 55 @ 32k context.

Fix: introduce target_sg and draft_sg, each with its own ggml_gallocr.
Target verify settles into the 3127-node graph topology, draft into
the 186-node topology, and neither bounces.  Existing prefill /
target-verify call sites keep their `sg` references via a
StepGraph & sg = target_sg alias; only the draft block (~10 calls)
swaps `sg.X` for `draft_sg.X`.  Daemon-mode reset and migrate-cache
sites destroy both StepGraphs.

Verified with one-line instrumentation patch on ggml_gallocr_alloc_graph
(unconditionally fprintf to stderr at each "needs_realloc returns
true" site, removing the #ifndef NDEBUG gate the upstream messages
are silenced by in Release builds).  HE prompt 00 + ddtree-budget=22 +
n_gen=256 over 26 spec-decode steps:

  Before: 56 needs_realloc events (alternating n_nodes 186 ↔ 3127),
          14 cudaFree+cudaMalloc events.
  After:  3 needs_realloc events (initial only: 0 -> 3127, 0 -> 3079,
          0 -> 186), 0 cudaFree+cudaMalloc events during decode.

bench_he.py (RTX 3090, --n-gen 128, --ddtree-budget 22, 3-run mean):
  main:     86.72 tok/s
  this fix: 84.99 tok/s
Within bench-noise on Linux/CUDA 12.6 because cudaMalloc is cheap on
this stack — the saved per-step cost is small.  The reporter's stack
(Windows/CUDA 13/RTX 4090) has a slower stream-allocator where the
saved cost should translate into measurable tok/s recovery; that
needs verification on the reporter's box.
@dusterbloom marked this pull request as draft April 30, 2026 12:56
@dusterbloom marked this pull request as ready for review April 30, 2026 13:15

@cubic-dev-ai (bot) left a comment


No issues found across 2 files

@dusterbloom force-pushed the fix/issue-55-stable-kv-pad branch from 05cb709 to 0ce6832 on May 1, 2026 16:29
@davide221
Contributor

@gtrak can you verify if this is solved?


Successfully merging this pull request may close these issues.

Slow decode on RTX 4090 and windows

2 participants