dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step (issue #55)#62
Open
dusterbloom wants to merge 3 commits intoLuce-Org:mainfrom
Open
Conversation
…-decode step Issue Luce-Org#55: every spec-decode iteration calls build_target_step_tree (target verify, ~3127 graph nodes) and build_draft_step (draft forward, ~186 graph nodes) on the SAME StepGraph, sharing one ggml_gallocr. ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes, so every call sees a mismatch left over from the previous call's opposite topology, forcing ggml_gallocr_reserve to re-walk the entire graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer (GPU driver cost). Reporter on Windows/RTX 4090 sees the "graph has different number of nodes" debug log fire every step and decode tok/s halving from 90 @ 16k context to 55 @ 32k context. Fix: introduce target_sg and draft_sg, each with its own ggml_gallocr. Target verify settles into the 3127-node graph topology, draft into the 186-node topology, and neither bounces. Existing prefill / target-verify call sites keep their `sg` references via a StepGraph & sg = target_sg alias; only the draft block (~10 calls) swaps `sg.X` for `draft_sg.X`. Daemon-mode reset and migrate-cache sites destroy both StepGraphs. Verified with one-line instrumentation patch on ggml_gallocr_alloc_graph (unconditionally fprintf to stderr at each "needs_realloc returns true" site, removing the #ifndef NDEBUG gate the upstream messages are silenced by in Release builds). HE prompt 00 + ddtree-budget=22 + n_gen=256 over 26 spec-decode steps: Before: 56 needs_realloc events (alternating n_nodes 186 ↔ 3127), 14 cudaFree+cudaMalloc events. After: 3 needs_realloc events (initial only: 0 -> 3127, 0 -> 3079, 0 -> 186), 0 cudaFree+cudaMalloc events during decode. bench_he.py (RTX 3090, --n-gen 128, --ddtree-budget 22, 3-run mean): main: 86.72 tok/s this fix: 84.99 tok/s Within bench-noise on Linux/CUDA 12.6 because cudaMalloc is cheap on this stack — the saved per-step cost is small. The reporter's stack (Windows/CUDA 13/RTX 4090) has a slower stream-allocator where the saved cost should translate into measurable tok/s recovery; that needs verification on the reporter's box.
…v-pad # Conflicts: # dflash/test/test_dflash.cpp
05cb709 to
0ce6832
Compare
Contributor
|
@gtrak can you verify if this is solved? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #55.
Root cause
Every spec-decode iteration calls
build_target_step_tree(target verify, ~3127 ggml graph nodes) atdflash/test/test_dflash.cpp:1703andbuild_draft_step(draft forward, ~186 nodes) atdflash/test/test_dflash.cpp:1556on the sameStepGraph sg, sharing oneggml_gallocr.ggml_gallocr_needs_realloccomparesgalloc->n_nodestograph->n_nodes, so every call sees a mismatch left over from the previous call's opposite topology — forcingggml_gallocr_reserveto re-walk the entire graph (CPU cost) and oftencudaFree+cudaMallocthe activation buffer (GPU driver cost).Reporter on Windows/RTX 4090/CUDA 13 sees
ggml_gallocr_needs_realloc: graph has different number of nodeslog spam every step (the message is#ifndef NDEBUGgated, so Linux Release builds are silent but pay the same cost). Decode tok/s halves from 90 @ 16k context to 55 @ 32k context.Fix
Split the shared
StepGraph sgintotarget_sganddraft_sg, each with its ownggml_gallocr. Target verify settles into the 3127-node topology, draft into 186-node, neither bounces.The diff is minimized via a
StepGraph & sg = target_sg;alias so the existing prefill/target-verify call sites are unchanged; only the draft block (~10 references) swapssg.Xfordraft_sg.X. Daemon-mode reset and the two migrate-cache sites destroy both StepGraphs.Verification
Patched
ggml_gallocr_alloc_graphto unconditionallyfprintfto stderr at each "needs_realloc returns true" site (removing the#ifndef NDEBUGgate). Rantest_dflashon a tokenized HE prompt + n_gen=256 + --ddtree-budget=22 + --max-ctx=2048 + --fast-rollback. Same prompt, same flags, before vs after this commit:needs_reallocevents over 26 stepscudaFree+cudaMallocevents during decodeReasons breakdown before fix:
n_nodes 186 → 3127n_nodes 3127 → 186(the alternation)kv_padgrowth)After fix: just the 3 initial reserves (one per gallocr, each fired exactly once at first use).
Bench (RTX 3090 / Linux / CUDA 12.6)
bench_he.py --n-gen 128 --ddtree-budget 22, 3-run mean:Within bench-noise. This is consistent with the hypothesis that on Linux/CUDA 12.6 the per-step
cudaFree+cudaMalloccost is small (driver fast-paths the alloc), so eliminating it doesn't show up as decode tok/s. The reporter's Windows/CUDA 13 stack has a slower stream-allocator where the saved cost should translate into measurable tok/s recovery — needs verification on their box.Test plan
bench_he.pyparity (within noise).What this does NOT fix
needs_reallocevents from monotonickv_padgrowth at long context — these are rare boundary crossings (every ~256 tokens), not per-step. Codex review flagged these as low-severity; not chasing unless the reporter still sees churn.StepGraph & sg = target_sg;alias as a low-severity readability footgun — a follow-ups/sg/target_sg/gwould clarify. Held to keep this PR minimal.