spike(D1b): compute-unit split — chunk3 on .cpuAndGPU enables overlap #77
Closed
john-rocky wants to merge 2 commits into main from spike/d1b-compute-unit-split
Conversation
john-rocky
added a commit
that referenced
this pull request
Apr 15, 2026
Implements the 2-stage pipelined decode path proposed in PR #77's spike: chunk3 loaded on .cpuAndGPU with async dispatch on a dedicated DispatchQueue; the main thread joins before chunk4. Opt-in via CHUNK_PIPELINE_ENABLED=1; defaults OFF on main (matches drafterUnionEnabled merge discipline).

Measurement (Mac Studio, 128-token decode, drafters OFF):
- chat: 32.80 → 25.21 tok/s (−23%), bit-exact PASS
- code: 33.24 → 25.50 tok/s (−23%), bit-exact PASS
- qa: 33.15 → 24.86 tok/s (−25%), bit-exact PASS
- summary: 33.02 → 25.43 tok/s (−23%), bit-exact FAIL @ tok 50 (fp16)

STOP condition hit (guardrail: ≥ +15% tok/s required). The Gemma-4 chunk graph has a strict linear dependency chain c1→c2→c3→c4 within a step, and c1@N+1 depends on token@N from c4@N across steps. No within-step overlap window exists for the 16 ms GPU c3 to hide behind; the ~1 µs dict-build overlap is negligible. Result: the c3-GPU 2.2× slowdown (7.5 → 16.4 ms) surfaces fully as regression, matching PR #77's prediction exactly.

Keeping the plumbing wired (async queue, public toggle, DUMP_TOKEN_IDS in smoke CLI) so a future structural fix (decoupled c4, speculative h3, or model re-chunking) can reuse the scaffolding without reinventing it. Do NOT merge as production default.

See docs/PHASE_D_PIPELINING_IMPL.md for full analysis, design, correctness protocol, and next-step options. Net Swift: +124 lines across 3 files.
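The dispatch pattern the commit describes — chunk3 fired asynchronously on its own queue, joined before chunk4 — can be sketched as below. This is a minimal illustration with stubbed chunk functions; the names runChunk2/runChunk3/runChunk4, predictStepPipelined, and pipelineQueue are hypothetical stand-ins, not the repository's actual API.

```swift
import Dispatch
import Foundation

// Hypothetical stand-ins for the per-chunk CoreML predictions.
func runChunk2(_ x: [Float]) -> [Float] { x.map { $0 + 1 } }
func runChunk3(_ x: [Float]) -> [Float] { x.map { $0 * 2 } }  // the chunk that would run on .cpuAndGPU
func runChunk4(_ x: [Float]) -> [Float] { x.map { $0 - 1 } }

// Dedicated queue so the GPU-backed chunk3 dispatch does not block the decode thread.
let pipelineQueue = DispatchQueue(label: "chunk3.pipeline")

func predictStepPipelined(_ input: [Float]) -> [Float] {
    let h2 = runChunk2(input)
    var h3: [Float] = []
    let group = DispatchGroup()
    // Kick chunk3 off asynchronously; other work could overlap here
    // (in practice only the ~1 µs dict build, hence the measured regression).
    pipelineQueue.async(group: group) { h3 = runChunk3(h2) }
    group.wait()          // join before chunk4 — strict c3 → c4 data dependency
    return runChunk4(h3)
}
```

The immediate group.wait() after the async dispatch makes the structural problem visible: chunk4 cannot start until chunk3's hidden states land, so the async plumbing buys nothing unless work exists between dispatch and join.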
Closed
5 tasks
john-rocky
added a commit
that referenced
this pull request
Apr 15, 2026
…urrent ANE decode ceiling

PR #78 reframed the value prop around a triad of ~1 W power, ~1 s TTFT, and ~43 tok/s decode (projected via PR #77's compute-unit-split spike). PR #79 (open) implemented the full 2-stage pipeline that projection required and measured a 24% regression across all 4 prompt categories, with a bit-exact failure on summary @ token 50 from fp16 rounding between the ANE and GPU backends of chunk 3.

Root cause: the Gemma-4 chunk graph has a strict c3 → c4 data dep (c4 consumes c3's hidden_states_out). The only within-step overlap window is a ~1 µs Swift dict-build against ~16 ms GPU c3; the cross-step pipeline is blocked by the symmetric token-feedback edge. No non-speculative decode overlap is available on the current graph; PR #79's three future options all require conversion-side work.

This commit retracts the ~43 tok/s projection on main and propagates the consequence: 32 tok/s is the measured ANE decode ceiling, item 27 (GPU prefill / TTFT) is now the single critical-path decode-adjacent lever, and the gap vs LiteRT-LM on decode widens from 20% to 42%. The UX argument (~1 W, ~1 s TTFT, GPU-free host envelope) carries the pitch, not decode parity.

Touched:
- MOBILE_2K_COMPETITIVE_PLAN.md: retraction callout, triad update, competitive-table honesty, projection-basis rewrite, D1b removed from the execution table (B is now the only item).
- PHASE_B_DECISION.md §"What this means for the go-forward target": D1b status flipped to REGRESSED with structural cause; item 27 elevated to sole decode-adjacent lever.
- PRIORITY_ROADMAP.md item 27: footnote marking it as the single critical-path decode-adjacent item after D1b invalidation.
- HANDOFF.md: read order includes the D1b failure doc; opening prompt retracts 43 tok/s; next-session starts are item 27 OR one of PR #79's three conversion-side options.

Preserves history — callouts cite PR #79 / commit 7c21c7b rather than rewriting the prior reasoning chain. Total net-added prose ≈ 99 lines across 4 docs.
Docs only.
Merged
4 tasks
john-rocky
added a commit
that referenced
this pull request
Apr 15, 2026
…urrent ANE decode ceiling (#80) — same commit message as above, merged to main as PR #80. Co-authored-by: John Rocky <john-rocky@users.noreply.github.com>
Env-gated follow-up to PR #75. When COMPUTE_UNIT_SPLIT=1, chunk3 is loaded with MLModelConfiguration.computeUnits = .cpuAndGPU while the other chunks stay on the inherited unit (usually ANE). A one-shot probe mirrors PR #75's runConcurrencyProbe but pairs c2 (ANE) and c3 (GPU) on separate DispatchQueues.

Finding: overlap factor 0.87–0.99 across all four prompt categories (vs 0.02–0.06 on pure-ANE in PR #75). Kernel-level parallelism between the ANE and GPU drivers works as hypothesised.

Caveat: end-to-end tok/s regresses 33.2 → 25.3 (−24%) because c3 on GPU is 2.2× slower (7.5 ms → 16.6 ms) and the current serial predictStep pays that deficit without claiming the overlap prize. Realising the win needs a follow-up pipelining change.

Default-off: COMPUTE_UNIT_SPLIT unset produces zero behaviour change. 97 net-added Swift lines (one config branch in load(), one probe). See docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md for full data and the projection for a proper 2-stage pipeline (~43 tok/s ceiling, not 56).
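The env-gated config branch can be sketched as follows. ComputeUnitsChoice is an illustrative mirror of CoreML's MLComputeUnits (so the sketch stays self-contained), and computeUnits(forChunk:inherited:env:) is a hypothetical helper, not the repository's load() code.

```swift
import Foundation

// Illustrative mirror of MLComputeUnits; the real type lives in CoreML.
enum ComputeUnitsChoice {
    case all, cpuAndNeuralEngine, cpuAndGPU
}

// Hypothetical helper: pick each chunk's compute units from the env gate.
// COMPUTE_UNIT_SPLIT=1 routes chunk3 to .cpuAndGPU; every other chunk — and
// every chunk when the gate is unset — keeps the inherited unit, so an unset
// env produces zero behaviour change.
func computeUnits(forChunk index: Int,
                  inherited: ComputeUnitsChoice,
                  env: [String: String] = ProcessInfo.processInfo.environment)
    -> ComputeUnitsChoice
{
    if env["COMPUTE_UNIT_SPLIT"] == "1" && index == 3 {
        return .cpuAndGPU
    }
    return inherited
}
```

In the real load path this value would be assigned to MLModelConfiguration.computeUnits before the per-chunk MLModel is compiled, which is why the choice is baked in at load time.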
Exposes the spike's COMPUTE_UNIT_SPLIT / GPU_PREFILL env gates (plus the base MLComputeUnits) as a user-facing choice at model-selection time. The picked value is persisted via UserDefaults (ComputeMode.storageKey) and consumed by LLMRunner.loadModel, which calls setenv() for the gates and threads the matching MLComputeUnits through CoreMLLLM.load.

Modes: ANE (default) / GPU / ANE + GPU prefill / ANE + c3→GPU (spike) / All. Applied only at load time — changing the picker without reloading has no effect, matching how CoreML bakes the config into the per-chunk MLModel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
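The persist-then-setenv flow can be sketched like this. The case names, the envGates mapping, and applyEnvGates are hypothetical; only the gate names (COMPUTE_UNIT_SPLIT, GPU_PREFILL) and the storageKey pattern come from the commit message.

```swift
import Foundation

// Hypothetical mirror of the picker's modes; the real ComputeMode lives in the app target.
enum ComputeMode: String, CaseIterable {
    case ane, gpu, anePrefillGPU, aneC3GPU, all

    static let storageKey = "ComputeMode.selected"   // UserDefaults key

    // Env gates implied by each mode (gate names from the spike).
    var envGates: [String: String] {
        switch self {
        case .aneC3GPU:      return ["COMPUTE_UNIT_SPLIT": "1"]
        case .anePrefillGPU: return ["GPU_PREFILL": "1"]
        default:             return [:]
        }
    }
}

// Applied only at load time: the gates must be set before the chunk models load,
// because CoreML bakes the compute-unit config into each MLModel.
func applyEnvGates(for mode: ComputeMode) {
    for (key, value) in mode.envGates {
        setenv(key, value, 1)   // overwrite = 1
    }
}
```

Persisting the raw value (UserDefaults.standard.set(mode.rawValue, forKey: ComputeMode.storageKey)) and reading it back on the next launch keeps the picker and the load path in sync across sessions.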
Force-pushed 13950f3 to 80328a3
Closing: experiment marked "do not auto-merge". The overlap finding (0.87–0.99) is interesting, but the final project direction concluded that a Metal/GPU port does not unlock the LiteRT 56.5 ceiling (memory: 56.5 is structurally unreachable). The compute-unit picker UI is superseded by PR #99. Branch spike/d1b-compute-unit-split is preserved; docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md will be cherry-picked to main in a separate docs PR.
3 tasks
john-rocky
added a commit
that referenced
this pull request
Apr 29, 2026
…es (#159)

Cherry-picks finding documents from PRs #75/#76/#77/#79/#105 (all closed without merge) so the institutional knowledge survives the branch closures.

PR #75 (spike/d1-chunk-pipelining):
- docs/PHASE_D_PIPELINING_SPIKE.md — ANE serialises MLModel predictions at the driver level (overlap factor 0.02–0.06).

PR #76 (feat/c0-verify-logits-output):
- docs/PHASE_C_TOLERANCE_FINDINGS.md — tolerance-2 acceptance does not close the bench-vs-oracle gap (0/3 hit on pass bar 2.6).
- eval/accept-rate-bench-v6-tolerance.json — measurement data.

PR #77 (spike/d1b-compute-unit-split):
- docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md — kernel-level ANE+GPU overlap works (0.87–0.99) but e2e tok/s regresses 24% from the c3-on-GPU 2.2× slowdown.

PR #79 (feat/chunk-pipelining-d1b):
- docs/PHASE_D_PIPELINING_IMPL.md — productionised 2-stage pipeline regresses 23–25% across 4 categories. Root cause: strict linear chunk dependency chain.

PR #105 (feat/litert-perf-adoptions):
- docs/DRAFTER_DEAD_FOR_E2B.md — drafter dead for E2B (acceptance ceiling).
- docs/LITERT_PERF_ADOPTIONS.md — adoption-attempt methodology.
- docs/MLX_GAP_ANALYSIS.md — MLX comparison.
- docs/MAC_BENCH_2026-04-19.md — Mac side-by-side bench data.

Co-authored-by: John Rocky <john-rocky@users.noreply.github.com>
Summary
Env-gated follow-up to PR #75's negative result. This is an EXPERIMENT, do not auto-merge — user reviews.
- COMPUTE_UNIT_SPLIT=1 loads chunk3 with .cpuAndGPU; other chunks inherit ANE.
- Added runComputeUnitSplitProbe() (a c2/c3 variant of PR #75's probe) that measures serial vs parallel wall-clock on separate DispatchQueues.
- COMPUTE_UNIT_SPLIT unset produces zero behaviour change (verified: [Load] output and tok/s unchanged).
- 97 net-added Swift lines (one config branch in load(), one probe function). Over the 60-line target, but kept in line with PR #75's 91-line probe for comparability.

Finding
Overlap factor 0.87–0.99 across all four prompt categories (vs 0.02–0.06 on pure-ANE in PR #75). Kernel-level parallelism between ANE and GPU drivers works when two chunks go through distinct driver queues.
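One plausible way to compute an overlap factor from the probe's timings is sketched below (the spike's exact formula is in docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md; this definition — the fraction of the shorter task hidden inside the parallel wall-clock — is an assumption, as is the measureParallel harness).

```swift
import Dispatch
import Foundation

// Overlap factor: 1.0 when the shorter task is fully hidden behind the longer
// one (parallel == max(a, b)); 0.0 when the runs serialise (parallel == a + b).
func overlapFactor(serialA: Double, serialB: Double, parallel: Double) -> Double {
    (serialA + serialB - parallel) / min(serialA, serialB)
}

// Minimal probe harness: run two closures on distinct queues and time the
// parallel wall-clock in milliseconds.
func measureParallel(_ a: @escaping () -> Void, _ b: @escaping () -> Void) -> Double {
    let qA = DispatchQueue(label: "probe.a")   // stands in for the ANE-backed c2
    let qB = DispatchQueue(label: "probe.b")   // stands in for the GPU-backed c3
    let group = DispatchGroup()
    let t0 = DispatchTime.now()
    qA.async(group: group, execute: a)
    qB.async(group: group, execute: b)
    group.wait()
    return Double(DispatchTime.now().uptimeNanoseconds - t0.uptimeNanoseconds) / 1e6
}
```

Under this definition, a pure-ANE pair that serialises (parallel ≈ a + b) scores near 0, while the ANE+GPU pair scoring 0.87–0.99 means the shorter chunk's latency is almost entirely hidden.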
Caveat — end-to-end tok/s regresses with the current serial predictStep: c3 on GPU is 2.2× slower (7.5 → 16.6 ms), and the serial predictStep pays that deficit without claiming the overlap prize. Realising the win requires a follow-up pipelining change.

Verdict
(a) Overlap works → pursue full compute-unit-split pipelining. The projected ceiling with a proper 2-stage pipeline is ~43 tok/s (not 56). This is the last non-speculative decode lever and refutes the pessimistic reading of PR #75 — Mac CoreML does NOT globally serialise predictions; it only serialises within a single driver queue.
See docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md for methodology, full data, projections, and concrete next steps.

Test plan
- Default-off path verified (no [Spike] output when env unset).