spike(D1b): compute-unit split — chunk3 on .cpuAndGPU enables overlap #77
Closed
john-rocky wants to merge 2 commits into main from spike/d1b-compute-unit-split
Conversation
john-rocky
added a commit
that referenced
this pull request
Apr 15, 2026
Implements the 2-stage pipelined decode path proposed in PR #77's spike: chunk3 loaded on .cpuAndGPU with async dispatch on a dedicated DispatchQueue; the main thread joins before chunk4. Opt-in via CHUNK_PIPELINE_ENABLED=1; defaults OFF on main (matches drafterUnionEnabled merge discipline).

Measurement (Mac Studio, 128-token decode, drafters OFF):
- chat: 32.80 → 25.21 tok/s (−23%), bit-exact PASS
- code: 33.24 → 25.50 tok/s (−23%), bit-exact PASS
- qa: 33.15 → 24.86 tok/s (−25%), bit-exact PASS
- summary: 33.02 → 25.43 tok/s (−23%), bit-exact FAIL @ tok 50 (fp16)

STOP condition hit (guardrail: ≥ +15% tok/s required). The Gemma-4 chunk graph has a strict linear dependency chain c1→c2→c3→c4 within a step, and c1@N+1 depends on token@N from c4@N across steps. No within-step overlap window exists for the 16 ms GPU c3 to hide behind; the ~1 µs dict-build overlap is negligible. Result: the c3-GPU 2.2× slowdown (7.5 → 16.4 ms) surfaces fully as regression, matching PR #77's prediction exactly.

Keeping the plumbing wired (async queue, public toggle, DUMP_TOKEN_IDS in smoke CLI) so a future structural fix (decoupled c4, speculative h3, or model re-chunking) can reuse the scaffolding without reinventing it. Do NOT merge as production default.

See docs/PHASE_D_PIPELINING_IMPL.md for full analysis, design, correctness protocol, and next-step options. Net Swift: +124 lines across 3 files.
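The dispatch pattern the commit describes — chunk3 fired asynchronously on its own queue, joined before chunk4 — can be sketched as below. This is a minimal illustration with stubbed chunk functions; the names runChunk2/runChunk3/runChunk4, predictStepPipelined, and pipelineQueue are hypothetical stand-ins, not the repository's actual API.

```swift
import Dispatch
import Foundation

// Hypothetical stand-ins for the per-chunk CoreML predictions.
func runChunk2(_ x: [Float]) -> [Float] { x.map { $0 + 1 } }
func runChunk3(_ x: [Float]) -> [Float] { x.map { $0 * 2 } }  // the chunk that would run on .cpuAndGPU
func runChunk4(_ x: [Float]) -> [Float] { x.map { $0 - 1 } }

// Dedicated queue so the GPU-backed chunk3 dispatch does not block the decode thread.
let pipelineQueue = DispatchQueue(label: "chunk3.pipeline")

func predictStepPipelined(_ input: [Float]) -> [Float] {
    let h2 = runChunk2(input)
    var h3: [Float] = []
    let group = DispatchGroup()
    // Kick chunk3 off asynchronously; other work could overlap here
    // (in practice only the ~1 µs dict build, hence the measured regression).
    pipelineQueue.async(group: group) { h3 = runChunk3(h2) }
    group.wait()          // join before chunk4 — strict c3 → c4 data dependency
    return runChunk4(h3)
}
```

The immediate group.wait() after the async dispatch makes the structural problem visible: chunk4 cannot start until chunk3's hidden states land, so the async plumbing buys nothing unless work exists between dispatch and join.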
Closed
5 tasks
john-rocky
added a commit
that referenced
this pull request
Apr 15, 2026
…urrent ANE decode ceiling

PR #78 reframed the value prop around a triad of ~1 W power, ~1 s TTFT, and ~43 tok/s decode (projected via PR #77's compute-unit-split spike). PR #79 (open) implemented the full 2-stage pipeline that projection required and measured a 24% regression across all 4 prompt categories, with a bit-exact failure on summary @ token 50 from fp16 rounding between the ANE and GPU backends of chunk 3.

Root cause: the Gemma-4 chunk graph has a strict c3 → c4 data dep (c4 consumes c3's hidden_states_out). The only within-step overlap window is a ~1 µs Swift dict-build against ~16 ms GPU c3; the cross-step pipeline is blocked by the symmetric token-feedback edge. No non-speculative decode overlap is available on the current graph; PR #79's three future options all require conversion-side work.

This commit retracts the ~43 tok/s projection on main and propagates the consequence: 32 tok/s is the measured ANE decode ceiling, item 27 (GPU prefill / TTFT) is now the single critical-path decode-adjacent lever, and the gap vs LiteRT-LM on decode widens from 20% to 42%. The UX argument (~1 W, ~1 s TTFT, GPU-free host envelope) carries the pitch, not decode parity.

Touched:
- MOBILE_2K_COMPETITIVE_PLAN.md: retraction callout, triad update, competitive-table honesty, projection-basis rewrite, D1b removed from the execution table (B is now the only item).
- PHASE_B_DECISION.md §"What this means for the go-forward target": D1b status flipped to REGRESSED with structural cause; item 27 elevated to sole decode-adjacent lever.
- PRIORITY_ROADMAP.md item 27: footnote marking it as the single critical-path decode-adjacent item after D1b invalidation.
- HANDOFF.md: read order includes the D1b failure doc; opening prompt retracts 43 tok/s; next-session starts are item 27 OR one of PR #79's three conversion-side options.

Preserves history — callouts cite PR #79 / commit 7c21c7b rather than rewriting the prior reasoning chain. Total net-added prose ≈ 99 lines across 4 docs.
Docs only.
Merged
4 tasks
john-rocky
added a commit
that referenced
this pull request
Apr 15, 2026
…urrent ANE decode ceiling (#80) — same commit message as above, merged to main as PR #80. Co-authored-by: John Rocky <john-rocky@users.noreply.github.com>
Env-gated follow-up to PR #75. When COMPUTE_UNIT_SPLIT=1, chunk3 is loaded with MLModelConfiguration.computeUnits = .cpuAndGPU while the other chunks stay on the inherited unit (usually ANE). A one-shot probe mirrors PR #75's runConcurrencyProbe but pairs c2 (ANE) and c3 (GPU) on separate DispatchQueues.

Finding: overlap factor 0.87–0.99 across all four prompt categories (vs 0.02–0.06 on pure-ANE in PR #75). Kernel-level parallelism between the ANE and GPU drivers works as hypothesised.

Caveat: end-to-end tok/s regresses 33.2 → 25.3 (−24%) because c3 on GPU is 2.2× slower (7.5 ms → 16.6 ms) and the current serial predictStep pays that deficit without claiming the overlap prize. Realising the win needs a follow-up pipelining change.

Default-off: COMPUTE_UNIT_SPLIT unset produces zero behaviour change. 97 net-added Swift lines (one config branch in load(), one probe). See docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md for full data and the projection for a proper 2-stage pipeline (~43 tok/s ceiling, not 56).
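The env-gated config branch can be sketched as follows. ComputeUnitsChoice is an illustrative mirror of CoreML's MLComputeUnits (so the sketch stays self-contained), and computeUnits(forChunk:inherited:env:) is a hypothetical helper, not the repository's load() code.

```swift
import Foundation

// Illustrative mirror of MLComputeUnits; the real type lives in CoreML.
enum ComputeUnitsChoice {
    case all, cpuAndNeuralEngine, cpuAndGPU
}

// Hypothetical helper: pick each chunk's compute units from the env gate.
// COMPUTE_UNIT_SPLIT=1 routes chunk3 to .cpuAndGPU; every other chunk — and
// every chunk when the gate is unset — keeps the inherited unit, so an unset
// env produces zero behaviour change.
func computeUnits(forChunk index: Int,
                  inherited: ComputeUnitsChoice,
                  env: [String: String] = ProcessInfo.processInfo.environment)
    -> ComputeUnitsChoice
{
    if env["COMPUTE_UNIT_SPLIT"] == "1" && index == 3 {
        return .cpuAndGPU
    }
    return inherited
}
```

In the real load path this value would be assigned to MLModelConfiguration.computeUnits before the per-chunk MLModel is compiled, which is why the choice is baked in at load time.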
Exposes the spike's COMPUTE_UNIT_SPLIT / GPU_PREFILL env gates (plus the base MLComputeUnits) as a user-facing choice at model-selection time. The picked value is persisted via UserDefaults (ComputeMode.storageKey) and consumed by LLMRunner.loadModel, which calls setenv() for the gates and threads the matching MLComputeUnits through CoreMLLLM.load.

Modes: ANE (default) / GPU / ANE + GPU prefill / ANE + c3→GPU (spike) / All. Applied only at load time — changing the picker without reloading has no effect, matching how CoreML bakes the config into the per-chunk MLModel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
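The persist-then-setenv flow can be sketched like this. The case names, the envGates mapping, and applyEnvGates are hypothetical; only the gate names (COMPUTE_UNIT_SPLIT, GPU_PREFILL) and the storageKey pattern come from the commit message.

```swift
import Foundation

// Hypothetical mirror of the picker's modes; the real ComputeMode lives in the app target.
enum ComputeMode: String, CaseIterable {
    case ane, gpu, anePrefillGPU, aneC3GPU, all

    static let storageKey = "ComputeMode.selected"   // UserDefaults key

    // Env gates implied by each mode (gate names from the spike).
    var envGates: [String: String] {
        switch self {
        case .aneC3GPU:      return ["COMPUTE_UNIT_SPLIT": "1"]
        case .anePrefillGPU: return ["GPU_PREFILL": "1"]
        default:             return [:]
        }
    }
}

// Applied only at load time: the gates must be set before the chunk models load,
// because CoreML bakes the compute-unit config into each MLModel.
func applyEnvGates(for mode: ComputeMode) {
    for (key, value) in mode.envGates {
        setenv(key, value, 1)   // overwrite = 1
    }
}
```

Persisting the raw value (UserDefaults.standard.set(mode.rawValue, forKey: ComputeMode.storageKey)) and reading it back on the next launch keeps the picker and the load path in sync across sessions.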
Force-pushed 13950f3 to 80328a3
Closing: experiment marked "do not auto-merge". The overlap finding (0.87–0.99) is interesting, but the final project direction concluded that a Metal/GPU port does not unlock the LiteRT 56.5 ceiling (memory: 56.5 is structurally unreachable). The compute-unit picker UI is superseded by PR #99. Branch spike/d1b-compute-unit-split is preserved; docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md will be cherry-picked to main in a separate docs PR.
3 tasks
john-rocky
added a commit
that referenced
this pull request
Apr 29, 2026
…es (#159)

Cherry-picks finding documents from PRs #75/#76/#77/#79/#105 (all closed without merge) so the institutional knowledge survives the branch closures.

PR #75 (spike/d1-chunk-pipelining):
- docs/PHASE_D_PIPELINING_SPIKE.md — ANE serialises MLModel predictions at the driver level (overlap factor 0.02–0.06).

PR #76 (feat/c0-verify-logits-output):
- docs/PHASE_C_TOLERANCE_FINDINGS.md — tolerance-2 acceptance does not close the bench-vs-oracle gap (0/3 hit on pass bar 2.6).
- eval/accept-rate-bench-v6-tolerance.json — measurement data.

PR #77 (spike/d1b-compute-unit-split):
- docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md — kernel-level ANE+GPU overlap works (0.87–0.99) but e2e tok/s regresses 24% from the c3-on-GPU 2.2× slowdown.

PR #79 (feat/chunk-pipelining-d1b):
- docs/PHASE_D_PIPELINING_IMPL.md — productionised 2-stage pipeline regresses 23–25% across 4 categories. Root cause: strict linear chunk dependency chain.

PR #105 (feat/litert-perf-adoptions):
- docs/DRAFTER_DEAD_FOR_E2B.md — drafter dead for E2B (acceptance ceiling).
- docs/LITERT_PERF_ADOPTIONS.md — adoption-attempt methodology.
- docs/MLX_GAP_ANALYSIS.md — MLX comparison.
- docs/MAC_BENCH_2026-04-19.md — Mac side-by-side bench data.

Co-authored-by: John Rocky <john-rocky@users.noreply.github.com>
Summary
Env-gated follow-up to PR #75's negative result. This is an EXPERIMENT, do not auto-merge — user reviews.
- COMPUTE_UNIT_SPLIT=1 loads chunk3 with .cpuAndGPU; other chunks inherit ANE.
- Added runComputeUnitSplitProbe() (a c2/c3 variant of PR #75's probe) that measures serial vs parallel wall-clock on separate DispatchQueues.
- COMPUTE_UNIT_SPLIT unset produces zero behaviour change (verified: [Load] output and tok/s unchanged).
- 97 net-added Swift lines (one config branch in load(), one probe function). Over the 60-line target, but kept in line with PR #75's 91-line probe for comparability.

Finding
Overlap factor 0.87–0.99 across all four prompt categories (vs 0.02–0.06 on pure-ANE in PR #75). Kernel-level parallelism between ANE and GPU drivers works when two chunks go through distinct driver queues.
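One plausible way to compute an overlap factor from the probe's timings is sketched below (the spike's exact formula is in docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md; this definition — the fraction of the shorter task hidden inside the parallel wall-clock — is an assumption, as is the measureParallel harness).

```swift
import Dispatch
import Foundation

// Overlap factor: 1.0 when the shorter task is fully hidden behind the longer
// one (parallel == max(a, b)); 0.0 when the runs serialise (parallel == a + b).
func overlapFactor(serialA: Double, serialB: Double, parallel: Double) -> Double {
    (serialA + serialB - parallel) / min(serialA, serialB)
}

// Minimal probe harness: run two closures on distinct queues and time the
// parallel wall-clock in milliseconds.
func measureParallel(_ a: @escaping () -> Void, _ b: @escaping () -> Void) -> Double {
    let qA = DispatchQueue(label: "probe.a")   // stands in for the ANE-backed c2
    let qB = DispatchQueue(label: "probe.b")   // stands in for the GPU-backed c3
    let group = DispatchGroup()
    let t0 = DispatchTime.now()
    qA.async(group: group, execute: a)
    qB.async(group: group, execute: b)
    group.wait()
    return Double(DispatchTime.now().uptimeNanoseconds - t0.uptimeNanoseconds) / 1e6
}
```

Under this definition, a pure-ANE pair that serialises (parallel ≈ a + b) scores near 0, while the ANE+GPU pair scoring 0.87–0.99 means the shorter chunk's latency is almost entirely hidden.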
Caveat — end-to-end tok/s regresses with the current serial predictStep: c3 on GPU is 2.2× slower (7.5 → 16.6 ms), and the serial predictStep pays that deficit without claiming the overlap prize. Realising the win requires a follow-up pipelining change.

Verdict
(a) Overlap works → pursue full compute-unit-split pipelining. The projected ceiling with a proper 2-stage pipeline is ~43 tok/s (not 56). This is the last non-speculative decode lever and refutes the pessimistic reading of PR #75 — Mac CoreML does NOT globally serialise predictions; it only serialises within a single driver queue.
See docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md for methodology, full data, projections, and concrete next steps.

Test plan
- Default-off path verified (no [Spike] output when env unset).