docs: retract 43 tok/s projection post-D1b failure; 32 tok/s is the current ANE decode ceiling by john-rocky · Pull Request #80 · john-rocky/CoreML-LLM

john-rocky · 2026-04-15T09:40:05Z

Summary

PR #78 ("docs: retire 56 tok/s target; ANE-native value prop (power + TTFT + 43 tok/s)") reframed the value prop around a triad of ~1 W power, ~1 s TTFT, and ~43 tok/s decode. The 43 tok/s number came from the compute-unit-split spike projection in PR #77 ("spike(D1b): compute-unit split — chunk3 on .cpuAndGPU enables overlap"), max(c1+c2+c4_ANE, c1+c3_GPU) ≈ 23 ms/step, which assumed that c3 and c4 could overlap.
PR #79 ("feat(pipelining): chunk3 async on .cpuAndGPU — negative result (STOP, do not merge as default)") empirically invalidated that assumption: the full 2-stage pipeline regressed by 24 % on all 4 prompt categories (baseline 32.8–33.2 → pipelined 24.9–25.5 tok/s on Mac Studio), and it also failed the bit-exact check on the summary prompt at token 50 (fp16 rounding differs between the ANE and GPU backends of c3). The root cause is a strict c3 → c4 data dependency: the only within-step overlap window is a ~1 µs Swift dict-build against ~16 ms of GPU c3. Cross-step pipelining is blocked by the symmetric token-feedback edge (c3@N+1 needs c4@N).
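The dependency argument above can be sketched as a critical-path calculation. This is an illustration only: the ~16 ms figure for GPU c3 comes from the PR, but the c1, c2, and c4 latencies below are hypothetical placeholders chosen so that the projected step lands on 23 ms, and the edge set is inferred from the projection formula rather than taken from the real chunk graph.

```python
from functools import lru_cache


def critical_path_ms(durations, deps):
    """Finish time of a decode step: the longest path through the dependency graph."""

    @lru_cache(maxsize=None)
    def finish(node):
        # A node starts once every node it depends on has finished.
        start = max((finish(d) for d in deps.get(node, ())), default=0.0)
        return start + durations[node]

    return max(finish(node) for node in durations)


# Hypothetical per-chunk latencies in ms. Only c3 (~16 ms on GPU) is from the PR.
DURATIONS_MS = {"c1": 3.0, "c2": 10.0, "c3": 16.0, "c4": 10.0}

# Graph implied by the projection max(c1+c2+c4_ANE, c1+c3_GPU): c3 runs beside c2 and c4.
PROJECTED_DEPS = {"c2": ("c1",), "c3": ("c1",), "c4": ("c2",)}

# Same graph plus the edge the projection missed: c4 consumes c3's output.
ACTUAL_DEPS = {**PROJECTED_DEPS, "c4": ("c2", "c3")}

projected_ms = critical_path_ms(DURATIONS_MS, PROJECTED_DEPS)
actual_ms = critical_path_ms(DURATIONS_MS, ACTUAL_DEPS)

print(f"projected step: {projected_ms} ms, with c3 -> c4 edge: {actual_ms} ms")
```

With the c3 → c4 edge in place, c4 cannot start until c3 finishes, so the step time becomes max(c1+c2, c1+c3) + c4. For any positive latencies that is at least as long as the projected figure, which is why moving c3 to the GPU cannot buy within-step overlap.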
This PR retracts the 43 tok/s projection on main and propagates the consequence: 32 tok/s is the measured ANE decode ceiling, the decode gap vs LiteRT-LM widens from ~20 % to ~42 %, and item 27 (GPU prefill / TTFT axis) is now the single critical-path decode-adjacent lever. The UX argument (~1 W, ~1 s TTFT, GPU-free host envelope; 32 tok/s is still ~6× human read speed) carries the pitch, not decode parity.
Retract, don't delete. All callouts cite PR #79 / commit 7c21c7b rather than rewriting the prior reasoning chain.
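As a quick sanity check on the figures quoted in this summary, the arithmetic can be reproduced as below. This is a sketch, not the measurement harness: pairing the low and high ends of the baseline and pipelined ranges is an assumption, since the per-category pairs are not listed here, and the function names are made up for illustration.

```python
def tok_per_s(ms_per_step: float) -> float:
    """Decode throughput implied by a per-step latency (one token per step)."""
    return 1000.0 / ms_per_step


def ms_per_step(tokens_per_second: float) -> float:
    """Per-step latency implied by a decode throughput."""
    return 1000.0 / tokens_per_second


def regression_pct(baseline: float, candidate: float) -> float:
    """Throughput lost relative to the baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


# Retracted projection: ~23 ms/step, roughly 43.5 tok/s.
projected_tps = tok_per_s(23.0)

# Measured ceiling: 32 tok/s is 31.25 ms/step.
ceiling_ms = ms_per_step(32.0)

# Regression range, pairing the range endpoints (an assumption).
regression_low = regression_pct(33.2, 25.5)
regression_high = regression_pct(32.8, 24.9)

# Within-step overlap window (~1 us) as a fraction of GPU c3 time (~16 ms).
overlap_fraction = 1e-6 / 16e-3

print(f"projected: {projected_tps:.1f} tok/s, ceiling: {ceiling_ms:.2f} ms/step")
print(f"regression: {regression_low:.1f}% to {regression_high:.1f}%")
print(f"overlap window / c3 time: {overlap_fraction:.2e}")
```

Under that pairing the regression comes out between roughly 23 % and 24 %, consistent with the 24 % headline, and the overlap window is a negligible fraction of the c3 time it would need to hide.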

Files touched (one line each)

docs/MOBILE_2K_COMPETITIVE_PLAN.md — retraction callout at top; value prop one-liner swapped to ~1 W + ~1 s TTFT (projected, item 27) + 32 tok/s (measured, current); competitive table honest about the 42 % decode gap; §"Projection basis" 43 tok/s subsection rewritten as 32 tok/s ceiling (measured) with root-cause analysis; execution table collapses from A+B to B (item 27) only.
docs/PHASE_B_DECISION.md §"What this means for the go-forward target" — D1b item flipped to REGRESSED with structural cause (c3→c4 data dep); item 27 called out as sole tractable decode-adjacent lever; the 43 tok/s claim explicitly retracted inline.
docs/PRIORITY_ROADMAP.md — item 27 footnote added marking it as the single critical-path decode-adjacent item on the roadmap after D1b invalidation.
docs/HANDOFF.md — read-order now includes the D1b failure evidence (PR #79 + PHASE_D_PIPELINING_IMPL.md on branch); opening prompt retracts 43 tok/s; next-session start options are item 27 OR one of PR #79's three conversion/-side options (decoupled c4 / speculative h3 / model re-chunking).

Net: +182 / −83 across 4 docs (~99 net lines, under the 150 cap).

Scope

Docs only. Zero code.
Does not touch finding docs (PHASE_B_V3/V4, PHASE_C_, BASELINE_) — historical reasoning preserved.
Note: task item 3 asked for a retraction header on docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md (the spike doc from PR #77). That file is not on main: PR #77 is still OPEN and has not merged, so there is nothing to edit on main. If and when PR #77 lands, a follow-up one-liner can add the header there.

Test plan

Diff reads honestly: decode gap is stated as 42 %, not massaged.
No strategic decisions reopened — only the numerical claim changes.
Every retracted claim cites PR #79 / commit 7c21c7b.
Commit author is John Rocky; no "claude" in message or committer.

…urrent ANE decode ceiling

PR #78 reframed the value prop around a triad of ~1 W power, ~1 s TTFT, and ~43 tok/s decode (projected via PR #77's compute-unit-split spike). PR #79 (open) implemented the full 2-stage pipeline that projection required and measured a 24 % regression across all 4 prompt categories, with a bit-exact failure on summary @ token 50 from fp16 rounding between ANE and GPU backends of chunk 3.

Root cause: the Gemma-4 chunk graph has a strict c3 → c4 data dep (c4 consumes c3's hidden_states_out). The only within-step overlap window is a ~1 µs Swift dict-build against ~16 ms GPU c3; the cross-step pipeline is blocked by the symmetric token-feedback edge. No non-speculative decode overlap is available on the current graph; PR #79's three future options all require conversion/-side work.

This commit retracts the ~43 tok/s projection on main and propagates the consequence: 32 tok/s is the measured ANE decode ceiling, item 27 (GPU prefill / TTFT) is now the single critical-path decode-adjacent lever, and the gap vs LiteRT-LM on decode widens from 20 % to 42 %. The UX argument (~1 W, ~1 s TTFT, GPU-free host envelope) carries the pitch, not decode parity.

Touched:
- MOBILE_2K_COMPETITIVE_PLAN.md: retraction callout, triad update, competitive table honesty, Projection basis rewrite, D1b removed from execution table (B is now the only item).
- PHASE_B_DECISION.md §"What this means for the go-forward target": D1b status flipped to REGRESSED with structural cause, item 27 elevated to sole decode-adjacent lever.
- PRIORITY_ROADMAP.md item 27: footnote marking it as the single critical-path decode-adjacent item after D1b invalidation.
- HANDOFF.md: read-order includes the D1b failure doc; opening prompt retracts 43 tok/s, next-session starts are item 27 OR one of PR #79's three conversion/-side options.

Preserves history: callouts cite PR #79 / commit 7c21c7b rather than rewriting the prior reasoning chain. Total net-added prose ≈ 99 lines across 4 docs.
Docs only.

john-rocky merged commit fe646c1 into main Apr 15, 2026

john-rocky deleted the docs/retract-43-tok-s-projection branch April 15, 2026 09:40
