docs: retract 43 tok/s projection post-D1b failure; 32 tok/s is the current ANE decode ceiling #80
Merged
…urrent ANE decode ceiling

PR #78 reframed the value prop around a triad of ~1 W power, ~1 s TTFT, and ~43 tok/s decode (projected via PR #77's compute-unit-split spike). PR #79 (open) implemented the full 2-stage pipeline that projection required and measured a 24 % regression across all 4 prompt categories, with a bit-exact failure on summary @ token 50 from fp16 rounding between ANE and GPU backends of chunk 3.

Root cause: the Gemma-4 chunk graph has a strict c3 → c4 data dep (c4 consumes c3's hidden_states_out). The only within-step overlap window is a ~1 µs Swift dict-build against ~16 ms GPU c3; the cross-step pipeline is blocked by the symmetric token-feedback edge. No non-speculative decode overlap is available on the current graph; PR #79's three future options all require conversion/-side work.

This commit retracts the ~43 tok/s projection on main and propagates the consequence: 32 tok/s is the measured ANE decode ceiling, item 27 (GPU prefill / TTFT) is now the single critical-path decode-adjacent lever, and the gap vs LiteRT-LM on decode widens from 20 % to 42 %. The UX argument (~1 W, ~1 s TTFT, GPU-free host envelope) carries the pitch, not decode parity.

Touched:
- MOBILE_2K_COMPETITIVE_PLAN.md: retraction callout, triad update, competitive table honesty, Projection basis rewrite, D1b removed from execution table (B is now the only item).
- PHASE_B_DECISION.md §"What this means for the go-forward target": D1b status flipped to REGRESSED with structural cause, item 27 elevated to sole decode-adjacent lever.
- PRIORITY_ROADMAP.md item 27: footnote marking it as the single critical-path decode-adjacent item after D1b invalidation.
- HANDOFF.md: read-order includes the D1b failure doc; opening prompt retracts 43 tok/s, next-session starts are item 27 OR one of PR #79's three conversion/-side options.

Preserves history: callouts cite PR #79 / commit 7c21c7b rather than rewriting the prior reasoning chain. Total net-added prose ≈ 99 lines across 4 docs.
Docs only.
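The figures quoted in this PR can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not code from the repository: the helper names are hypothetical, and applying the ~1 µs overlap saving to the 32 tok/s step is an assumption made purely to show how small the available window is. The inputs (23 ms/step projected, 32 tok/s measured, ~16 ms GPU c3, ~1 µs dict-build) are the ones stated in this PR.

```python
# Illustrative arithmetic only; helper names are hypothetical, not repo code.

def tokens_per_second(step_ms: float) -> float:
    """Convert a per-token decode step time (ms) into throughput (tok/s)."""
    return 1000.0 / step_ms


def step_ms_for(tok_per_s: float) -> float:
    """Convert throughput (tok/s) back into a per-token step time (ms)."""
    return 1000.0 / tok_per_s


def overlap_savings_ms(window_ms: float, other_ms: float) -> float:
    """Two concurrent pieces of work can hide at most the shorter one."""
    return min(window_ms, other_ms)


# Numbers quoted in this PR.
PROJECTED_STEP_MS = 23.0   # max(c1+c2+c4_ANE, c1+c3_GPU), if c3 and c4 overlapped
MEASURED_TOK_S = 32.0      # measured ANE decode ceiling
C3_GPU_MS = 16.0           # approximate GPU latency of chunk 3
DICT_BUILD_MS = 0.001      # ~1 µs Swift dict-build, the only within-step overlap

projected_tok_s = tokens_per_second(PROJECTED_STEP_MS)
measured_step_ms = step_ms_for(MEASURED_TOK_S)

# With a strict c3 -> c4 dependency, only the dict-build can run alongside c3.
saved_ms = overlap_savings_ms(DICT_BUILD_MS, C3_GPU_MS)

# Assumption for illustration: apply that saving to the measured step time.
best_case_tok_s = tokens_per_second(measured_step_ms - saved_ms)

print(f"projected: {projected_tok_s:.1f} tok/s at {PROJECTED_STEP_MS} ms/step")
print(f"measured:  {MEASURED_TOK_S} tok/s = {measured_step_ms} ms/step")
print(f"overlap saves {saved_ms} ms/step, best case {best_case_tok_s:.3f} tok/s")
```

Under these assumptions the overlap hides about a microsecond of a step that lasts tens of milliseconds, which is consistent with the conclusion that no meaningful non-speculative decode overlap is available on the current graph.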
Summary
- The ~43 tok/s projection (max(c1+c2+c4_ANE, c1+c3_GPU) ≈ 23 ms/step) assumed c3 and c4 could overlap.
- PR #79 implemented that pipeline and measured a 24 % regression across all 4 prompt categories, plus a bit-exact failure on `summary` at token 50 (fp16 rounding between ANE/GPU backends of c3). Root cause is a strict `c3 → c4` data dependency that leaves only a ~1 µs Swift dict-build as the within-step overlap window against ~16 ms GPU c3. Cross-step pipelining is blocked by the symmetric token-feedback edge (c3@N+1 needs c4@N).

Files touched (one line each)
- `docs/MOBILE_2K_COMPETITIVE_PLAN.md`: retraction callout at top; value prop one-liner swapped to ~1 W + ~1 s TTFT (projected, item 27) + 32 tok/s (measured, current); competitive table honest about the 42 % decode gap; §"Projection basis" `43 tok/s` subsection rewritten as `32 tok/s ceiling (measured)` with root-cause analysis; execution table collapses from A+B to B (item 27) only.
- `docs/PHASE_B_DECISION.md` §"What this means for the go-forward target": D1b item flipped to REGRESSED with structural cause (c3→c4 data dep); item 27 called out as sole tractable decode-adjacent lever; the 43 tok/s claim explicitly retracted inline.
- `docs/PRIORITY_ROADMAP.md`: item 27 footnote added marking it as the single critical-path decode-adjacent item on the roadmap after D1b invalidation.
- `docs/HANDOFF.md`: read-order now includes the D1b failure evidence (PR #79 + `PHASE_D_PIPELINING_IMPL.md` on branch); opening prompt retracts 43 tok/s; next-session start options are item 27 OR one of PR #79's three `conversion/`-side options (decoupled c4 / speculative h3 / model re-chunking).

Net: +182 / −83 across 4 docs (~99 net lines, under the 150 cap).
Scope
- Not touched: `docs/PHASE_D_COMPUTE_UNIT_SPLIT_SPIKE.md` (PR #77's spike doc). That file is not on main: PR #77 is still OPEN and has not merged, so there is nothing to edit on main. If/when #77 lands, a follow-up one-liner can add the header there.

Test plan