feat(chat): compute-unit picker + ANE-vs-GPU A/B benchmark#99

Open
john-rocky wants to merge 1 commit into main from feat/compute-unit-picker-and-ab-bench

Conversation

@john-rocky

Summary

  • Add a compute-unit picker (ANE / GPU / All) to the chat app toolbar; selecting a value reloads the current model under that MLComputeUnits setting.
  • Add an A/B benchmark ("Compare ANE vs GPU") under the Bench menu that sequentially reloads the model on each side, runs the existing sustained-throughput benchmark, and reports tok/s, battery drain, and thermal state side-by-side with a faster / lower-drain summary.
  • Make verifyANEPlacement reflect the actual runtime config: it now uses the runner's active computeUnits for decode chunks and mirrors the GPU_PREFILL env var for prefill chunks (previously hardcoded to .cpuAndNeuralEngine, so the audit could disagree with what was loaded).
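The picker-to-reload flow described above can be sketched roughly as follows. This is a minimal illustration, assuming a `ComputeUnitChoice` enum and an `LLMRunner.loadModel(configuration:)` entry point; the PR's actual names and signatures may differ.

```swift
import CoreML

// Hypothetical mapping from the toolbar picker to MLComputeUnits.
// ComputeUnitChoice and LLMRunner.loadModel are illustrative names.
enum ComputeUnitChoice: String, CaseIterable {
    case ane = "ANE", gpu = "GPU", all = "All"

    var mlComputeUnits: MLComputeUnits {
        switch self {
        case .ane: return .cpuAndNeuralEngine
        case .gpu: return .cpuAndGPU
        case .all: return .all
        }
    }
}

// On picker change, rebuild the model configuration and reload the LLM chunks.
func reload(runner: LLMRunner, choice: ComputeUnitChoice) async throws {
    let config = MLModelConfiguration()
    config.computeUnits = choice.mlComputeUnits
    // First-run ANE compile can take ~1–2 min; surfaced via the loading status.
    try await runner.loadModel(configuration: config)
}
```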

Why

Comparing on-device runtimes (e.g. ANE vs Metal/litert-llm GPU) requires both an easy switch and an apples-to-apples measurement. Previously the chat app silently picked .cpuAndNeuralEngine and there was no way to A/B without rebuilding.

Caveats handled

  • Vision/audio submodels stay on .cpuAndGPU — that's how CoreMLLLM.load always pins them; the picker only affects the LLM chunks.
  • Reload required to switch — MLComputeUnits is bound at load time, so a picker change triggers a reload (the first-run ANE compile takes ~1–2 min and is surfaced via the existing loading status).
  • Cool-down between A/B sides — 60s sleep so the thermal state from side A doesn't bleed into side B's drain/thermal numbers.
  • Original CU restored after A/B — the side that was active before the comparison is reloaded at the end.
  • GPU_PREFILL env var honored in placement audit — matches the conditional in ChunkedEngine.load.
  • Charging warning — A/B logs the same charging warning as the regular benchmark since SoC drain is unmeasurable while plugged in.
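Putting the caveats together, the A/B flow might look like the sketch below. `BenchResult` and the runner methods are assumed names for illustration, not the PR's exact API.

```swift
import Foundation

// Illustrative side-by-side result; fields mirror what the summary reports.
struct BenchResult {
    let tokensPerSecond: Double
    let batteryDrain: Double
    let thermalState: ProcessInfo.ThermalState
}

// Hypothetical A/B driver: reload on each side, bench, cool down, restore.
func runABBenchmark(runner: LLMRunner, duration: TimeInterval) async throws
    -> (ane: BenchResult, gpu: BenchResult)
{
    let original = runner.computeUnits  // remember the side active before the comparison

    try await runner.loadModel(computeUnits: .cpuAndNeuralEngine)
    let ane = try await runner.runSustainedBenchmark(duration: duration)

    // 60 s cool-down so side A's thermal state doesn't bleed into side B's numbers.
    try await Task.sleep(nanoseconds: 60 * 1_000_000_000)

    try await runner.loadModel(computeUnits: .cpuAndGPU)
    let gpu = try await runner.runSustainedBenchmark(duration: duration)

    try await runner.loadModel(computeUnits: original)  // restore the original CU
    return (ane, gpu)
}
```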

Test plan

  • Load Gemma 4 E2B on iPhone, switch picker between ANE / GPU / All — model reloads and chat works on each.
  • Run "ANE?" after each switch — header shows the matching cfg=... and per-chunk dispatch counts move between ANE/GPU as expected.
  • Run "Compare ANE vs GPU" 2 min each unplugged — final summary shows tok/s ratio and lower-drain side; runner returns to the originally selected side.
  • Set GPU_PREFILL=1 in the scheme env — "ANE?" reports prefill chunks as GPU even when picker is on ANE.
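The GPU_PREFILL behavior exercised in the last test step could be mirrored in the placement audit along these lines. A hedged sketch; the variable names are illustrative, and only the env var name comes from the PR.

```swift
import Foundation
import CoreML

// Sketch: the audit derives expected placement from the runtime config
// rather than a hardcoded .cpuAndNeuralEngine.
let gpuPrefill = ProcessInfo.processInfo.environment["GPU_PREFILL"] == "1"

// Decode chunks are audited against the runner's active compute units;
// prefill chunks follow GPU_PREFILL, matching the conditional in ChunkedEngine.load.
let decodeUnits: MLComputeUnits = runner.computeUnits
let prefillUnits: MLComputeUnits = gpuPrefill ? .cpuAndGPU : runner.computeUnits
```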

- LLMRunner: expose computeUnits; loadModel accepts override
- verifyANEPlacement now uses the active computeUnits (and mirrors
  GPU_PREFILL env var for prefill chunks) so the audit matches reality
- Add runABBenchmark: sequential reload+bench for each compute unit,
  60s cool-down between sides, restores original CU at end
- ChatView: toolbar picker (ANE / GPU / All) that triggers a reload,
  and "Compare ANE vs GPU" menu producing side-by-side tok/s, drain,
  thermal results with a faster/lower-drain summary
