feat(chat): compute-unit picker + ANE-vs-GPU A/B benchmark#99

Open
john-rocky wants to merge 1 commit into main from feat/compute-unit-picker-and-ab-bench

Conversation

@john-rocky

Summary

  • Add a compute-unit picker (ANE / GPU / All) to the chat app toolbar; selecting a value reloads the current model under that MLComputeUnits setting.
  • Add an A/B benchmark ("Compare ANE vs GPU") under the Bench menu that sequentially reloads the model on each side, runs the existing sustained-throughput benchmark, and reports tok/s, battery drain, and thermal state side-by-side with a faster / lower-drain summary.
  • Make verifyANEPlacement reflect the actual runtime config: it now uses the runner's active computeUnits for decode chunks and mirrors the GPU_PREFILL env var for prefill chunks (previously hardcoded to .cpuAndNeuralEngine, so the audit could disagree with what was loaded).
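The picker-to-reload flow described above can be sketched roughly as follows. This is a minimal illustration, assuming a `ComputeUnitChoice` enum and an `LLMRunner.loadModel(configuration:)` entry point; the PR's actual names and signatures may differ.

```swift
import CoreML

// Hypothetical mapping from the toolbar picker to MLComputeUnits.
// ComputeUnitChoice and LLMRunner.loadModel are illustrative names.
enum ComputeUnitChoice: String, CaseIterable {
    case ane = "ANE", gpu = "GPU", all = "All"

    var mlComputeUnits: MLComputeUnits {
        switch self {
        case .ane: return .cpuAndNeuralEngine
        case .gpu: return .cpuAndGPU
        case .all: return .all
        }
    }
}

// On picker change, rebuild the model configuration and reload the LLM chunks.
func reload(runner: LLMRunner, choice: ComputeUnitChoice) async throws {
    let config = MLModelConfiguration()
    config.computeUnits = choice.mlComputeUnits
    // First-run ANE compile can take ~1–2 min; surfaced via the loading status.
    try await runner.loadModel(configuration: config)
}
```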

Why

Comparing on-device runtimes (e.g. ANE vs Metal/litert-llm GPU) requires both an easy switch and an apples-to-apples measurement. Previously the chat app silently picked .cpuAndNeuralEngine and there was no way to A/B without rebuilding.

Caveats handled

  • Vision/audio submodels stay on .cpuAndGPU — that's how CoreMLLLM.load always pins them; the picker only affects the LLM chunks.
  • Reload required to switch — MLComputeUnits is bound at load time, so a picker change triggers a reload (the first-run ANE compile takes ~1–2 min and is surfaced via the existing loading status).
  • Cool-down between A/B sides — 60s sleep so the thermal state from side A doesn't bleed into side B's drain/thermal numbers.
  • Original CU restored after A/B — the side that was active before the comparison is reloaded at the end.
  • GPU_PREFILL env var honored in placement audit — matches the conditional in ChunkedEngine.load.
  • Charging warning — A/B logs the same charging warning as the regular benchmark since SoC drain is unmeasurable while plugged in.
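Putting the caveats together, the A/B flow might look like the sketch below. `BenchResult` and the runner methods are assumed names for illustration, not the PR's exact API.

```swift
import Foundation

// Illustrative side-by-side result; fields mirror what the summary reports.
struct BenchResult {
    let tokensPerSecond: Double
    let batteryDrain: Double
    let thermalState: ProcessInfo.ThermalState
}

// Hypothetical A/B driver: reload on each side, bench, cool down, restore.
func runABBenchmark(runner: LLMRunner, duration: TimeInterval) async throws
    -> (ane: BenchResult, gpu: BenchResult)
{
    let original = runner.computeUnits  // remember the side active before the comparison

    try await runner.loadModel(computeUnits: .cpuAndNeuralEngine)
    let ane = try await runner.runSustainedBenchmark(duration: duration)

    // 60 s cool-down so side A's thermal state doesn't bleed into side B's numbers.
    try await Task.sleep(nanoseconds: 60 * 1_000_000_000)

    try await runner.loadModel(computeUnits: .cpuAndGPU)
    let gpu = try await runner.runSustainedBenchmark(duration: duration)

    try await runner.loadModel(computeUnits: original)  // restore the original CU
    return (ane, gpu)
}
```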

Test plan

  • Load Gemma 4 E2B on iPhone, switch picker between ANE / GPU / All — model reloads and chat works on each.
  • Run "ANE?" after each switch — header shows the matching cfg=... and per-chunk dispatch counts move between ANE/GPU as expected.
  • Run "Compare ANE vs GPU" 2 min each unplugged — final summary shows tok/s ratio and lower-drain side; runner returns to the originally selected side.
  • Set GPU_PREFILL=1 in the scheme env — "ANE?" reports prefill chunks as GPU even when picker is on ANE.
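The GPU_PREFILL behavior exercised in the last test step could be mirrored in the placement audit along these lines. A hedged sketch; the variable names are illustrative, and only the env var name comes from the PR.

```swift
import Foundation
import CoreML

// Sketch: the audit derives expected placement from the runtime config
// rather than a hardcoded .cpuAndNeuralEngine.
let gpuPrefill = ProcessInfo.processInfo.environment["GPU_PREFILL"] == "1"

// Decode chunks are audited against the runner's active compute units;
// prefill chunks follow GPU_PREFILL, matching the conditional in ChunkedEngine.load.
let decodeUnits: MLComputeUnits = runner.computeUnits
let prefillUnits: MLComputeUnits = gpuPrefill ? .cpuAndGPU : runner.computeUnits
```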

- LLMRunner: expose computeUnits; loadModel accepts override
- verifyANEPlacement now uses the active computeUnits (and mirrors
  GPU_PREFILL env var for prefill chunks) so the audit matches reality
- Add runABBenchmark: sequential reload+bench for each compute unit,
  60s cool-down between sides, restores original CU at end
- ChatView: toolbar picker (ANE / GPU / All) that triggers a reload,
  and "Compare ANE vs GPU" menu producing side-by-side tok/s, drain,
  thermal results with a faster/lower-drain summary
