[CoreML EP] Support Gather with scalar 'indices'#28278
Force-pushed ee07752 to 28796b4
The CoreML GatherOpBuilder rejected rank-0 (scalar) indices because CoreML's `gather` requires rank-1+ indices and the workaround would change the output rank. Fixes microsoft#28180 by performing the workaround internally: when indices are scalar, the builder now emits reshape(indices, shape=[1]) -> indices_1d gather(data, indices_1d, axis) -> shape data_shape with axis=1 squeeze(., axes=[axis]) -> ONNX gather output shape in both the MLProgram and NeuralNetwork emitters. We use `reshape` rather than `expand_dims` because CoreML internally pads scalars and expand_dims on the padded tensor can push the apparent rank past CoreML's rank-5 limit. Restrictions: - 'data' must have a fully static shape (we claim a static intermediate shape between gather and squeeze). - 'data' rank capped at 4 (the rank-5 case still trips CoreML's compiler with "Invalid rank: 6", so we keep the conservative boundary). Dynamic-shape and rank-5+ scalar Gather still falls back to CPU. Motivated by StyleGAN-family generators (e.g. GFPGAN), which select per-layer style codes with a scalar-index Gather. Combined with microsoft#28270 (pre-opset-13 Split), GFPGAN-1024 (input 1x3x512x512) goes from 9 partitions / 534 CoreML nodes to 1 partition / 557 CoreML nodes on M3 Max. Adds six tests covering both NeuralNetwork and MLProgram emitters: GatherScalarIndicesAxis1 - axis=1, mid-rank squeeze GatherScalarIndicesAxis0 - axis=0, leading-axis squeeze GatherScalarIndicesNegativeAxis - axis=-1, HandleNegativeAxis path GatherScalarIndicesFloat16 - fp16 data (MLProgram only) GatherScalarIndicesInt64Data - int64 data, both formats GatherScalarIndicesRank4Data - rank-4 data, exercises the cap Updates the comment on the existing GatherWithScalarIndices test to reflect that scalar Gather is now supported when 'data' is fully static and rank-4 or below; the test continues to exercise the dynamic-shape fall-back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 28796b4 to 8d966dd
Resolves conflict in coreml_basic_test.cc where this branch's GatherScalarIndices* tests landed in the same file region as the Split11/13/7 tests merged via microsoft#28270. Both sets are preserved sequentially (Gather first, then Split). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four small follow-ups based on review-style scrutiny of the scalar
'indices' path:
1. Validate `indices` dtype in IsOpSupportedImpl. The MLProgram dispatch
reads `indices_dtype` via GetType and silently defaulted to INT32 if
the call failed. ONNX Gather's schema constrains indices to INT32 or
INT64; reject anything else up front so the dispatch can trust the
value (and ORT_RETURN_IF_NOT the GetType call as a defensive backstop).
2. Add a TODO above the data-rank-4 cap explaining the underlying CoreML
compiler bug ("Invalid rank: 6" on rank-5 reshape+gather intermediates)
and noting the cap can be lifted to 5 when Apple fixes it.
3. Document why the NN-format scalar path uses expand_dims rather than
reshape, parallel to the existing comment on the MLProgram path. NN's
expand_dims doesn't pad rank internally, so we don't hit the
apparent-rank inflation that drove reshape+gather there.
4. Two new negative tests:
- GatherScalarIndicesDynamicDataNotSupported — dynamic 'data' shape
with scalar indices falls back (was rejected silently; now covered).
- GatherScalarIndicesRank5DataNotSupported — rank-5 'data' with
scalar indices falls back (the bug-driven cap above is now tested).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brief comment explaining that the static_cast normalises the proto-generated enum (whose underlying type isn't guaranteed) to the codebase's int32_t convention for ONNX dtype codes — and that the "INT64" in the constant name refers to the dtype tag (7), not int64 data being narrowed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The proto enum *is* typed (`enum TensorProto_DataType : int` per onnx-ml.pb.h), so the previous "underlying type isn't guaranteed" was wrong. The actual reason for the cast is narrower: `int` and `int32_t` are distinct C++ types (even when layout-compatible on the platforms ORT targets), and the codebase uses `int32_t` for ONNX dtype codes. Comment now states that accurately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two constexpr dtype tags only ever appeared in `indices_dtype == kInt64 ? kInt64 : kInt32`, which was a no-op once IsOpSupportedImpl gates indices to exactly INT32 or INT64 — the expression simplifies to just `indices_dtype`. Drop the constants and pass `indices_dtype` directly to AddIntermediateOperationOutput. This also makes the dtype handling consistent with the surrounding `output_dtype` line, which already uses an inline static_cast on the proto enum without a named constant. (Net effect: less code, less to read, no behavior change. Spotted in review.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The other six positive scalar-indices tests use INT64 indices (the PyTorch default), so the INT32-indices code path — through both the new dtype gating in IsOpSupportedImpl and the indices_dtype pass-through to the reshape's intermediate output — was untested. Adds GatherScalarIndicesInt32Indices to cover it on both NN and MLProgram formats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tyle The earlier rewrite of the Gather builder dropped the "// coreml docs claims validate_indices is optional but in practice it is required" comment and changed `const auto validate_indices = false;` to `constexpr bool validate_indices = false;`. Both were drive-by churn unrelated to the scalar-indices fix this PR is supposed to add. The comment in particular is institutional context — without it, the next reader sees `validate_indices = false` being passed via AddScalarConstant and reasonably wonders why we bother (the CoreML docs say it's optional). Restoring it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#   onnxruntime/test/providers/coreml/coreml_basic_test.cc
@yuslepukhin thank you again for all your support here 👍
```cpp
} else {
  // gather output here has the data's rank (one more than ONNX scalar-gather output);
  // squeeze the inserted axis to recover the original output shape.
  std::vector<int64_t> gather_shape = data_shape;
```
Prefer TensorShapeVector for inline storage.
```cpp
// For scalar inputs, the input shape is modified from [] -> [1] before passing the input to CoreML.
// This won't work for Gather because the output shape depends on the `indices` input shape which could be a scalar.
// Currently, we expect the CoreML EP to only take the Shape node in this graph (Gather -> Shape).
// The CoreML EP supports scalar 'indices' for Gather only when the 'data' input has a fully
```
Minor gap — rank-1 data with scalar indices: No test for e.g. data shape {6} with scalar index (output is scalar/rank-0). If CoreML's squeeze produces a scalar, this should work, but it's untested. The output-rank check data_shape.size() + indices_shape.size() - 1 > 5 evaluates to 1 + 0 - 1 = 0, which passes, and the rank-4 cap also passes (1 <= 4), so the code would attempt it. Whether CoreML handles a rank-0 squeeze output correctly is the question.
```cpp
// Output rank = data_rank + indices_rank - 1. The rank-5 limit applies.
```
Potential underflow in output-rank check (line ~225 in the original): data_shape.size() + indices_shape.size() - 1 — when indices is scalar (indices_shape.size() == 0) and data rank is 1, this is 1 + 0 - 1 = 0, which is fine. But for size_t arithmetic, if somehow both were 0 (can't happen since data rank >= 1 is checked), this would underflow. Not a real bug since data rank >= 1 is enforced by ONNX schema and GetShape succeeds, but worth noting the unsigned arithmetic.
This, however, depends on ONNX enforcement code.
Description
The CoreML `GatherOpBuilder` rejected rank-0 (scalar) `indices` because CoreML's `gather` requires rank-1+ indices and the obvious workaround would change the output rank — see the `// Don't allow scalar 'indices' input.` comment at `gather_op_builder.cc:90`. This PR performs the workaround internally:
```
reshape(indices, shape=[1]) -> indices_1d
gather(data, indices_1d, axis) -> shape = data_shape with size 1 at axis
squeeze(., axes=[axis]) -> ONNX gather output shape
```
…in both the MLProgram and NeuralNetwork emitters. The squeeze restores the original ONNX output rank, so caller-visible Gather semantics are unchanged. `reshape` is used rather than `expand_dims` because CoreML internally pads scalars and `expand_dims` on the padded tensor can push the apparent rank past the rank-5 limit on high-rank `data`.
Restrictions:
- 'data' must have a fully static shape (the builder claims a static intermediate shape between the gather and the squeeze).
- 'data' rank is capped at 4 (rank-5 data still trips CoreML's compiler with "Invalid rank: 6").

Dynamic-shape and rank-5+ scalar Gather still fall back to CPU (this preserves the existing `GatherWithScalarIndices` test, whose data is dynamic-shape).
Fixes #28180.
Motivation
StyleGAN-family generators (StyleGAN, StyleGAN2, GFPGAN, …) select per-layer style codes with a scalar-index Gather. The resulting graph alternates between Gather and the rest of the generator, splitting the CoreML subgraph repeatedly.
On GFPGAN-1024 (input `[1, 3, 512, 512]`), this PR moves all 16 scalar Gathers off CPU and the model lands on a single CoreML partition.

M3 Max, MLProgram, batch 1, 3 × 100-iter steady-state runs (n=300):
Mean −8.8%, stddev −50%, P99 −17%, max −18.5% — eliminating the CPU↔CoreML round-trip on every scalar Gather both speeds up the steady state and tightens the tail.
Striking secondary effect: the worst-case run with the fix (88.03 ms) is faster than the mean run without it (89.68 ms). Every single fixed inference over n=300 lands below the unfixed average.
Tests
Six new tests in `onnxruntime/test/providers/coreml/coreml_basic_test.cc` covering distinct code paths, exercised on both NeuralNetwork and MLProgram emitters where the dtype is supported:
- `GatherScalarIndicesAxis1` - axis=1, mid-rank squeeze
- `GatherScalarIndicesAxis0` - axis=0, leading-axis squeeze
- `GatherScalarIndicesNegativeAxis` - axis=-1, HandleNegativeAxis path
- `GatherScalarIndicesFloat16` - fp16 data (MLProgram only)
- `GatherScalarIndicesInt64Data` - int64 data, both formats
- `GatherScalarIndicesRank4Data` - rank-4 data, exercises the cap
Each verifies CoreML output against the CPU EP reference and asserts `ExpectedEPNodeAssignment::All`. The existing `GatherWithScalarIndices` test (dynamic-shape data) is updated only in its comment to reflect the new precise condition; it still exercises the CPU fall-back as before.
All pass locally on macOS 26.3 / M3 Max.