[CoreML EP] Support Gather with scalar 'indices'#28278
Force-pushed ee07752 to 28796b4
The CoreML GatherOpBuilder rejected rank-0 (scalar) indices because CoreML's `gather` requires rank-1+ indices and the workaround would change the output rank. Fixes microsoft#28180 by performing the workaround internally: when indices are scalar, the builder now emits reshape(indices, shape=[1]) -> indices_1d gather(data, indices_1d, axis) -> shape data_shape with axis=1 squeeze(., axes=[axis]) -> ONNX gather output shape in both the MLProgram and NeuralNetwork emitters. We use `reshape` rather than `expand_dims` because CoreML internally pads scalars and expand_dims on the padded tensor can push the apparent rank past CoreML's rank-5 limit. Restrictions: - 'data' must have a fully static shape (we claim a static intermediate shape between gather and squeeze). - 'data' rank capped at 4 (the rank-5 case still trips CoreML's compiler with "Invalid rank: 6", so we keep the conservative boundary). Dynamic-shape and rank-5+ scalar Gather still falls back to CPU. Motivated by StyleGAN-family generators (e.g. GFPGAN), which select per-layer style codes with a scalar-index Gather. Combined with microsoft#28270 (pre-opset-13 Split), GFPGAN-1024 (input 1x3x512x512) goes from 9 partitions / 534 CoreML nodes to 1 partition / 557 CoreML nodes on M3 Max. Adds six tests covering both NeuralNetwork and MLProgram emitters: GatherScalarIndicesAxis1 - axis=1, mid-rank squeeze GatherScalarIndicesAxis0 - axis=0, leading-axis squeeze GatherScalarIndicesNegativeAxis - axis=-1, HandleNegativeAxis path GatherScalarIndicesFloat16 - fp16 data (MLProgram only) GatherScalarIndicesInt64Data - int64 data, both formats GatherScalarIndicesRank4Data - rank-4 data, exercises the cap Updates the comment on the existing GatherWithScalarIndices test to reflect that scalar Gather is now supported when 'data' is fully static and rank-4 or below; the test continues to exercise the dynamic-shape fall-back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 28796b4 to 8d966dd
Resolves conflict in coreml_basic_test.cc where this branch's GatherScalarIndices* tests landed in the same file region as the Split11/13/7 tests merged via microsoft#28270. Both sets are preserved sequentially (Gather first, then Split). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four small follow-ups based on review-style scrutiny of the scalar
'indices' path:
1. Validate `indices` dtype in IsOpSupportedImpl. The MLProgram dispatch
reads `indices_dtype` via GetType and silently defaulted to INT32 if
the call failed. ONNX Gather's schema constrains indices to INT32 or
INT64; reject anything else up front so the dispatch can trust the
value (and ORT_RETURN_IF_NOT the GetType call as a defensive backstop).
2. Add a TODO above the data-rank-4 cap explaining the underlying CoreML
compiler bug ("Invalid rank: 6" on rank-5 reshape+gather intermediates)
and noting the cap can be lifted to 5 when Apple fixes it.
3. Document why the NN-format scalar path uses expand_dims rather than
reshape, parallel to the existing comment on the MLProgram path. NN's
expand_dims doesn't pad rank internally, so we don't hit the
apparent-rank inflation that drove reshape+gather there.
4. Two new negative tests:
- GatherScalarIndicesDynamicDataNotSupported — dynamic 'data' shape
with scalar indices falls back (was rejected silently; now covered).
- GatherScalarIndicesRank5DataNotSupported — rank-5 'data' with
scalar indices falls back (the bug-driven cap above is now tested).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brief comment explaining that the static_cast normalises the proto-generated enum (whose underlying type isn't guaranteed) to the codebase's int32_t convention for ONNX dtype codes — and that the "INT64" in the constant name refers to the dtype tag (7), not int64 data being narrowed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The proto enum *is* typed (`enum TensorProto_DataType : int` per onnx-ml.pb.h), so the previous "underlying type isn't guaranteed" was wrong. The actual reason for the cast is narrower: `int` and `int32_t` are distinct C++ types (even when layout-compatible on the platforms ORT targets), and the codebase uses `int32_t` for ONNX dtype codes. Comment now states that accurately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two constexpr dtype tags only ever appeared in `indices_dtype == kInt64 ? kInt64 : kInt32`, which was a no-op once IsOpSupportedImpl gates indices to exactly INT32 or INT64 — the expression simplifies to just `indices_dtype`. Drop the constants and pass `indices_dtype` directly to AddIntermediateOperationOutput. This also makes the dtype handling consistent with the surrounding `output_dtype` line, which already uses an inline static_cast on the proto enum without a named constant. (Net effect: less code, less to read, no behavior change. Spotted in review.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The other six positive scalar-indices tests use INT64 indices (the PyTorch default), so the INT32-indices code path — through both the new dtype gating in IsOpSupportedImpl and the indices_dtype pass-through to the reshape's intermediate output — was untested. Adds GatherScalarIndicesInt32Indices to cover it on both NN and MLProgram formats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tyle The earlier rewrite of the Gather builder dropped the "// coreml docs claims validate_indices is optional but in practice it is required" comment and changed `const auto validate_indices = false;` to `constexpr bool validate_indices = false;`. Both were drive-by churn unrelated to the scalar-indices fix this PR is supposed to add. The comment in particular is institutional context — without it, the next reader sees `validate_indices = false` being passed via AddScalarConstant and reasonably wonders why we bother (the CoreML docs say it's optional). Restoring it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#   onnxruntime/test/providers/coreml/coreml_basic_test.cc
@yuslepukhin thank you again for all your support here 👍
```cpp
} else {
  // gather output here has the data's rank (one more than ONNX scalar-gather output);
  // squeeze the inserted axis to recover the original output shape.
  std::vector<int64_t> gather_shape = data_shape;
```
Prefer TensorShapeVector for inline storage.
```cpp
// For scalar inputs, the input shape is modified from [] -> [1] before passing the input to CoreML.
// This won't work for Gather because the output shape depends on the `indices` input shape which could be a scalar.
// Currently, we expect the CoreML EP to only take the Shape node in this graph (Gather -> Shape).
// The CoreML EP supports scalar 'indices' for Gather only when the 'data' input has a fully
```
Minor gap — rank-1 data with scalar indices: No test for e.g. data shape {6} with scalar index (output is scalar/rank-0). If CoreML's squeeze produces a scalar, this should work, but it's untested. The output-rank check data_shape.size() + indices_shape.size() - 1 > 5 evaluates to 1 + 0 - 1 = 0, which passes, and the rank-4 cap also passes (1 <= 4), so the code would attempt it. Whether CoreML handles a rank-0 squeeze output correctly is the question.
```cpp
// Output rank = data_rank + indices_rank - 1. The rank-5 limit applies.
```
Potential underflow in output-rank check (line ~225 in the original): data_shape.size() + indices_shape.size() - 1 — when indices is scalar (indices_shape.size() == 0) and data rank is 1, this is 1 + 0 - 1 = 0, which is fine. But for size_t arithmetic, if somehow both were 0 (can't happen since data rank >= 1 is checked), this would underflow. Not a real bug since data rank >= 1 is enforced by ONNX schema and GetShape succeeds, but worth noting the unsigned arithmetic.
This, however, depends on ONNX enforcement code.
Description
The CoreML `GatherOpBuilder` rejected rank-0 (scalar) `indices` because CoreML's `gather` requires rank-1+ indices and the obvious workaround would change the output rank — see the `// Don't allow scalar 'indices' input.` comment at `gather_op_builder.cc:90`. This PR performs the workaround internally:
```
reshape(indices, shape=[1]) -> indices_1d
gather(data, indices_1d, axis) -> shape = data_shape with size 1 at axis
squeeze(., axes=[axis]) -> ONNX gather output shape
```
…in both the MLProgram and NeuralNetwork emitters. The squeeze restores the original ONNX output rank, so caller-visible Gather semantics are unchanged. `reshape` is used rather than `expand_dims` because CoreML internally pads scalars and `expand_dims` on the padded tensor can push the apparent rank past the rank-5 limit on high-rank `data`.
Restrictions:
- 'data' must have a fully static shape (the builder claims a static intermediate shape between the gather and the squeeze).
- 'data' rank is capped at 4 (rank-5 data still trips CoreML's compiler with "Invalid rank: 6").

Dynamic-shape and rank-5+ scalar Gather still fall back to CPU (this preserves the existing `GatherWithScalarIndices` test, whose data is dynamic-shape).
Fixes #28180.
Motivation
StyleGAN-family generators (StyleGAN, StyleGAN2, GFPGAN, …) select per-layer style codes with a scalar-index Gather. The resulting graph alternates between Gather and the rest of the generator, splitting the CoreML subgraph repeatedly.
On GFPGAN-1024 (input `[1, 3, 512, 512]`), this PR moves all 16 scalar Gathers off CPU and the model lands on a single CoreML partition.

M3 Max, MLProgram, batch 1, 3 × 100-iter steady-state runs (n=300):
Mean −8.8%, stddev −50%, P99 −17%, max −18.5% — eliminating the CPU↔CoreML round-trip on every scalar Gather both speeds up the steady state and tightens the tail.
Striking secondary effect: the worst-case run with the fix (88.03 ms) is faster than the mean run without it (89.68 ms). Every single fixed inference over n=300 lands below the unfixed average.
Tests
Six new tests in `onnxruntime/test/providers/coreml/coreml_basic_test.cc` covering distinct code paths, exercised on both NeuralNetwork and MLProgram emitters where the dtype is supported:
- `GatherScalarIndicesAxis1` - axis=1, mid-rank squeeze
- `GatherScalarIndicesAxis0` - axis=0, leading-axis squeeze
- `GatherScalarIndicesNegativeAxis` - axis=-1, HandleNegativeAxis path
- `GatherScalarIndicesFloat16` - fp16 data (MLProgram only)
- `GatherScalarIndicesInt64Data` - int64 data, both formats
- `GatherScalarIndicesRank4Data` - rank-4 data, exercises the cap
Each verifies CoreML output against the CPU EP reference and asserts `ExpectedEPNodeAssignment::All`. The existing `GatherWithScalarIndices` test (dynamic-shape data) is updated only in its comment to reflect the new precise condition; it still exercises the CPU fall-back as before.
All pass locally on macOS 26.3 / M3 Max.