
[CoreML EP] Support Gather with scalar 'indices'#28278

Open
maxwbuckley wants to merge 9 commits into microsoft:main from maxwbuckley:coreml-gather-scalar

Conversation

Contributor

@maxwbuckley maxwbuckley commented Apr 29, 2026

Description

The CoreML `GatherOpBuilder` rejected rank-0 (scalar) `indices` because CoreML's `gather` requires rank-1+ indices and the obvious workaround would change the output rank — see the `// Don't allow scalar 'indices' input.` comment at `gather_op_builder.cc:90`. This PR performs the workaround internally:

```
reshape(indices, shape=[1])    -> indices_1d
gather(data, indices_1d, axis) -> data's shape, with the gather-axis dim = 1
squeeze(., axes=[axis])        -> the ONNX Gather output shape
```

…in both the MLProgram and NeuralNetwork emitters. The squeeze restores the original ONNX output rank, so caller-visible Gather semantics are unchanged. `reshape` is used rather than `expand_dims` because CoreML internally pads scalars and `expand_dims` on the padded tensor can push the apparent rank past the rank-5 limit on high-rank `data`.
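As an illustrative shape trace (this is not EP code, just the shape algebra the three steps perform), the lowering composes like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: traces the shapes produced by the three-step
// lowering of a scalar-indices Gather (reshape -> gather -> squeeze).
std::vector<int64_t> ScalarGatherOutputShape(const std::vector<int64_t>& data_shape,
                                             size_t axis) {
  // Step 1: reshape(indices, shape=[1]) — indices become rank-1, shape [1].
  // Step 2: gather(data, indices_1d, axis) — output takes data's shape,
  //         but the gathered axis collapses to size 1.
  std::vector<int64_t> gather_shape = data_shape;
  gather_shape[axis] = 1;
  // Step 3: squeeze(., axes=[axis]) — drop the size-1 axis, recovering
  //         the ONNX scalar-gather output rank (data rank - 1).
  gather_shape.erase(gather_shape.begin() + static_cast<std::ptrdiff_t>(axis));
  return gather_shape;
}
```

For example, data shape [2, 3, 4] with axis=1 goes through the intermediate [2, 1, 4] and comes out as [2, 4], matching what ONNX Gather with scalar indices produces directly.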

Restrictions:

  • `data` must have a fully static shape — we claim a static intermediate shape between gather and squeeze.
  • `data` rank capped at 4. The rank-5 case still trips CoreML's compiler with `Invalid rank: 6`, so we keep the conservative bound.

Dynamic-shape and rank-5+ scalar Gathers still fall back to CPU (preserving the existing `GatherWithScalarIndices` test, whose data is dynamic-shape).
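Taken together, the restrictions reduce to a predicate along these lines. This is a sketch of the logic only, not the actual `IsOpSupportedImpl`; dynamic dims are modeled as -1, and the dtype tags are ONNX `TensorProto_DataType` values:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the support predicate implied by the restrictions above.
// Not the real GatherOpBuilder::IsOpSupportedImpl.
bool ScalarIndicesGatherSupported(const std::vector<int64_t>& data_shape,
                                  int32_t indices_dtype) {
  // ONNX TensorProto_DataType tags: INT32 = 6, INT64 = 7.
  if (indices_dtype != 6 && indices_dtype != 7)
    return false;
  // 'data' must have a fully static shape: a static intermediate shape
  // is claimed between the gather and the squeeze.
  for (int64_t dim : data_shape)
    if (dim < 0) return false;
  // Rank capped at 4: the rank-5 case still trips CoreML's compiler
  // ("Invalid rank: 6"), so the conservative bound stays.
  return !data_shape.empty() && data_shape.size() <= 4;
}
```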

Fixes #28180.

Motivation

StyleGAN-family generators (StyleGAN, StyleGAN2, GFPGAN, …) select per-layer style codes with a scalar-index Gather. The resulting graph alternates between Gather and the rest of the generator, splitting the CoreML subgraph repeatedly.

On GFPGAN-1024 ([1, 3, 512, 512]), this PR moves all 16 scalar Gathers off CPU and the model lands on a single CoreML partition.

M3 Max, MLProgram, batch 1, 3 × 100-iter steady-state runs (n=300):

| | Partitions | Mean | StdDev | P50 | P95 | P99 | Max |
|---|---|---|---|---|---|---|---|
| origin/main | 2 | 89.68 ms | 3.67 | 87.82 | 96.71 | 105.00 | 108.01 |
| this PR | 1 | 81.77 ms | 1.85 | 80.97 | 85.98 | 87.24 | 88.03 |

Mean −8.8%, stddev −50%, P99 −17%, max −18.5% — eliminating the CPU↔CoreML round-trip on every scalar Gather both speeds up the steady state and tightens the tail.

A striking secondary effect: the worst-case run with the fix (88.03 ms) is faster than the mean run without it (89.68 ms) — every one of the n=300 fixed inferences lands below the unfixed average.

Tests

Six new tests in `onnxruntime/test/providers/coreml/coreml_basic_test.cc` covering distinct code paths, exercised on both NeuralNetwork and MLProgram emitters where the dtype is supported:

  • `GatherScalarIndicesAxis1` — axis=1, mid-rank squeeze.
  • `GatherScalarIndicesAxis0` — axis=0, leading-axis squeeze.
  • `GatherScalarIndicesNegativeAxis` — axis=-1, exercises `HandleNegativeAxis`.
  • `GatherScalarIndicesFloat16` — fp16 data (MLProgram only, as per `HasSupportedInputsImpl`).
  • `GatherScalarIndicesInt64Data` — int64 data, both formats.
  • `GatherScalarIndicesRank4Data` — rank-4 data, exercises the supported maximum.

Each verifies CoreML output against the CPU EP reference and asserts `ExpectedEPNodeAssignment::All`. The existing `GatherWithScalarIndices` test (dynamic-shape data) is updated only in its comment to reflect the new precise condition; it still exercises the CPU fall-back as before.

All pass locally on macOS 26.3 / M3 Max.

@maxwbuckley maxwbuckley force-pushed the coreml-gather-scalar branch from ee07752 to 28796b4 Compare April 29, 2026 21:26
The CoreML GatherOpBuilder rejected rank-0 (scalar) indices because
CoreML's `gather` requires rank-1+ indices and the workaround would
change the output rank. Fixes microsoft#28180 by performing the workaround
internally: when indices are scalar, the builder now emits

  reshape(indices, shape=[1])     ->  indices_1d
  gather(data, indices_1d, axis)  ->  shape data_shape with axis=1
  squeeze(., axes=[axis])         ->  ONNX gather output shape

in both the MLProgram and NeuralNetwork emitters. We use `reshape`
rather than `expand_dims` because CoreML internally pads scalars and
expand_dims on the padded tensor can push the apparent rank past
CoreML's rank-5 limit.

Restrictions:
  - 'data' must have a fully static shape (we claim a static
    intermediate shape between gather and squeeze).
  - 'data' rank capped at 4 (the rank-5 case still trips CoreML's
    compiler with "Invalid rank: 6", so we keep the conservative
    boundary).

Dynamic-shape and rank-5+ scalar Gather still falls back to CPU.

Motivated by StyleGAN-family generators (e.g. GFPGAN), which select
per-layer style codes with a scalar-index Gather. Combined with
microsoft#28270 (pre-opset-13 Split), GFPGAN-1024 (input 1x3x512x512) goes
from 9 partitions / 534 CoreML nodes to 1 partition / 557 CoreML
nodes on M3 Max.

Adds six tests covering both NeuralNetwork and MLProgram emitters:
  GatherScalarIndicesAxis1         - axis=1, mid-rank squeeze
  GatherScalarIndicesAxis0         - axis=0, leading-axis squeeze
  GatherScalarIndicesNegativeAxis  - axis=-1, HandleNegativeAxis path
  GatherScalarIndicesFloat16       - fp16 data (MLProgram only)
  GatherScalarIndicesInt64Data     - int64 data, both formats
  GatherScalarIndicesRank4Data     - rank-4 data, exercises the cap

Updates the comment on the existing GatherWithScalarIndices test to
reflect that scalar Gather is now supported when 'data' is fully
static and rank-4 or below; the test continues to exercise the
dynamic-shape fall-back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley maxwbuckley force-pushed the coreml-gather-scalar branch from 28796b4 to 8d966dd Compare April 29, 2026 21:27
maxwbuckley and others added 8 commits May 7, 2026 10:57
Resolves conflict in coreml_basic_test.cc where this branch's
GatherScalarIndices* tests landed in the same file region as the
Split11/13/7 tests merged via microsoft#28270. Both sets are preserved
sequentially (Gather first, then Split).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four small follow-ups based on review-style scrutiny of the scalar
'indices' path:

1. Validate `indices` dtype in IsOpSupportedImpl. The MLProgram dispatch
   reads `indices_dtype` via GetType and silently defaulted to INT32 if
   the call failed. ONNX Gather's schema constrains indices to INT32 or
   INT64; reject anything else up front so the dispatch can trust the
   value (and ORT_RETURN_IF_NOT the GetType call as a defensive backstop).

2. Add a TODO above the data-rank-4 cap explaining the underlying CoreML
   compiler bug ("Invalid rank: 6" on rank-5 reshape+gather intermediates)
   and noting the cap can be lifted to 5 when Apple fixes it.

3. Document why the NN-format scalar path uses expand_dims rather than
   reshape, parallel to the existing comment on the MLProgram path. NN's
   expand_dims doesn't pad rank internally, so we don't hit the
   apparent-rank inflation that drove reshape+gather there.

4. Two new negative tests:
   - GatherScalarIndicesDynamicDataNotSupported — dynamic 'data' shape
     with scalar indices falls back (was rejected silently; now covered).
   - GatherScalarIndicesRank5DataNotSupported — rank-5 'data' with
     scalar indices falls back (the bug-driven cap above is now tested).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brief comment explaining that the static_cast normalises the proto-
generated enum (whose underlying type isn't guaranteed) to the
codebase's int32_t convention for ONNX dtype codes — and that the
"INT64" in the constant name refers to the dtype tag (7), not int64
data being narrowed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The proto enum *is* typed (`enum TensorProto_DataType : int` per
onnx-ml.pb.h), so the previous "underlying type isn't guaranteed" was
wrong. The actual reason for the cast is narrower: `int` and `int32_t`
are distinct C++ types (even when layout-compatible on the platforms
ORT targets), and the codebase uses `int32_t` for ONNX dtype codes.
Comment now states that accurately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two constexpr dtype tags only ever appeared in
`indices_dtype == kInt64 ? kInt64 : kInt32`, which was a no-op once
IsOpSupportedImpl gates indices to exactly INT32 or INT64 — the
expression simplifies to just `indices_dtype`. Drop the constants and
pass `indices_dtype` directly to AddIntermediateOperationOutput.

This also makes the dtype handling consistent with the surrounding
`output_dtype` line, which already uses an inline static_cast on the
proto enum without a named constant.

(Net effect: less code, less to read, no behavior change. Spotted in
review.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The other six positive scalar-indices tests use INT64 indices (the
PyTorch default), so the INT32-indices code path through both the new
dtype gating in IsOpSupportedImpl and the indices_dtype path-through
to the reshape's intermediate output was untested. Adds
GatherScalarIndicesInt32Indices to cover it on both NN and MLProgram
formats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tyle

The earlier rewrite of the Gather builder dropped the
"// coreml docs claims validate_indices is optional but in practice it
is required" comment and changed `const auto validate_indices = false;`
to `constexpr bool validate_indices = false;`. Both were drive-by
churn unrelated to the scalar-indices fix this PR is supposed to add.

The comment in particular is institutional context — without it, the
next reader sees `validate_indices = false` being passed via
AddScalarConstant and reasonably wonders why we bother (the CoreML docs
say it's optional). Restoring it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	onnxruntime/test/providers/coreml/coreml_basic_test.cc
@maxwbuckley
Contributor Author

@yuslepukhin thank you again for all your support here 👍

} else {
// gather output here has the data's rank (one more than ONNX scalar-gather output);
// squeeze the inserted axis to recover the original output shape.
std::vector<int64_t> gather_shape = data_shape;
Member

Prefer TensorShapeVector for inline storage.

// For scalar inputs, the input shape is modified from [] -> [1] before passing the input to CoreML.
// This won't work for Gather because the output shape depends on the `indices` input shape which could be a scalar.
// Currently, we expect the CoreML EP to only take the Shape node in this graph (Gather -> Shape).
// The CoreML EP supports scalar 'indices' for Gather only when the 'data' input has a fully
Member

Minor gap — rank-1 data with scalar indices: there is no test for e.g. data shape `{6}` with a scalar index (the output is scalar/rank-0). If CoreML's squeeze produces a scalar, this should work, but it's untested. The output-rank check `data_shape.size() + indices_shape.size() - 1 > 5` evaluates to 1 + 0 - 1 = 0, which passes, and the rank-4 cap also passes (1 <= 4), so the code would attempt it. Whether CoreML handles a rank-0 squeeze output correctly is the open question.

}
}

// Output rank = data_rank + indices_rank - 1. The rank-5 limit applies.
Member
@yuslepukhin May 14, 2026

Potential underflow in the output-rank check (line ~225 in the original): `data_shape.size() + indices_shape.size() - 1` — when indices is scalar (`indices_shape.size() == 0`) and data rank is 1, this is 1 + 0 - 1 = 0, which is fine. But with `size_t` arithmetic, if both were somehow 0 (which can't happen, since data rank >= 1 is checked), this would wrap around. Not a real bug, since data rank >= 1 is enforced by the ONNX schema and GetShape succeeds, but the unsigned arithmetic is worth noting.
This, however, depends on ONNX enforcement code.
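The unsigned-arithmetic concern can be made concrete with a small sketch (illustrative, not the EP's code); guarding the rank before subtracting keeps the `size_t` expression from wrapping:

```cpp
#include <cstddef>

// Output rank for Gather is data_rank + indices_rank - 1. With size_t
// operands, 0 + 0 - 1 would wrap to SIZE_MAX instead of -1, so check
// data_rank >= 1 (which the ONNX schema guarantees) before subtracting.
bool OutputRankWithinLimit(size_t data_rank, size_t indices_rank,
                           size_t limit) {
  if (data_rank == 0)
    return false;  // unreachable for valid ONNX Gather, but keeps the math safe
  return data_rank + indices_rank - 1 <= limit;
}
```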


Development

Successfully merging this pull request may close these issues.

[Performance][CoreML] GatherOpBuilder rejects rank-0 (scalar) indices, forcing CPU fallback for StyleGAN-family models

2 participants