[CoreML EP] Add Sin and Cos unary ops#28596

Merged
yuslepukhin merged 3 commits into microsoft:main from maxwbuckley:coreml-sin-cos on May 27, 2026

@maxwbuckley
Contributor

@maxwbuckley maxwbuckley commented May 20, 2026

Summary

Lower ONNX Sin and Cos to the CoreML ML Program sin / cos elementwise ops
via the existing UnaryOpBuilder, registered in the op builder factory. Like
Erf / Round / Exp, these have no NeuralNetwork lowering
(UnaryFunctionLayerParams has no sin/cos), so IsOpSupportedImpl rejects them on
the NeuralNetwork format.
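
The decision described above can be modelled in a few lines. This is only an illustrative sketch in Python, not the actual C++ in unary_op_builder.cc: the table name `MLPROGRAM_UNARY_OPS` and the function `lower_unary` are hypothetical, and only the Sin/Cos mapping stated in this summary is included.

```python
from typing import Optional

# Hypothetical table: ONNX op type -> CoreML ML Program op name.
# Only the two ops added by this PR are listed.
MLPROGRAM_UNARY_OPS = {"Sin": "sin", "Cos": "cos"}


def lower_unary(op_type: str, ml_program: bool) -> Optional[str]:
    """Return the ML Program op name, or None if the node must fall back to CPU."""
    if not ml_program:
        # NeuralNetwork format: there is no unary layer for sin/cos, so reject.
        return None
    return MLPROGRAM_UNARY_OPS.get(op_type)


print(lower_unary("Sin", ml_program=True))
print(lower_unary("Cos", ml_program=False))
```

Returning `None` stands in for the support check rejecting the node, which is what leaves it to the CPU EP.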

Why

Sin / Cos form the sinusoidal timestep embedding of diffusion UNets. Supporting
them keeps that prologue on CoreML instead of splitting the graph — a tiny
Stable-Diffusion UNet goes from 2 CoreML partitions → 1, zero graph breaks with
this change alone.
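
For readers unfamiliar with the pattern, here is a minimal sketch of a sinusoidal timestep embedding, to show where the Sin and Cos nodes come from. The function name, the `max_period` default, and the cos-then-sin ordering are assumptions chosen for illustration; individual UNet implementations differ in these details.

```python
import math


def timestep_embedding(t: float, dim: int, max_period: float = 10000.0) -> list:
    """Embed a scalar timestep t into a vector of length dim (dim assumed even)."""
    half = dim // 2
    # Geometric frequency schedule: freq_i = max_period ** (-i / half).
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    # In an exported ONNX graph these two comprehensions become Cos and Sin nodes.
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]


emb = timestep_embedding(10.0, 8)
print(len(emb))
```

If either elementwise op is unsupported by the execution provider, this prologue is the part of the graph that lands in a separate partition.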

This PR is independent of the rest of the series (it touches only the unary
builder) and can be reviewed/merged in any order.

Tests (coreml_basic_test.cc)

  • SinCos_MLProgram — a Sin + Cos graph runs fully on CoreML and matches the CPU
    reference.
  • SinCosNeuralNetworkNotSupported — the same graph falls back to CPU on the
    NeuralNetwork format.

Doc: coreml_supported_mlprogram_ops.md lists Sin and Cos.

Series — CoreML EP coverage for transformer / diffusion graphs

Together with #28278 (scalar-Gather), the series takes BERT / GPT-2 / ViT /
diffusion-UNet graphs — tiny and full-size — from 2 CoreML partitions to 1, with
zero graph breaks.

Lower ONNX Sin and Cos to the CoreML ML Program `sin` / `cos` elementwise
ops via the existing UnaryOpBuilder, and register them in the op builder
factory. Like Erf/Round/Exp, these have no NeuralNetwork lowering
(UnaryFunctionLayerParams has no sin/cos), so IsOpSupportedImpl rejects
them on the NeuralNetwork format.

These appear in the timestep (sinusoidal position) embedding of diffusion
UNets; supporting them lets that prologue stay on CoreML instead of
splitting the graph into separate partitions.

Tests (coreml_basic_test.cc):
- SinCos_MLProgram: a Sin+Cos graph runs fully on CoreML and matches the
  CPU reference.
- SinCosNeuralNetworkNotSupported: the same graph falls back to CPU on the
  NeuralNetwork format.

Doc: coreml_supported_mlprogram_ops.md lists Sin and Cos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	onnxruntime/core/providers/coreml/builders/impl/unary_op_builder.cc
#	onnxruntime/core/providers/coreml/builders/op_builder_factory.cc
@maxwbuckley maxwbuckley marked this pull request as ready for review May 22, 2026 06:24
@maxwbuckley
Contributor Author

maxwbuckley commented May 22, 2026

@yuslepukhin Continuing the great work on making Mac ML on Onnxruntime amazing! Thank you :)

Contributor

Copilot AI left a comment

Pull request overview

Adds CoreML Execution Provider support for ONNX Sin and Cos by lowering them to CoreML ML Program elementwise sin/cos operations via the existing UnaryOpBuilder, with explicit fallback on the NeuralNetwork format where no equivalent unary layer exists.

Changes:

  • Register Sin and Cos in the CoreML op builder factory as unary ops.
  • Extend UnaryOpBuilder ML Program lowering to emit sin and cos, and explicitly reject Sin/Cos for the NeuralNetwork format.
  • Add CoreML EP tests validating ML Program execution and NeuralNetwork fallback, and document the newly supported ops.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| tools/ci_build/github/apple/coreml_supported_mlprogram_ops.md | Documents ML Program support for Sin and Cos. |
| onnxruntime/test/providers/coreml/coreml_basic_test.cc | Adds ML Program correctness test for Sin/Cos and a NeuralNetwork-format fallback test. |
| onnxruntime/core/providers/coreml/builders/op_builder_factory.cc | Registers Sin and Cos with the unary op builder. |
| onnxruntime/core/providers/coreml/builders/impl/unary_op_builder.cc | Implements ML Program lowering for Sin/Cos and gates them off for NeuralNetwork format support checks. |

yuslepukhin
yuslepukhin previously approved these changes May 26, 2026
Member

@yuslepukhin yuslepukhin left a comment

LGTM

Resolves conflict in coreml_basic_test.cc by keeping both the new
Sin/Cos tests (SinCos_MLProgram and SinCosNeuralNetworkNotSupported)
and the upstream Gather test additions.
Member

@yuslepukhin yuslepukhin left a comment

:shipit:

@yuslepukhin yuslepukhin enabled auto-merge (squash) May 27, 2026 17:29
@yuslepukhin yuslepukhin merged commit 71cfbb0 into microsoft:main May 27, 2026
92 of 93 checks passed
yuslepukhin pushed a commit that referenced this pull request Jun 2, 2026
### Summary

Two changes to the ML Program `Cast` builder:

1. **Accept `BOOL` as a source and target dtype** in `HasSupportedInputsImpl`.
   The ML Program `cast` op already handles bool, and `AddToModelBuilderImpl`
   already maps `to == BOOL`; only the input/output type gate omitted it.
2. **Move the "no preceding node" check after the ML Program early-return.**
   That check is legacy gating for the NeuralNetwork ArgMax-only path (which
   dereferences `InputEdgesBegin()`); on the ML Program path a `Cast` fed
   directly by a graph input is fine, and rejecting it forced needless CPU
   fallback.

### Why

This is the first of a **4-PR series** giving the CoreML EP the op coverage to
run transformer and diffusion graphs as a *single CoreML partition* instead of
fragmenting across CPU.

Transformer attention-mask graphs are a `Cast → GatherND → And → Where` chain
over **bool** tensors. A CoreML partition cannot have a bool input/output
(CoreML `MLMultiArray` has no bool type), so bool must stay *internal*, which
makes `Cast` (the int↔bool boundary) the prerequisite for the rest of the
series.

### Combined impact of the series

With all four PRs plus #28278 (scalar-`Gather`), every model below goes from 2
CoreML partitions to **1, with zero graph breaks**: the whole graph runs on
CoreML. Measured on an Apple M3 Max, ML Program format:

| Model | partitions (before → after) | CoreML vs CPU |
|-------|:---------------------------:|--------------:|
| BERT-large (340M)   | 2 → 1 | 7.3× (fp32) / 11.0× (fp16) |
| ViT-large (304M)    | 2 → 1 | 8.5× (fp32) / 10.3× (fp16) |
| GPT-2-large (774M)  | 2 → 1 | 11.4× (fp16) |
| SD-1.5 UNet (860M)  | 2 → 1 | 9.7× (fp16)  |

The op builders eliminate the graph breaks (deterministic); the speedups are
what CoreML already delivers once a model is no longer fragmented.

### Tests (`coreml_basic_test.cc`)

- `CastNonArgMaxNeuralNetworkNotSupported`: an `int64 → bool → float` cast
  chain falls back to CPU on the NeuralNetwork format, guarding the
  `IsOpSupportedImpl` reordering.

Positive `bool`-Cast coverage is in the dependent PRs: `Cast → GatherND → Cast`
(#28598's `GatherNDBoolData_MLProgram`) and `Cast → And → Cast` (#28597's
`And_MLProgram`). Both place a non-`Cast` op between the int↔bool casts and
check the result against the CPU EP. A *standalone*
`int64 → Cast(bool) → Cast(float)` round-trip can't be verified here because
CoreML's compiler fuses back-to-back `cast` ops and drops the bool clamp, so
the pattern needs that intervening op, which only the dependent PRs provide.
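
To make the "bool clamp" concrete, the sketch below contrasts the reference semantics of the two-step cast with a model of the fused behaviour described above. Both function names are hypothetical, and `fused_cast` is only an assumed model of what fusing the two casts would compute, not CoreML code.

```python
def cast_chain(x: int) -> float:
    """Reference: Cast(int64 -> bool) clamps to 0/1, then Cast(bool -> float)."""
    as_bool = x != 0  # casting to bool maps any nonzero value to True
    return 1.0 if as_bool else 0.0


def fused_cast(x: int) -> float:
    """Assumed model of the fused casts: the intermediate bool step is skipped."""
    return float(x)


for value in (0, 1, 5):
    print(value, cast_chain(value), fused_cast(value))
```

The two only disagree for inputs other than 0 and 1, which is why a test needs an op between the casts to keep the bool step observable.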

### Series — CoreML EP coverage for transformer / diffusion graphs

- **#28595 — Support bool Cast in ML Program** *(this PR —
prerequisite)*
- #28596 — Add Sin and Cos unary ops *(independent)*
- #28597 — Add Where and And builders *(depends on #28595)*
- #28598 — Add GatherND builder *(depends on #28595)*

Together with #28278 (scalar-`Gather`), the series takes BERT / GPT-2 / ViT /
diffusion-UNet graphs, tiny and full-size, from 2 CoreML partitions to 1, with
zero graph breaks.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yuslepukhin pushed a commit that referenced this pull request Jun 2, 2026
### Summary

New ML Program op builder: ONNX `GatherND` → CoreML `gather_nd`.

- `batch_dims` must be 0: the iOS15 `gather_nd` op has no `batch_dims`
  parameter, so `IsOpSupportedImpl` rejects other values.
- CoreML's `gather_nd` rejects a **bool `x`**, but transformer attention-mask
  graphs gather from bool tensors. For bool data the builder lowers the op as
  `cast(bool→int32) → gather_nd → cast(int32→bool)`; int32 represents 0/1
  exactly, so the round-trip is lossless.
- `validate_indices` is passed explicitly because the ML Program parser
  rejects `gather_nd` without it (the same quirk the `gather` builder works
  around).
- ML-Program-only; `IsOpSupportedImpl` rejects the NeuralNetwork format.
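
The bool round-trip in the list above can be sketched on plain Python lists. This is an illustration under simplifying assumptions (2-D data, `batch_dims=0`, full-depth index tuples so each lookup yields a scalar); `gather_nd` and `gather_nd_bool` are hypothetical helper names, not the builder's API.

```python
def gather_nd(data, indices):
    """GatherND with batch_dims=0: follow each index tuple into nested lists."""
    out = []
    for idx in indices:
        item = data
        for i in idx:
            item = item[i]
        out.append(item)
    return out


def gather_nd_bool(data, indices):
    """Bool data path: cast(bool -> int32), gather_nd, cast(int32 -> bool)."""
    as_int = [[int(v) for v in row] for row in data]  # True/False -> 1/0
    gathered = gather_nd(as_int, indices)             # gather on integer data
    return [bool(v) for v in gathered]                # 1/0 -> True/False


mask = [[True, False], [False, True]]
picks = [(0, 0), (1, 0), (1, 1)]
print(gather_nd_bool(mask, picks))
```

Because 0 and 1 survive the integer detour unchanged, the result is identical to gathering from the bool data directly.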

### Indices handling (CoreML `gather_nd` quirks)

Two CoreML behaviours that differ from ONNX are handled in the builder:

- **`indices` must be a constant initializer.** CoreML's `gather_nd`
  miscomputes the result for some data/indices shape combinations when
  `indices` is a runtime (non-constant) input: it returns slice 0 regardless
  of the actual index value. With a constant `indices` it is correct, so
  non-constant cases fall back to CPU. Constant indices are also the common
  case (e.g. transformer attention masks).
- **Negative indices are normalized at build time.** ONNX `GatherND` wraps a
  negative index by the corresponding data dim; CoreML's `gather_nd` does not
  and silently returns wrong values. Since `indices` is constant, the builder
  wraps any negatives into non-negative int32 indices while building the model
  (and requires the indexed data dims to be static, otherwise the node falls
  back to CPU). This was surfaced by fuzzing over randomized shapes/indices
  and verified on-device (negative indices, scalar outputs, ranks 2–4) against
  the CPU reference.
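
The build-time normalization amounts to adding the matching data dimension to each negative index component. A minimal sketch follows; the function name is hypothetical, and the out-of-range check is an added assumption, since the description above does not say how such indices are treated.

```python
def normalize_indices(indices, data_shape):
    """Wrap negative GatherND index components using the static data dims."""
    normalized = []
    for idx in indices:
        row = []
        for axis, i in enumerate(idx):
            dim = data_shape[axis]  # must be static to be known at build time
            if not -dim <= i < dim:
                # Assumption for this sketch: reject out-of-range indices.
                raise IndexError(f"index {i} out of range for dim {dim}")
            row.append(i + dim if i < 0 else i)
        normalized.append(tuple(row))
    return normalized


print(normalize_indices([(-1, 0), (1, -2)], (3, 4)))
```

This only works because `indices` is constant and the indexed dims are static, which matches the two fallback conditions listed above.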

### Depends on the bool-Cast PR

The bool-data `GatherND` test needs `Cast` as the `int ↔ bool`
producer/consumer so the bool tensors stay internal to the CoreML partition (a
partition cannot have bool I/O). This branch is **stacked on
`coreml-cast-bool`**: the `cb43b7c75f` commit in this PR is the bool-Cast PR
and drops from this diff once that one merges.

### Tests (`coreml_basic_test.cc`)

- `GatherND_MLProgram`: a float `GatherND` runs on CoreML and matches CPU.
- `GatherNDBoolData_MLProgram`: a `Cast → GatherND → Cast` bool chain runs
  fully on CoreML, exercising the cast round-trip lowering.
- `GatherNDNeuralNetworkNotSupported`: `GatherND` falls back to CPU on the
  NeuralNetwork format.
- `GatherNDBatchDimsNotSupported`: `GatherND` with `batch_dims=1` falls back to
  CPU.

Doc: `coreml_supported_mlprogram_ops.md` lists `GatherND`.

### Series — CoreML EP coverage for transformer / diffusion graphs

- #28595 — Support bool Cast in ML Program *(prerequisite)*
- #28596 — Add Sin and Cos unary ops *(independent)*
- #28597 — Add Where and And builders *(depends on #28595)*
- **#28598 — Add GatherND builder** *(this PR — depends on #28595)*

Together with #28278 (scalar-`Gather`), the series takes BERT / GPT-2 / ViT /
diffusion-UNet graphs, tiny and full-size, from 2 CoreML partitions to 1, with
zero graph breaks.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>