Sync with Microsoft ONNX Runtime - 04072026 #1184

Open
ai-fw-intg wants to merge 18 commits into ovep-develop from sync_msft_04072026


@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

titaiwangms and others added 18 commits June 30, 2026 20:40
… output indexing (microsoft#29264)

### Summary

The CUDA `EmbedLayerNormalization` and `SkipLayerNormalization` kernels
compute output write offsets (`row_index * hidden_size`) using 32-bit
arithmetic. For very large output tensors the element count can exceed
`INT32_MAX`, at which point the offset is no longer representable in 32
bits.

Every output write index in these kernels is a pure function of the
launch grid and `hidden_size` — there is no data-dependent write
indexing — so the maximum index is exactly `output_element_count - 1`,
which the host knows from the input shapes before launch. This PR adds a
**host-side guard** in each op's `ComputeInternal` that computes the
output element count in 64-bit arithmetic and returns a clear error when
it exceeds the supported 32-bit indexing range.

### Design

- **`EmbedLayerNormalization`** (`embed_layer_norm.cc`):
`output_element_count = (int64)batch_size * sequence_length *
hidden_size`, guarded with `ORT_RETURN_IF_NOT(... <= INT32_MAX, ...)`.
- **`SkipLayerNormalization`** (`skip_layer_norm.cc`):
`output_element_count = input->Shape().Size()` (output shares the input
shape), same guard.
- Kernels are **unchanged** — they keep the original int32 indexing, so
there is no extra register/occupancy cost in the hot path. This is pure
host-side validation.
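
The host-side guard described above can be sketched roughly as follows. This is a minimal illustration, not the ORT implementation: `BoundedElementCount` and `FitsInt32Indexing` are hypothetical names, and the real code uses `ORT_RETURN_IF_NOT` to return a status rather than a boolean.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <limits>
#include <optional>

// Hypothetical helper (not ORT code): multiplies dimensions in 64-bit and
// returns std::nullopt as soon as the running product would exceed `limit`.
std::optional<int64_t> BoundedElementCount(std::initializer_list<int64_t> dims,
                                           int64_t limit) {
  assert(limit >= 0);
  int64_t count = 1;
  for (int64_t d : dims) {
    if (d < 0) return std::nullopt;  // invalid dimension
    if (d == 0) {                    // empty tensor: nothing to index
      count = 0;
      continue;
    }
    if (count > limit / d) return std::nullopt;  // product would exceed limit
    count *= d;
  }
  return count;
}

// True when the largest output offset (element count - 1) is representable
// with the kernels' int32 indexing.
bool FitsInt32Indexing(int64_t batch_size, int64_t sequence_length,
                       int64_t hidden_size) {
  return BoundedElementCount({batch_size, sequence_length, hidden_size},
                             std::numeric_limits<int32_t>::max())
      .has_value();
}
```

A typical BERT-sized shape such as `(8, 512, 768)` passes, while `(4096, 1024, 1024)` produces 2³² elements and is rejected before any kernel launch.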

### Behavior

This **rejects** (rather than silently attempting) single-op LayerNorm
outputs larger than 2³¹ elements — a regime no real BERT-family model
produces (it would require a multi-GB single-op activation). For all
supported shapes there is no behavior or numeric change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e inference (microsoft#29268)

### Description

The `DecoderAttention` and `MultiHeadAttention` shape-inference functions
guarded population of their optional `present_key` (output 1) and
`present_value` (output 2) outputs with `getNumOutputs() > 1`, but then
wrote output index 2. `present_key` and `present_value` are produced as a
both-or-neither pair, so this change requires all three outputs (`> 2`) to
be present before populating them, matching the existing
`BaseGroupQueryAttention` (`>= 3`) and `EmbedLayerNorm` guards.

It also adds an output-index range check in
`InferenceContextImpl::getOutputType` so an
output index beyond the declared output count fails inference cleanly
instead of
indexing past the end of the outputs container, mirroring the existing
`DataPropagationContextImpl::getOutputType` and `getInputType` behavior.
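
A simplified model of the two fixes, assuming a toy context type: `ToyInferenceContext` and `PopulatePresentOutputs` are illustrative names only, and the real `InferenceContextImpl` works with ONNX `TypeProto` objects rather than strings.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for an inference context; the real ORT/ONNX types differ.
struct ToyInferenceContext {
  std::vector<std::string> output_shapes;  // one slot per declared output

  size_t getNumOutputs() const { return output_shapes.size(); }

  // Range-checked accessor: fails cleanly instead of indexing past the end.
  std::string* getOutputType(size_t index) {
    if (index >= output_shapes.size()) {
      throw std::out_of_range("output index " + std::to_string(index) +
                              " is not declared");
    }
    return &output_shapes[index];
  }
};

// Populates present_key (index 1) and present_value (index 2) only when
// both are declared. A `> 1` guard would still reach index 2.
bool PopulatePresentOutputs(ToyInferenceContext& ctx, const std::string& shape) {
  if (ctx.getNumOutputs() > 2) {
    *ctx.getOutputType(1) = shape;
    *ctx.getOutputType(2) = shape;
    return true;
  }
  return false;
}
```

With two declared outputs the function leaves the context untouched; with three it fills both present outputs. The range check in the accessor is a second line of defense for any caller that gets the guard wrong.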

### Motivation and Context

A model that declares fewer outputs than the optional present outputs
could previously
drive shape inference to access an output index that was not declared.
This makes the
guard consistent with the other attention-family contrib ops.

### Changes

- `onnxruntime/core/graph/contrib_ops/bert_defs.cc`: require all present
  outputs before populating `present_key`/`present_value` in
  `DecoderAttention` and `MultiHeadAttention`.
- `onnxruntime/core/graph/graph.cc`: add an output-index range check in
  `InferenceContextImpl::getOutputType`.
- `onnxruntime/test/contrib_ops/attention_optional_outputs_shape_inference_test.cc`:
  regression tests covering omitted optional present outputs, the 3-output
  positive cases, and the MHA/DMMHA two-output cases.
- Adds a contrib-op shape-inference output-index safety skill doc plus a
  one-line coding-convention note.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…microsoft#29397)

This pull request improves the handling of string tensors in the CPU
implementation of the Loop operator, ensuring correct memory management
and copy semantics for non-trivially-copyable types like `std::string`.
It also adds comprehensive unit tests to verify these behaviors,
especially for cases involving string scan outputs and loop-carried
variables.

**Enhancements for string tensor support:**

* Updated `ConcatenateCpuOutput` in `loop.cc` to properly detect string
tensors and use `std::copy` for concatenation, ensuring correct handling
of heap-allocated string payloads and avoiding unsafe byte-wise copying.
[[1]](diffhunk://#diff-2c8478657254a53c4ce09684960c925593395336e00c43ef672d9427722e3ff7R276-L282)
[[2]](diffhunk://#diff-2c8478657254a53c4ce09684960c925593395336e00c43ef672d9427722e3ff7R289-R304)
* Modified `OutputIterator::ZeroOutCurrent` in `scan_utils.h` to skip
zeroing for string tensors, as their elements are already
default-constructed and cannot be safely set with `memset`.
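
The reason element-wise copying matters for strings can be shown with a small sketch. `ConcatenateStringChunks` is a hypothetical stand-in for the concatenation step, not the code in `loop.cc`; it assumes each loop iteration contributes one chunk of strings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Concatenates per-iteration string outputs into one flat buffer.
// std::copy invokes std::string's copy assignment for every element, so each
// destination string owns its own heap payload. A byte-wise memcpy would
// duplicate internal pointers and lead to double frees or dangling reads.
std::vector<std::string> ConcatenateStringChunks(
    const std::vector<std::vector<std::string>>& chunks) {
  size_t total = 0;
  for (const auto& chunk : chunks) total += chunk.size();

  // Elements are default-constructed (empty) strings, so there is nothing
  // to zero out before writing.
  std::vector<std::string> out(total);
  auto dst = out.begin();
  for (const auto& chunk : chunks) {
    dst = std::copy(chunk.begin(), chunk.end(), dst);
  }
  return out;
}
```

Strings longer than the small-string-optimization threshold are the interesting case, because their characters live on the heap rather than inside the `std::string` object itself.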

**New and extended unit tests for string tensor scenarios:**

* Added tests to `loop_test.cc` covering:
  - String scan outputs, including long strings that exceed the
    small-string-optimization threshold and multi-element outputs.
  - Loop-carried string variables to ensure they are copied correctly.
  - Zero-iteration cases to confirm empty string scan outputs are handled
    without errors.
…ssion initialization (microsoft#29250)

This pull request introduces improved support and validation for
external data in tensor attributes, particularly ensuring that external
data in node attributes is properly validated and inlined, and that
in-memory references are correctly rejected. Additionally, it introduces
new tests to cover these scenarios, and refactors some utility functions
in the test code for clarity and consistency.

**External Data Handling and Validation:**

* Updated `Graph::ConvertInitializersIntoOrtValues()` in `graph.cc` to:
  * Use `InlinedHashSet` for tracking validated external data paths for
    efficiency.
  * Add a new step that validates and inlines external data in node tensor
    attributes, ensuring that only file-based external data is accepted and
    in-memory references are rejected. This guarantees all execution
    providers have uniform access to attribute data.
    [[1]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cL3842-R3842)
    [[2]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cL3906-R3971)
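
One way to picture the file-versus-in-memory distinction is a classifier over the external-data `location` string. This is a sketch under assumptions: the two marker strings are the in-memory address tags ORT uses for initializers, but whether the attribute validation keys on exactly these strings is assumed here, and `ClassifyExternalLocation` and `IsAcceptableAttributeLocation` are hypothetical names.

```cpp
#include <cassert>
#include <string_view>

enum class ExternalDataKind { kFile, kInMemory };

// Classifies an external-data location string. The two marker strings are
// ORT's in-memory address tags; everything else is treated as a file path.
ExternalDataKind ClassifyExternalLocation(std::string_view location) {
  constexpr std::string_view kLittleEndianTag = "*/_ORT_MEM_ADDR_/*";
  constexpr std::string_view kNativeEndianTag = "*/_ORT_NATIVE_ENDIAN_MEM_ADDR_/*";
  if (location == kLittleEndianTag || location == kNativeEndianTag) {
    return ExternalDataKind::kInMemory;
  }
  return ExternalDataKind::kFile;
}

// Node attributes may only reference file-based external data (assumed rule,
// mirroring the description above); empty locations are rejected as well.
bool IsAcceptableAttributeLocation(std::string_view location) {
  return !location.empty() &&
         ClassifyExternalLocation(location) == ExternalDataKind::kFile;
}
```

Rejecting in-memory markers in attributes means a raw process address can never be interpreted by an execution provider that did not create it.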

**Testing:**

* Added comprehensive tests in `label_encoder_test.cc` to verify:
  * Valid external data in tensor attributes is loaded and inlined.
  * In-memory external data references in node attributes are rejected.
  * Duplicate key handling and singleton default tensor behavior.
* Added a test in `tree_ensembler_test.cc` to ensure that in-memory
external data references in node attributes are rejected, preventing
invalid attribute usage.

**Test Utility Refactoring:**

* Refactored utility functions in `tree_ensembler_test.cc` and
  `treeregressor_test.cc`:
  * Renamed and standardized helper functions for array and string
    manipulations to improve code clarity (e.g., `MultiplyUpdateArray`,
    `MultiplyArraysValues`, `MultiplyUpdateArrayString`).
    [[1]](diffhunk://#diff-9bce50df70fddc092ce6fc5351812c621343f682a15ca178c34e6dd9415e9a39L47-R49)
    [[2]](diffhunk://#diff-9bce50df70fddc092ce6fc5351812c621343f682a15ca178c34e6dd9415e9a39L60-R62)
    [[3]](diffhunk://#diff-9bce50df70fddc092ce6fc5351812c621343f682a15ca178c34e6dd9415e9a39L90-R92)
    [[4]](diffhunk://#diff-9bce50df70fddc092ce6fc5351812c621343f682a15ca178c34e6dd9415e9a39L116-R127)
    [[5]](diffhunk://#diff-9bce50df70fddc092ce6fc5351812c621343f682a15ca178c34e6dd9415e9a39L174-R186)
    [[6]](diffhunk://#diff-9bce50df70fddc092ce6fc5351812c621343f682a15ca178c34e6dd9415e9a39L228-R230)
    [[7]](diffhunk://#diff-9bce50df70fddc092ce6fc5351812c621343f682a15ca178c34e6dd9415e9a39L243-R245)
    [[8]](diffhunk://#diff-08b3495816c68f145657ecff63d7b5f3d56813586ec62f7324c22977e70e336bR6-R13)
    [[9]](diffhunk://#diff-08b3495816c68f145657ecff63d7b5f3d56813586ec62f7324c22977e70e336bL23-R25)

**Test File Includes:**

* Added necessary includes for file utilities and path handling in test
files to support new test scenarios involving external data files.
[[1]](diffhunk://#diff-0db6b4a4d9a180daab3cc2eab5da4437a2a7a3a2e81f53119f5e6298bd3dc4a5R7-R8)
[[2]](diffhunk://#diff-9bce50df70fddc092ce6fc5351812c621343f682a15ca178c34e6dd9415e9a39R6-R7)
[[3]](diffhunk://#diff-08b3495816c68f145657ecff63d7b5f3d56813586ec62f7324c22977e70e336bR6-R13)
…t_address (microsoft#29248)

### Description

`vaip::process_ext_address` (in the VitisAI EP) decodes the in-memory
"external data" address marker that ORT plants on initializers whose
data lives
in an in-process buffer. It only matched the little-endian tag
`"*/_ORT_MEM_ADDR_/*"` (`kTensorProtoLittleEndianMemoryAddressTag`).
This PR
makes it also recognize the native-endian tag
`"*/_ORT_NATIVE_ENDIAN_MEM_ADDR_/*"`
(`kTensorProtoNativeEndianMemoryAddressTag`).

```cpp
if (file == "*/_ORT_MEM_ADDR_/*" || file == "*/_ORT_NATIVE_ENDIAN_MEM_ADDR_/*") {
  auto addr = reinterpret_cast<const char*>(offset);
  return {addr, size};
}
```

### Motivation and Context

Two existing changes combined to break VitisAI model compilation:

1. microsoft#27404 split the single in-memory marker into a little-endian and a
native-endian variant, and switched
`TensorToTensorProto(use_tensor_buffer=true)`
to emit the **native-endian** tag by default. Both
`HasExternalDataInMemory`
   and the data readers treat the two tags equivalently.

2. microsoft#28709 added an explicit
`ORT_ENFORCE(!utils::HasExternalDataInMemory(tensor), ...)`
   in `Graph::Graph`, rejecting any in-memory address marker found in a
   deserialized model protobuf.

Because `process_ext_address` still only recognized the little-endian
tag,
in-memory initializers (e.g. quantized weights moved out of the
`TensorProto`
during graph optimization) are no longer detected in
`vaip::model_clone`. They
fall through to the verbatim-copy branch and the native-endian marker is
carried
into the cloned model proto. When ORT then constructs the cloned
`Graph`, the
new enforce fires and aborts compilation:

```
Initializer 'onnx::Conv_1575_quantized' references an ORT in-memory address
marker, which is not allowed in a model protobuf.
```

This regression is observed on 1.27.0, where both changes are present
(1.25.x /
1.26.x have the native-endian tag but not the enforce, so the marker
leaked
silently without crashing).

Both tags encode the buffer address in the `offset` field; on
little-endian
hosts (the only platform this EP targets) the byte layout is identical,
so they
are decoded the same way. Recognizing both tags restores the previous
behavior
of converting these initializers into a lightweight reference inside
`model_clone`, so the in-memory marker never reaches `Graph::Graph`.
## Summary

Make cuDNN an optional runtime dependency for the CUDA Execution
Provider and CUDA Plugin EP. The build still uses cuDNN headers, but
provider binaries no longer directly depend on cuDNN shared libraries;
cuDNN is loaded lazily when enabled and available, while no-cuDNN runs
use native CUDA paths where available and report `NOT_IMPLEMENTED` for
kernels that still require cuDNN.
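
The lazy-load-with-fallback behavior can be illustrated with a platform-neutral sketch. Everything here is hypothetical: `LazyLibrary`, `RunKernel`, and `StatusCode` are not ORT types, and the `open_library` callback stands in for whatever the real loader does (for example `dlopen` or `LoadLibrary`). The sketch is single-threaded; a real loader would need synchronization.

```cpp
#include <cassert>
#include <functional>
#include <utility>

enum class StatusCode { kOk, kNotImplemented };

// Hypothetical lazy loader. `open_library` returns a library handle, or
// nullptr when the library is not installed.
class LazyLibrary {
 public:
  LazyLibrary(bool enabled, std::function<void*()> open_library)
      : enabled_(enabled), open_library_(std::move(open_library)) {}

  // Attempts the load at most once, on first use, and only when enabled.
  bool Available() {
    if (!attempted_) {
      attempted_ = true;
      if (enabled_) handle_ = open_library_();
    }
    return handle_ != nullptr;
  }

 private:
  bool enabled_;
  std::function<void*()> open_library_;
  bool attempted_ = false;
  void* handle_ = nullptr;
};

// Dispatch rule from the summary: use cuDNN when loaded, otherwise a native
// CUDA path when one exists, otherwise report NOT_IMPLEMENTED.
StatusCode RunKernel(LazyLibrary& cudnn, bool has_native_cuda_path) {
  if (cudnn.Available()) return StatusCode::kOk;     // cuDNN-backed path
  if (has_native_cuda_path) return StatusCode::kOk;  // native CUDA path
  return StatusCode::kNotImplemented;
}
```

Because the load is deferred to first use, a process that only runs kernels with native CUDA paths never needs the cuDNN shared library on its runtime path.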

This also adds CI to validate no-cuDNN runtime environments.

## Key Changes

| Area | Changes |
|---|---|
| CUDA EP runtime loading | Added a dynamic cuDNN loader and cuDNN
symbol trampolines so CUDA provider binaries can avoid a direct cuDNN
dependency. |
| Provider options | Added `enable_cudnn`; removed `cudnn_path` from
CUDA EP and CUDA Plugin EP provider configuration. |
| CUDA Plugin EP | Wired optional cuDNN behavior through plugin EP
config, kernel adapters, stream handles, and plugin utilities. |
| Python preload behavior | Updated Python CUDA preload handling so
cuDNN remains an optional dependency instead of an unconditional
import/runtime requirement. |
| Tests | Added/updated provider option coverage and CUDA Plugin EP
no-cuDNN mode using `ORT_TEST_CUDA_PLUGIN_NO_CUDNN=1`. |
| CI | Added Linux and Windows CUDA no-cuDNN workflows that build with
cuDNN headers, exclude cuDNN from the runtime path, verify no direct
cuDNN dependency, and run targeted tests. |
| Documentation | Added `docs/CUDA_cuDNN_Optional_Design.md` and updated
CUDA Plugin EP docs for no-cuDNN behavior and validation. |

## Testing

Validated locally on Linux CUDA 13:

- Rebuilt CUDA EP / CUDA Plugin EP with cuDNN headers available at build
time.
- Verified provider binaries have no direct cuDNN dependency:
  - `readelf -d ... | grep NEEDED | grep -i cudnn || echo "no cudnn DT_NEEDED"`
  - `ldd ... | grep -i cudnn || echo "no cudnn in ldd"`
- Ran CUDA Plugin EP no-cuDNN validation:
  - `bash .env/cuda_130_plugin_no_cudnn.sh --test_plugin`
  - Result: `Ran 87 tests`, `OK (skipped=17)`

Additional CI coverage is included for Linux and Windows no-cuDNN CUDA
validation.
…osoft#28989)

### Description

Builds out the standalone `model_package/` C library so a single library
covers the full lifecycle of an ONNX Runtime model package: inspection,
authoring, content-addressed shared assets, commit, prune, and
validation.
The library remains free of any dependency on ONNX Runtime itself.

**Public C API (`model_package/include/model_package.h`)**

- Lifecycle: `ModelPackage_Open` / `ModelPackage_New` /
`ModelPackage_Close`,
with `ModelPackageOpenOptions` controlling external-path access, symlink
  following, and strict unknown-field handling.
- Inspection: a POD `ModelPackageInfo` tree (`ModelPackage_Info`) plus
  by-name lookups for components, variants, and per-namespace
  `executor_info` entries, and round-trip JSON getters that preserve
  fields unknown to the current build.
- Path resolution: `ModelPackage_ResolveStringRef` implements the
package's
  resolution rules — relative paths anchored at a base directory,
  `sha256:<hex>[/sub/path]` for shared-asset content, and portable vs
  installed confinement (absolute paths and `..` segments only allowed
  under `layout: "installed"`).
- Shared assets: SHA-256 directory hashing,
`ModelPackage_AddSharedAsset`
  (with reproducible-build URI check and an optional `copy_in` staging
  mode), `ModelPackage_RemoveSharedAsset`, and
  `ModelPackage_ResolveAssetUri`. Assets under
  `<package_root>/shared_assets/` are auto-discovered at `Open`.
- Authoring: inline/external component setters, variant upsert/remove,
  per-namespace executor_info setters (inline and external), and
  package-level metadata / layout / `additional_metadata` setters.
  Mutations invalidate cached pointers in the mutated scope and its
  descendants.
- Commit / prune / validate: `ModelPackage_Commit` writes the in-memory
  model to disk either in place or to a fresh `dest_root` ("save as"),
  with `PRESERVE` or `DENSE` write modes; `ModelPackage_Prune` reclaims
  unreferenced files under `shared_assets/` and tracked orphan
  variant/component directories left behind by removals; and
  `ModelPackage_Validate` runs schema, path-reachability, asset-rehash,
  and unknown-field checks and returns a JSON report.
- Errors: opaque `ModelPackageStatus*` with a stable additive
  `ModelPackageErrorCode` enum (IO, schema, version, path confinement,
  asset missing, asset hash mismatch, not found, invalid arg, state).
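
The portable-versus-installed confinement rule from the path-resolution bullet could look roughly like this. It is a sketch of the rule as stated, not the library's resolver: `Layout` and `IsRefAllowed` are hypothetical names, and `sha256:` references are out of scope here.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

enum class Layout { kPortable, kInstalled };

// Returns true when a path reference is permitted under the given layout.
// Portable packages must stay self-contained, so rooted paths and any `..`
// segment are rejected; installed packages may use both.
bool IsRefAllowed(const std::string& ref, Layout layout) {
  if (layout == Layout::kInstalled) return true;

  const std::filesystem::path p(ref);
  if (p.has_root_path()) return false;  // absolute or drive-rooted path

  const std::filesystem::path parent("..");
  for (const auto& part : p) {
    if (part == parent) return false;  // would escape the base directory
  }
  return true;
}
```

Note that this version rejects every `..` segment in portable mode, even one that would resolve back inside the package, which matches the rule as worded above.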

**Internal layout**

The implementation is split into focused translation units:
`manifest_parser`, `model_package_impl`, `authoring`,
`commit_prune_validate`, `path_resolver`, `asset_hasher`, and an
in-tree `sha256`. Shared error plumbing lives in `status_impl.h`.

**ONNX Runtime integration (`onnxruntime/core/session/model_package/`)**

The ORT-side glue is wired onto the library through the C inspection
and path-resolution entry points. `model_package_context` now translates
the library's info tree into ORT-internal structs and parses the
`executor_info["ort"]` payload (`model_file`, `external_data`,
`session_options`, `provider_options`). When a variant declares
`external_data`, `CreateSessionForModelPackage` loads the model from a
memory mapping and passes the resolved folder to the session via
`kOrtSessionOptionsModelExternalInitializersFileFolderPath`, so external
initializers — including those backed by a shared asset — are picked up
at `Initialize` time.

The experimental `OrtModelPackageApi_*_SinceV28` C entries introduced in
microsoft#28990 are unchanged.

**Documentation and tests**

- `model_package/README.md` documents the on-disk layout, manifest and
  component schemas, shared-asset rules, path resolution, the authoring
  flow, and commit / prune / validate semantics.
- `onnxruntime/core/session/model_package/README.md` documents the ORT
  consumer-side glue: the `executor_info["ort"]` schema, the variant
  selection algorithm, the session-creation contract, and the registered
  experimental C entries.
- New library tests cover inspection, authoring, asset hashing, and
  commit (`model_package/tests/`). The ORT integration tests in
  `onnxruntime/test/autoep/test_model_package.cc` are reworked against
  the current C API surface.

### Motivation and Context

ORT needs a single library that owns the model package format end to
end — not just reading it, but producing it, validating it, and
maintaining it on disk with content-addressed shared assets.
Consolidating this behind one C API lets ORT, publisher tooling, and
third-party consumers share the same parser, path-resolution rules, and
on-disk invariants without each reimplementing them, and keeps the
library independent of the ORT session runtime.

---------

Co-authored-by: jambayk <jambayk@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: jambayk <jambay@github.com>
…cks (microsoft#29446)

This pull request strengthens validation and error handling for the
ConvTranspose operator in ONNX Runtime, particularly when using explicit
`output_shape` attributes. It adds comprehensive checks to prevent
inconsistent or invalid configurations, improves arithmetic safety to
guard against integer overflows, and introduces a suite of targeted unit
tests to verify these behaviors.

**Validation and Error Handling Improvements:**

* Added stricter input validation in
`ConvTransposeAttributes::ComputePadAndOutputShape` to ensure that all
relevant parameters (`output_shape`, input size, stride, kernel,
dilation, and output padding) are within valid ranges and to provide
clear error messages when they are not. This includes checks that all
values are positive and that output padding is non-negative.
* Added a consistency check to verify that the explicit `output_shape`
is compatible with the input dimensions and convolution parameters,
preventing buffer overruns and logical inconsistencies.

**Arithmetic Safety:**

* Updated `ComputeTotalPad` to use `SafeInt` for all intermediate
arithmetic, ensuring that integer overflows are detected and handled
safely instead of producing undefined behavior.
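
For reference, the ONNX `ConvTranspose` specification defines the per-axis total padding for an explicit `output_shape` as `stride * (input_size - 1) + output_padding + ((kernel - 1) * dilation + 1) - output_shape`. The sketch below evaluates that formula with explicit overflow checks. It uses hand-written checked helpers instead of `SafeInt`, and `CheckedMul`, `CheckedAdd`, and `TotalPad` are hypothetical names rather than ORT functions.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

constexpr int64_t kMax = std::numeric_limits<int64_t>::max();

// Overflow-checked helpers for non-negative operands.
std::optional<int64_t> CheckedMul(int64_t a, int64_t b) {
  if (a != 0 && b > kMax / a) return std::nullopt;
  return a * b;
}
std::optional<int64_t> CheckedAdd(int64_t a, int64_t b) {
  if (a > kMax - b) return std::nullopt;
  return a + b;
}

// Total padding along one axis per the ONNX ConvTranspose formula.
// Returns std::nullopt for out-of-range parameters or on overflow.
std::optional<int64_t> TotalPad(int64_t in_size, int64_t stride, int64_t kernel,
                                int64_t dilation, int64_t out_pad,
                                int64_t out_shape) {
  if (in_size <= 0 || stride <= 0 || kernel <= 0 || dilation <= 0 ||
      out_shape <= 0 || out_pad < 0) {
    return std::nullopt;
  }
  const auto strided = CheckedMul(stride, in_size - 1);
  const auto span = CheckedMul(kernel - 1, dilation);
  if (!strided || !span) return std::nullopt;
  const auto effective_kernel = CheckedAdd(*span, 1);
  const auto padded = CheckedAdd(*strided, out_pad);
  if (!effective_kernel || !padded) return std::nullopt;
  const auto full = CheckedAdd(*padded, *effective_kernel);
  if (!full) return std::nullopt;
  return *full - out_shape;  // both operands are positive, so no overflow
}
```

For `in_size = 3`, `stride = 2`, `kernel = 3`, `dilation = 1`, `out_pad = 0`, and `out_shape = 6`, the result is `4 + 0 + 3 - 6 = 1`. A negative result indicates an `output_shape` larger than the unpadded output, which a consistency check would need to handle separately.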

**Testing Enhancements:**

* Added a comprehensive set of unit tests for `ConvTranspose` with
explicit `output_shape`, including cases for invalid, inconsistent, and
overflow-prone configurations, as well as valid edge cases (e.g., 1D,
2D, and 3D, large batch sizes, group > 1, and cases requiring padding).
These tests verify that invalid configurations are rejected and that
valid ones work as expected.

These changes collectively improve the robustness, correctness, and
maintainability of the ConvTranspose operator's implementation and its
handling of explicit output shapes.
…osoft#29450)

Newer Homebrew versions refuse to load formulae from untrusted
third-party taps. This adds `brew trust wix/brew` after tapping to allow
the subsequent `brew install applesimutils` to succeed in React Native
CI.

**Changes:**
- `.github/workflows/react_native.yml`
- `tools/ci_build/github/azure-pipelines/templates/react-native-ci.yml`

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
This PR introduces minor development patches as listed below.
- Fix the OVIR EpCtx model export following the changes introduced by
  microsoft#28725.
- Add a test for OVIR EpCtx export on a sample model.
- Update the OV Toolkit version to 2026.2.
- Add a test for the OVEP workload type.

---------

Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: Kotomi-Du <yaru.du@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Mikhail Dvoretckii <mikhail.dvoretckii@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: ai-fw-intg <sys_ai_fw_intg@intel.com>
Co-authored-by: Rajeev Sekar <rajeev.sekar@intel.com>
Co-authored-by: RajeevSekar <117911837+RajeevSekar@users.noreply.github.com>
Co-authored-by: Nazanin Beheshti <nazanin.beheshti@intel.com>
…29451)

### Description

Add a batched GEMV for the small-M range (M = 2..16) of CUDA
`MatMulNBits`
(4-bit and 8-bit), using the standard `[N, blocks, blob]` weight layout
with
**no prepacking**.

Previously, `MatMulNBits` had a fast single-row GEMV for M = 1 (decode),
but
for M > 1 it fell back to full weight dequantization + cuBLAS, which
dequantizes the entire weight matrix regardless of M. That fallback is
therefore flat across small M and dominates latency at the row counts
that
matter for multi-row decode.

The new half/bf16 path uses a `CtaM x CtaN` register-tiled kernel that
streams
the quantized weight once per block row and reuses each activation load
across
columns, so latency scales with M. M = 1 decode is unchanged. No
prepacking is
used, so there is no extra resident weight memory and no GEMM tactic
profiling
at session init.
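
The weight-reuse idea behind the batched path can be shown with a CPU reference. This is only a sketch of the access pattern, not the CUDA kernel: it assumes the `[N, blocks, blob]` layout with `block_size = 32`, 16-byte blobs holding two 4-bit values per byte (low nibble first), one `float` scale per block, and an implicit zero point of 8. `SmallMGemvInt4` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 32;  // matches the block_size used in the benchmarks

// Computes y[M x N] = a[M x K] * dequant(b)^T for 4-bit block-quantized b.
// Each weight block is dequantized once and reused for all M activation rows,
// so the weight traffic does not grow with M.
std::vector<float> SmallMGemvInt4(const std::vector<float>& a, int M, int K,
                                  const std::vector<uint8_t>& b,
                                  const std::vector<float>& scales, int N) {
  assert(K % kBlockSize == 0);
  const int blocks = K / kBlockSize;
  std::vector<float> y(static_cast<size_t>(M) * N, 0.0f);

  for (int n = 0; n < N; ++n) {
    for (int blk = 0; blk < blocks; ++blk) {
      const size_t block_index = static_cast<size_t>(n) * blocks + blk;
      const uint8_t* blob = &b[block_index * (kBlockSize / 2)];
      const float scale = scales[block_index];

      // Dequantize this block once (assumed zero point: 8).
      float w[kBlockSize];
      for (int i = 0; i < kBlockSize / 2; ++i) {
        w[2 * i] = static_cast<float>((blob[i] & 0x0F) - 8) * scale;
        w[2 * i + 1] = static_cast<float>((blob[i] >> 4) - 8) * scale;
      }

      // Reuse the dequantized block for every activation row.
      for (int m = 0; m < M; ++m) {
        const float* a_row = &a[static_cast<size_t>(m) * K +
                                static_cast<size_t>(blk) * kBlockSize];
        float acc = 0.0f;
        for (int k = 0; k < kBlockSize; ++k) acc += a_row[k] * w[k];
        y[static_cast<size_t>(m) * N + n] += acc;
      }
    }
  }
  return y;
}
```

The dequantize-then-cuBLAS fallback, by contrast, materializes the whole dequantized matrix before multiplying, which is why its cost is flat across small M.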

Also adds `profile_matmul_nbits.py` (same style as
`profile_qmoe_gemv.py`,
parseable with `parse_nsys.py`) and a docs experiment log.

**Before (main: M > 1 dequant + cuBLAS) vs After (batched small-M
GEMV)** —
A100, block_size 32, fp16, average op latency in microseconds (lower is
better).

#### 4-bit (M = 2..16)

Before:

| matrix  | K     | N      | M=1   | M=2    | M=4    | M=8    | M=16   |
|---------|-------|--------|-------|--------|--------|--------|--------|
| qkv     | 4096  | 4096   | 25.9  | 77.2   | 69.3   | 69.5   | 69.9   |
| o_proj  | 4096  | 4096   | 23.2  | 69.2   | 69.1   | 69.3   | 69.7   |
| gate_up | 4096  | 12288  | 39.3  | 172.0  | 172.1  | 172.1  | 172.4  |
| down    | 12288 | 4096   | 38.8  | 174.8  | 175.1  | 175.4  | 175.6  |
| lm_head | 4096  | 151936 | 301.6 | 1868.0 | 1871.1 | 1877.8 | 1885.1 |

After:

| matrix  | K     | N      | M=1   | M=2   | M=4   | M=8   | M=16   |
|---------|-------|--------|-------|-------|-------|-------|--------|
| qkv     | 4096  | 4096   | 25.4  | 30.1  | 32.3  | 43.2  | 70.0   |
| o_proj  | 4096  | 4096   | 22.6  | 26.4  | 28.1  | 36.7  | 58.7   |
| gate_up | 4096  | 12288  | 37.5  | 43.5  | 49.5  | 71.9  | 125.9  |
| down    | 12288 | 4096   | 37.9  | 47.4  | 52.9  | 76.8  | 150.1  |
| lm_head | 4096  | 151936 | 300.7 | 329.9 | 424.1 | 635.3 | 1226.9 |

Speedup (before / after):

| matrix  | M=2   | M=4   | M=8   | M=16  |
|---------|-------|-------|-------|-------|
| qkv     | 2.56x | 2.15x | 1.61x | 1.00x |
| o_proj  | 2.62x | 2.46x | 1.89x | 1.19x |
| gate_up | 3.95x | 3.48x | 2.39x | 1.37x |
| down    | 3.69x | 3.31x | 2.28x | 1.17x |
| lm_head | 5.66x | 4.41x | 2.96x | 1.54x |

#### 8-bit (M = 2..5)

8-bit weights are twice the bytes of 4-bit and the GEMV runs on CUDA
cores, so
it crosses over to the dequantize + cuBLAS (tensor-core) fallback at a
lower M;
the batched path covers M = 2..5 and M >= 6 keeps the fallback.

Before:

| matrix  | K     | N      | M=1   | M=2    | M=3    | M=4    | M=5    |
|---------|-------|--------|-------|--------|--------|--------|--------|
| qkv     | 4096  | 4096   | 36.2  | 80.6   | 72.9   | 72.8   | 73.3   |
| o_proj  | 4096  | 4096   | 31.1  | 73.2   | 72.7   | 73.6   | 73.4   |
| gate_up | 4096  | 12288  | 63.5  | 184.4  | 184.2  | 184.1  | 184.3  |
| down    | 12288 | 4096   | 67.8  | 187.3  | 187.4  | 187.8  | 188.0  |
| lm_head | 4096  | 151936 | 535.9 | 2025.0 | 2025.9 | 2028.1 | 2029.8 |

After:

| matrix  | K     | N      | M=1   | M=2   | M=3   | M=4    | M=5    |
|---------|-------|--------|-------|-------|-------|--------|--------|
| qkv     | 4096  | 4096   | 36.2  | 46.1  | 48.7  | 57.9   | 67.0   |
| o_proj  | 4096  | 4096   | 31.6  | 39.5  | 48.6  | 57.9   | 67.1   |
| gate_up | 4096  | 12288  | 63.4  | 80.9  | 104.6 | 128.9  | 152.6  |
| down    | 12288 | 4096   | 68.0  | 96.2  | 119.3 | 146.4  | 172.0  |
| lm_head | 4096  | 151936 | 536.3 | 647.0 | 896.9 | 1157.1 | 1420.1 |

Speedup (before / after):

| matrix  | M=2   | M=3   | M=4   | M=5   |
|---------|-------|-------|-------|-------|
| qkv     | 1.75x | 1.50x | 1.26x | 1.09x |
| o_proj  | 1.85x | 1.50x | 1.27x | 1.09x |
| gate_up | 2.28x | 1.76x | 1.43x | 1.21x |
| down    | 1.95x | 1.57x | 1.28x | 1.09x |
| lm_head | 3.13x | 2.26x | 1.75x | 1.43x |

### Motivation and Context

The small-M (M = 2..16) regime is exactly what speculative decoding hits
when
verifying a block of draft tokens: each step runs the target model on a
handful
of rows rather than a single token. With the previous dispatch, that
step paid
the full-dequant + cuBLAS cost (flat ~172 us for an MLP projection, ~1.9
ms for
the lm_head even at M = 2), which erased much of the
speculative-decoding
speedup over greedy decoding. Routing these row counts through a batched
GEMV
that scales with M (2.6-5.7x faster at M = 2, at or above parity through
M = 16)
restores the benefit, without the resident-memory and session-init costs
of a
prepacked weight layout.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pport Gemma4 (microsoft#29236)

## Summary
This adds indirect dispatch support for WebGPU flash attention to enable
graph capture for Gemma4. When graph capture is on, `total_seqlen` is
GPU-resident and cannot be read on the CPU, so dispatch group sizes must
be computed on the GPU.

`CopyKVCache` normally prepares the indirect dispatch buffer as a side
effect. For Gemma4 kv_empty layers (shared KV layers that skip
`CopyKVCache`), a dedicated `PrepareIndirectDispatchProgram` fills the
indirect buffer instead, avoiding `dispatch(0)` crashes.

- Adds `PrepareIndirectDispatchProgram`: a single-thread GPU kernel
  that reads `total_sequence_length_input[0]` and writes 3 x uint32
  dispatch dims to `indirect_buffer`, matching the logic in
  `CopyKVCacheProgram`
- `use_indirect_dispatch` is gated on `context.IsGraphCaptureEnabled()`:
  indirect dispatch is only needed when `total_seqlen` is GPU-resident;
  when graph capture is off, CPU-side dispatch sizing works correctly even
  for kv_empty layers
- `PrepareIndirectDispatchProgram` takes `total_sequence_length_input`
  (the global max, batch-safe) rather than `seqlen_k` (per-batch index 0,
  unsafe for batch > 1)
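
As a CPU-side model of what the single-thread kernel writes, the three dispatch dimensions can be derived from the total sequence length by a ceiling division. This is a sketch under assumptions: the tile size, the meaning of the second dimension, and the function name `IndirectDispatchDims` are all hypothetical, and the real shader writes into a GPU buffer rather than returning a value.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Computes the 3 x uint32 values an indirect dispatch buffer would hold.
// x = number of tiles covering the sequence, y = caller-supplied group count
// (for example, heads), z = 1. The x/y/z assignment is an assumption.
std::array<uint32_t, 3> IndirectDispatchDims(uint32_t total_sequence_length,
                                             uint32_t tile_size,
                                             uint32_t groups_y) {
  assert(tile_size > 0);
  // Ceiling division written to avoid overflow in (n + tile_size - 1).
  const uint32_t tiles = total_sequence_length / tile_size +
                         (total_sequence_length % tile_size != 0 ? 1u : 0u);
  return {tiles, groups_y, 1u};
}
```

Doing this arithmetic on the GPU is what allows a captured graph to be replayed with a different `total_seqlen` without the CPU ever reading the value.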

## Changes
- `flash_attention.cc`:
  - New `PrepareIndirectDispatchProgram::GenerateShaderCode`: reads
    `total_sequence_length_input[0]`, computes tile count, calls
    `populate_indirect_dispatch_buffer`; identical logic to
    `CopyKVCacheProgram`'s indirect dispatch block
  - `use_indirect_dispatch` gated on
    `seqlen_k != nullptr && total_seqlen != nullptr && context.IsGraphCaptureEnabled()`
  - kv_empty path calls `PrepareIndirectDispatchProgram` when
    `use_indirect_dispatch` is true (i.e., under graph capture)
- `flash_attention.h`: Added `PrepareIndirectDispatchProgram` class
  declaration

## Test plan
- [x] Verified with Gemma4 INT4 model: ~95-105 tok/s with GC=ON vs ~75
  tok/s with GC=OFF
- [x] Tested multiple prompts (short/long) with graph capture enabled:
  no crashes, correct output
- [x] Verified warmup (capture run) followed by replay produces
  consistent results
- [x] Unit test: `WebGPU_SharedKV_IndirectDispatchForGraphCapture`
  exercises the full ORT graph capture simulation path end-to-end:
  - Model built via the Graph API (opset 17) with all GQA inputs declared
    as proper graph inputs
  - All inputs allocated as GPU-resident tensors via
    `InferenceSession::GetAllocator` + `IOBinding`; `total_seqlen` is
    a real `WGPUBuffer` so `PrepareIndirectDispatchProgram` reads it
    from GPU
  - WebGPU EP registered with `kEnableGraphCapture=ON`; first `Run`
    captures, second `Run` replays
  - Replay output copied GPU→CPU and compared against a CPU reference to
    verify correctness
  - Covers the kv_empty branch specifically: `key`/`value` have
    `sequence_length=0`, triggering the `PrepareIndirectDispatchProgram`
    code path (CopyKVCache is skipped for kv_empty layers)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…dation (microsoft#29462)

### Description

Two copy-paste typos in `DynamicQuantizeLSTM::Compute`
(`dynamic_quantize_lstm.cc`) cause the recurrence-weight zero-point and
scale to be validated against the wrong tensor's shape:

1. L181: `R_zp_shape = w_zp->Shape()` → `r_zp->Shape()`.
   `ZeroPointCheck` then iterates `w_zp`'s element count over the smaller
   `r_zp` tensor, reading past it (OOB read).
2. L188: `WeightCheck(W_scale_shape, R_scale)` →
   `WeightCheck(R_scale_shape, R_scale)`. The recurrence scale shape is
   validated against the input scale shape instead of its own.
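
The bug class in item 1 can be shown with a small standalone sketch. `FakeTensor`, `ElementCount`, and `ShapeFitsBuffer` are hypothetical stand-ins invented for this example; they are not the ORT tensor type or the real `ZeroPointCheck`. The point is only that an element count derived from another tensor's shape can exceed the buffer actually being read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tensor: a shape plus a flat data buffer.
struct FakeTensor {
  std::vector<int64_t> shape;
  std::vector<uint8_t> data;
};

// Number of elements implied by a shape.
inline size_t ElementCount(const std::vector<int64_t>& shape) {
  size_t count = 1;
  for (int64_t dim : shape) {
    assert(dim >= 0);
    count *= static_cast<size_t>(dim);
  }
  return count;
}

// True when looping over `shape`'s element count stays inside `tensor.data`.
// Passing another tensor's shape here is the copy-paste mistake being fixed.
inline bool ShapeFitsBuffer(const std::vector<int64_t>& shape,
                            const FakeTensor& tensor) {
  return ElementCount(shape) <= tensor.data.size();
}
```

Checking each tensor against its own shape, as the fix does, keeps the loop bound and the buffer size in agreement.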




Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description

Fixes an out-of-bounds (OOB) read in the CPU `GroupQueryAttention` (GQA)
operator. During token generation (decode), `ConcatStateChunkGQA` copies
`seqlens_k + 1 - sequence_length` rows out of the past key/value
buffers. The existing validation only bounded the *present* buffer write
(`seqlens_k < present_kv_seqlen`), but the present buffer can be larger
than the past buffer when `total_sequence_length` exceeds the past
sequence length. A large `seqlens_k` combined with a small past buffer
therefore read past the end of the past key/value tensors.

### Motivation and Context

`present_kv_seqlen = max(total_sequence_length, past_sequence_length)`,
so the pre-existing `seqlens_k < present_kv_seqlen` check does not bound
the past-side read when `total_sequence_length > past_sequence_length`.
With a crafted `seqlens_k`, the decode path in `ConcatStateChunkGQA`
reads `seqlens_k + 1 - sequence_length` rows from the smaller past
buffer, causing an OOB read.

### Key Changes

| File | Change |
|---|---|
| `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc` | Add a
per-batch bound in the `seqlens_k` validation block: for the decode case
(`past_kv_seqlen > 0 && kv_sequence_length != 0 && !is_first_prompt`),
reject when `seqlens_k + 1 - sequence_length > past_kv_seqlen`. |
| `onnxruntime/test/contrib_ops/group_query_attention_op_test.cc` | Add
regression test `SeqlensKExceedsPastBuffer_OOBRead` exercising a large
`seqlens_k` against a small past buffer. |

Shared KV (`kv_sequence_length == 0`) appends no new KV and its past
read is already bounded by the present-buffer check together with the
`total_sequence_length <= seqlen_past_kv_cache` enforcement in the
apply-attention paths, so it needs no additional check.
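
A minimal sketch of the two bounds described above, assuming a simplified per-batch view of the shapes. `GqaDims` and `IsSeqlensKValid` are hypothetical names for this example, not the code in `group_query_attention.cc`; only the two conditions are taken from this description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of the dimensions used by the validation.
struct GqaDims {
  int64_t sequence_length;
  int64_t kv_sequence_length;
  int64_t past_kv_seqlen;
  int64_t present_kv_seqlen;
  bool is_first_prompt;
};

inline bool IsSeqlensKValid(int64_t seqlens_k, const GqaDims& dims) {
  assert(dims.present_kv_seqlen >= 0 && dims.past_kv_seqlen >= 0);
  // Pre-existing check: bounds the write into the present buffer.
  if (seqlens_k >= dims.present_kv_seqlen) {
    return false;
  }
  // New check: in the decode case, the rows copied out of the past buffer
  // must not exceed the past buffer's length.
  const bool is_decode = dims.past_kv_seqlen > 0 &&
                         dims.kv_sequence_length != 0 &&
                         !dims.is_first_prompt;
  if (is_decode &&
      seqlens_k + 1 - dims.sequence_length > dims.past_kv_seqlen) {
    return false;
  }
  return true;
}
```

With `past_kv_seqlen = 4` and `present_kv_seqlen = 16`, a `seqlens_k` of 10 passes the present-buffer check but is rejected by the past-buffer check, which is the case the regression test targets.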

### Testing Notes

- Build and run the GQA tests:
  ```
  cmake --build build/Linux/Debug --target onnxruntime_provider_test -j$(nproc)
  ./build/Linux/Debug/onnxruntime_provider_test --gtest_filter="*GroupQueryAttention*"
  ```
- All 41 CPU GQA tests pass, including the new
`SeqlensKExceedsPastBuffer_OOBRead` regression and the existing
shared-KV CPU cases.
- `lintrunner` reports no issues on the changed files.
…9448)

# PR: Fix negative-axis handling in ExpandDims shape inference

## Description

The `com.microsoft.ExpandDims` type/shape-inference function mishandled
negative `axis` values, which could lead to an out-of-bounds read during
graph resolution (`Graph::Resolve`). This PR corrects the axis
normalization and makes the scalar `axis` read robust to both tensor
encodings, and adds a regression test that exercises the shape-inference
path.

## Summary of Changes

### Fix

| File | Change |
|------|--------|
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Normalize a
negative `axis` against the output rank (`rank + axis + 1`) instead of
the off-by-two `rank + axis - 1`, so the insertion index stays within
`[0, rank]`. Read the scalar `axis` via the existing `ParseScalar`
helper, which handles both `raw_data` and `int32_data` encodings and
validates the element count. |

### Test

| File | Change |
|------|--------|
| `onnxruntime/test/contrib_ops/expand_dims_test.cc` | Add
`ExpandDimsTest.NegativeAxisConstInitializerShapeInference` plus a
`RunExpandDimsConstAxisTest` helper that supplies `axis` as a constant
initializer so the operator's shape-inference function is exercised (the
existing tests pass `axis` as a runtime input, which skips that path). |

## Details

- For an output of `rank + 1` dimensions, a negative `axis` must be
  normalized as `axis + (rank + 1)`. The previous `rank + axis - 1`
  formula produced a negative insertion index for the most-negative valid
  axes, which was then used to index the protobuf dimension list out of
  bounds.
- The axis value was previously read with `int32_data()[0]`. When the
  value is stored as `raw_data` (the common encoding for serialized models
  and the one produced by the test harness), `int32_data()` is empty and
  the access is out of bounds. `ParseScalar` decodes either encoding and
  validates the count.
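
The normalization in the first bullet can be sketched as follows. `NormalizeExpandDimsAxis` and `OldNormalizeFormula` are hypothetical helper names for this example, and the accepted range `[-(rank + 1), rank]` is an assumption derived from the formula above, not a copy of the code in `contrib_defs.cc`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: maps `axis` to an insertion index in [0, input_rank].
// The output has input_rank + 1 dimensions, so negative axes wrap against
// that output rank. Returns false for an out-of-range axis.
inline bool NormalizeExpandDimsAxis(int64_t axis, int64_t input_rank,
                                    int64_t& insertion_index) {
  assert(input_rank >= 0);
  const int64_t output_rank = input_rank + 1;
  if (axis < -output_rank || axis > input_rank) {
    return false;
  }
  insertion_index = axis < 0 ? axis + output_rank : axis;
  return true;
}

// The previous formula, kept only to show the off-by-two result.
inline int64_t OldNormalizeFormula(int64_t axis, int64_t input_rank) {
  return input_rank + axis - 1;
}
```

For an input of rank 3, `axis = -4` is the most negative valid value: the corrected formula yields index 0, while the previous one yields -2, which is the negative index that led to the out-of-bounds access.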

## Testing

- Built `onnxruntime_provider_test` and ran the ExpandDims suite:
  `./onnxruntime_provider_test --gtest_filter="ExpandDimsTest.*"`; all 6
  tests pass.
- Confirmed the new regression test fails (process aborts) without the
  fix and passes with it.
- Existing positive/negative out-of-range and kernel tests are
  unchanged.

## Checklist

- [x] Tests added/updated
- [ ] Documentation updated (if applicable)
- [x] No breaking changes
- [ ] CI passes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
… from candidate strings (microsoft#28387)

### Description
This PR adds a new EP API, `SelectBestModelCandidate`, that selects the
best model variant from a set of candidates described by key-value
metadata.

The EP evaluates each candidate's metadata against the given hardware
device and optional session options, and returns the index of the best
match.

**Key design points:**

- Each candidate is an `OrtKeyValuePairs` representing one model
variant. This future-proofs the API — additional metadata keys can be
added over time without changing the function signature.

- **Single-model variants (simple case):** The KVP contains a single
`ep_compatibility_info` key with the compatibility string from the ONNX
model metadata.

- **Multi-model variants:** When a variant bundles multiple sub-models
(e.g., prefill + decode in a GenAI scenario), the KVP uses indexed keys
so the EP can inspect each sub-model independently:
  - `num_models`: number of sub-models (e.g., "3")
  - `<i>.ep_compatibility_info`: compatibility string for sub-model *i*
    (required per sub-model)
  - `<i>.role`: role of sub-model *i*, e.g., "prefill", "decode"
    (optional)
  - `<i>.future_meaningful_info`: additional EP-meaningful metadata for
    sub-model *i* (optional)

A basic EP implementation validates all `<i>.ep_compatibility_info`
entries. An advanced implementation can also consider `role` or other
metadata for smarter ranking.

- This approach delegates variant selection entirely to the EP, which
has the domain knowledge to handle structurally mismatched variants
(different sub-model counts, different roles, etc.) without ORT needing
to understand model roles or compute aggregated scores.
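
To make the key layout above concrete, here is a rough sketch of the "basic EP implementation" case. It is not the ORT API: `Candidate` is a plain `std::map` standing in for `OrtKeyValuePairs`, `CollectCompatibilityInfo` and `SelectFirstCompatible` are hypothetical names, the sub-model index is assumed to start at 0, and the `is_compatible` callback stands in for whatever check the EP performs on a compatibility string.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for OrtKeyValuePairs: one model variant's metadata.
using Candidate = std::map<std::string, std::string>;

// Gathers every compatibility string of a candidate.
// Returns false when a required key is missing or `num_models` is invalid.
inline bool CollectCompatibilityInfo(const Candidate& candidate,
                                     std::vector<std::string>& infos) {
  infos.clear();
  const auto count_it = candidate.find("num_models");
  if (count_it == candidate.end()) {
    // Single-model variant: one un-indexed key.
    const auto it = candidate.find("ep_compatibility_info");
    if (it == candidate.end()) return false;
    infos.push_back(it->second);
    return true;
  }
  int num_models = 0;
  try {
    num_models = std::stoi(count_it->second);
  } catch (...) {
    return false;
  }
  if (num_models <= 0) return false;
  // Assumption: sub-model indices run from 0 to num_models - 1.
  for (int i = 0; i < num_models; ++i) {
    const auto it =
        candidate.find(std::to_string(i) + ".ep_compatibility_info");
    if (it == candidate.end()) return false;
    infos.push_back(it->second);
  }
  return true;
}

// Returns the index of the first candidate whose sub-models are all accepted
// by `is_compatible`, or -1 when none qualifies.
inline int SelectFirstCompatible(
    const std::vector<Candidate>& candidates,
    const std::function<bool(const std::string&)>& is_compatible) {
  assert(static_cast<bool>(is_compatible));
  std::vector<std::string> infos;
  for (size_t index = 0; index < candidates.size(); ++index) {
    if (!CollectCompatibilityInfo(candidates[index], infos)) continue;
    bool all_accepted = true;
    for (const std::string& info : infos) {
      if (!is_compatible(info)) {
        all_accepted = false;
        break;
      }
    }
    if (all_accepted) return static_cast<int>(index);
  }
  return -1;
}
```

An advanced implementation would rank the acceptable candidates, for example by `role`, instead of returning the first match.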

### Motivation and Context
The existing `ValidateCompiledModelCompatibilityInfo()` alone is not
sufficient for some EPs to determine the best compatible model when
there are multiple candidates. For example, an EP may support multiple
compilation modes (e.g., "speed optimized" vs "memory optimized") that
produce different compatibility strings. The EP can implement this
function to evaluate the candidate metadata and select the best
compatible variant based on its own criteria and the target device.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description

Use GoogleTest sharding (`GTEST_TOTAL_SHARDS` and `GTEST_SHARD_INDEX`)
to split up unit tests into multiple runs in Windows ASan CI build.

### Motivation and Context

AddressSanitizer runs out of memory. Splitting the tests into multiple
runs seems to mitigate it.
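
For readers unfamiliar with sharding, the idea is that each run sets `GTEST_TOTAL_SHARDS` to the number of runs and `GTEST_SHARD_INDEX` to its own index, and executes only its share of the tests. The sketch below illustrates one way such a split can work, using a round-robin rule. `TestsForShard` is a hypothetical function for this example, and the round-robin assignment is an assumption; GoogleTest decides the actual assignment internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative round-robin split: shard `shard_index` takes every
// `total_shards`-th test id, starting at its own index.
inline std::vector<size_t> TestsForShard(size_t num_tests,
                                         size_t total_shards,
                                         size_t shard_index) {
  assert(total_shards > 0 && shard_index < total_shards);
  std::vector<size_t> selected;
  for (size_t id = shard_index; id < num_tests; id += total_shards) {
    selected.push_back(id);
  }
  return selected;
}
```

Every test lands in exactly one shard, so the shards together cover the full suite while each process handles only a fraction of it.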

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>