Sync with Microsoft ONNX Runtime - 04072026 by ai-fw-intg · Pull Request #1184 · intel/onnxruntime

ai-fw-intg · 2026-07-03T20:34:26Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

… output indexing (microsoft#29264)

### Summary

The CUDA `EmbedLayerNormalization` and `SkipLayerNormalization` kernels compute output write offsets (`row_index * hidden_size`) using 32-bit arithmetic. For very large output tensors the element count can exceed `INT32_MAX`, at which point the offset is no longer representable in 32 bits.

Every output write index in these kernels is a pure function of the launch grid and `hidden_size`; there is no data-dependent write indexing. The maximum index is therefore exactly `output_element_count - 1`, which the host knows from the input shapes before launch.

This PR adds a **host-side guard** in each op's `ComputeInternal` that computes the output element count in 64-bit arithmetic and returns a clear error when it exceeds the supported 32-bit indexing range.

### Design

- **`EmbedLayerNormalization`** (`embed_layer_norm.cc`): `output_element_count = (int64)batch_size * sequence_length * hidden_size`, guarded with `ORT_RETURN_IF_NOT(... <= INT32_MAX, ...)`.
- **`SkipLayerNormalization`** (`skip_layer_norm.cc`): `output_element_count = input->Shape().Size()` (output shares the input shape), same guard.
- Kernels are **unchanged**: they keep the original int32 indexing, so there is no extra register/occupancy cost in the hot path. This is pure host-side validation.

### Behavior

This **rejects** (rather than silently attempting) single-op LayerNorm outputs of 2³¹ elements or more, a regime no real BERT-family model produces (it would require a multi-GB single-op activation). For all supported shapes there is no behavior or numeric change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
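The host-side guard described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `FitsInt32Indexing` is a hypothetical helper name, and the sketch assumes the dimensions are non-negative and small enough that their 64-bit product does not itself overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not an ORT function): returns true when a tensor of
// shape [batch_size, sequence_length, hidden_size] can be indexed with int32.
bool FitsInt32Indexing(int64_t batch_size, int64_t sequence_length,
                       int64_t hidden_size) {
  assert(batch_size >= 0 && sequence_length >= 0 && hidden_size >= 0);
  // Multiply in 64-bit so the product cannot wrap at 2^31.
  const int64_t output_element_count =
      batch_size * sequence_length * hidden_size;
  // The largest write index is output_element_count - 1, so the count itself
  // is compared against INT32_MAX, as in the guard described above.
  return output_element_count <= std::numeric_limits<int32_t>::max();
}
```

A caller would run this check before launching the kernel and return an error status when it fails, which keeps the kernel's int32 indexing untouched.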

…e inference (microsoft#29268)

### Description

The `DecoderAttention` and `MultiHeadAttention` shape-inference functions guarded population of their optional `present_key` (output 1) and `present_value` (output 2) outputs with `getNumOutputs() > 1`, but then wrote output index 2. `present_key` and `present_value` are produced as a both-or-neither pair, so this PR requires all three outputs (`> 2`) to be present before populating them, matching the existing `BaseGroupQueryAttention` (`>= 3`) and `EmbedLayerNorm` guards.

It also adds an output-index range check in `InferenceContextImpl::getOutputType` so an output index beyond the declared output count fails inference cleanly instead of indexing past the end of the outputs container, mirroring the existing `DataPropagationContextImpl::getOutputType` and `getInputType` behavior.

### Motivation and Context

A model that declares fewer outputs than the optional present outputs could previously drive shape inference to access an output index that was not declared. This makes the guard consistent with the other attention-family contrib ops.

### Changes

- `onnxruntime/core/graph/contrib_ops/bert_defs.cc`: require all present outputs before populating `present_key`/`present_value` in `DecoderAttention` and `MultiHeadAttention`.
- `onnxruntime/core/graph/graph.cc`: add an output-index range check in `InferenceContextImpl::getOutputType`.
- `onnxruntime/test/contrib_ops/attention_optional_outputs_shape_inference_test.cc`: regression tests covering omitted optional present outputs, the 3-output positive cases, and the MHA/DMMHA two-output cases.
- Adds a contrib-op shape-inference output-index safety skill doc plus a one-line coding-convention note.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
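The two fixes described above (the `> 2` guard and the range-checked accessor) can be illustrated with a small stand-in. `FakeInferenceContext` and `PopulatePresentOutputs` are hypothetical names invented for this sketch; they only mimic the shape of the real inference context and are not the ONNX or ORT classes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an inference context; NOT the ONNX/ORT class.
struct FakeInferenceContext {
  std::vector<std::string> output_types;  // one slot per declared output

  std::size_t getNumOutputs() const { return output_types.size(); }

  // Range-checked accessor: an undeclared index fails cleanly instead of
  // indexing past the end of the container.
  std::string* getOutputType(std::size_t index) {
    if (index >= output_types.size()) {
      throw std::out_of_range("output index " + std::to_string(index) +
                              " is out of range");
    }
    return &output_types[index];
  }
};

// present_key (1) and present_value (2) are a both-or-neither pair, so both
// are written only when all three outputs are declared.
bool PopulatePresentOutputs(FakeInferenceContext& ctx) {
  if (ctx.getNumOutputs() > 2) {
    *ctx.getOutputType(1) = "present_key";
    *ctx.getOutputType(2) = "present_value";
    return true;
  }
  return false;
}
```

With the earlier `> 1` guard, a two-output model would have reached the write to index 2; here that case is skipped by the guard, and a stray access would be caught by the accessor.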

…microsoft#29397)

This pull request improves the handling of string tensors in the CPU implementation of the Loop operator, ensuring correct memory management and copy semantics for non-trivially-copyable types like `std::string`. It also adds unit tests to verify these behaviors, especially for cases involving string scan outputs and loop-carried variables.

**Enhancements for string tensor support:**

* Updated `ConcatenateCpuOutput` in `loop.cc` to detect string tensors and use `std::copy` for concatenation, which handles heap-allocated string payloads correctly and avoids unsafe byte-wise copying.
* Modified `OutputIterator::ZeroOutCurrent` in `scan_utils.h` to skip zeroing for string tensors, as their elements are already default-constructed and cannot be safely set with `memset`.

**New and extended unit tests for string tensor scenarios:**

* Added tests to `loop_test.cc` covering:
  - String scan outputs, including long strings that exceed the small-string-optimization threshold and multi-element outputs.
  - Loop-carried string variables to ensure they are copied correctly.
  - Zero-iteration cases to confirm empty string scan outputs are handled without errors.
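The copy-semantics point above can be shown with a short sketch. `ConcatenateChunks` is a hypothetical helper written for illustration; it is not ORT's `ConcatenateCpuOutput`, and it uses plain `std::vector` buffers rather than tensors.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical helper illustrating the idea; not ORT's ConcatenateCpuOutput.
// Appends each per-iteration chunk to one flat output buffer.
template <typename T>
std::vector<T> ConcatenateChunks(const std::vector<std::vector<T>>& chunks) {
  std::size_t total = 0;
  for (const auto& chunk : chunks) total += chunk.size();

  std::vector<T> output(total);  // elements are default-constructed
  auto dst = output.begin();
  for (const auto& chunk : chunks) {
    // std::copy assigns element by element, so std::string payloads are
    // deep-copied. A byte-wise memcpy is only valid for trivially copyable T.
    dst = std::copy(chunk.begin(), chunk.end(), dst);
  }
  return output;
}

// The reason byte-wise copying is unsafe for strings.
static_assert(!std::is_trivially_copyable<std::string>::value,
              "std::string must not be copied with memcpy");
```

Because the output elements are default-constructed before assignment, there is also nothing to zero out first, which matches the reasoning given for skipping `memset` on string tensors.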

…ssion initialization (microsoft#29250)

This pull request improves support and validation for external data in tensor attributes: external data in node attributes is validated and inlined, and in-memory references are rejected. It also adds tests for these scenarios and refactors some utility functions in the test code for clarity and consistency.

**External Data Handling and Validation:**

* Updated `Graph::ConvertInitializersIntoOrtValues()` in `graph.cc` to:
  * Use `InlinedHashSet` for tracking validated external data paths for efficiency.
  * Add a new step that validates and inlines external data in node tensor attributes, ensuring that only file-based external data is accepted and in-memory references are rejected. This guarantees all execution providers have uniform access to attribute data.

**Testing:**

* Added tests in `label_encoder_test.cc` to verify:
  * Valid external data in tensor attributes is loaded and inlined.
  * In-memory external data references in node attributes are rejected.
  * Duplicate key handling and singleton default tensor behavior.
* Added a test in `tree_ensembler_test.cc` to ensure that in-memory external data references in node attributes are rejected, preventing invalid attribute usage.

**Test Utility Refactoring:**

* Refactored utility functions in `tree_ensembler_test.cc` and `treeregressor_test.cc`:
  * Renamed and standardized helper functions for array and string manipulations to improve code clarity (e.g., `MultiplyUpdateArray`, `MultiplyArraysValues`, `MultiplyUpdateArrayString`).

**Test File Includes:**

* Added necessary includes for file utilities and path handling in test files to support new test scenarios involving external data files.

…t_address (microsoft#29248)

### Description

`vaip::process_ext_address` (in the VitisAI EP) decodes the in-memory "external data" address marker that ORT plants on initializers whose data lives in an in-process buffer. It only matched the little-endian tag `"*/_ORT_MEM_ADDR_/*"` (`kTensorProtoLittleEndianMemoryAddressTag`). This PR makes it also recognize the native-endian tag `"*/_ORT_NATIVE_ENDIAN_MEM_ADDR_/*"` (`kTensorProtoNativeEndianMemoryAddressTag`).

```cpp
if (file == "*/_ORT_MEM_ADDR_/*" || file == "*/_ORT_NATIVE_ENDIAN_MEM_ADDR_/*") {
  auto addr = reinterpret_cast<const char*>(offset);
  return {addr, size};
}
```

### Motivation and Context

Two existing changes combined to break VitisAI model compilation:

1. microsoft#27404 split the single in-memory marker into a little-endian and a native-endian variant, and switched `TensorToTensorProto(use_tensor_buffer=true)` to emit the **native-endian** tag by default. Both `HasExternalDataInMemory` and the data readers treat the two tags equivalently.
2. microsoft#28709 added an explicit `ORT_ENFORCE(!utils::HasExternalDataInMemory(tensor), ...)` in `Graph::Graph`, rejecting any in-memory address marker found in a deserialized model protobuf.

Because `process_ext_address` still only recognized the little-endian tag, in-memory initializers (e.g. quantized weights moved out of the `TensorProto` during graph optimization) are no longer detected in `vaip::model_clone`. They fall through to the verbatim-copy branch and the native-endian marker is carried into the cloned model proto. When ORT then constructs the cloned `Graph`, the new enforce fires and aborts compilation:

```
Initializer 'onnx::Conv_1575_quantized' references an ORT in-memory address marker, which is not allowed in a model protobuf.
```

This regression is observed on 1.27.0, where both changes are present (1.25.x / 1.26.x have the native-endian tag but not the enforce, so the marker leaked silently without crashing).

Both tags encode the buffer address in the `offset` field; on little-endian hosts (the only platform this EP targets) the byte layout is identical, so they are decoded the same way. Recognizing both tags restores the previous behavior of converting these initializers into a lightweight reference inside `model_clone`, so the in-memory marker never reaches `Graph::Graph`.

## Summary

Make cuDNN an optional runtime dependency for the CUDA Execution Provider and CUDA Plugin EP. The build still uses cuDNN headers, but provider binaries no longer directly depend on cuDNN shared libraries; cuDNN is loaded lazily when enabled and available, while no-cuDNN runs use native CUDA paths where available and report `NOT_IMPLEMENTED` for kernels that still require cuDNN. This also adds CI to validate no-cuDNN runtime environments.

## Key Changes

| Area | Changes |
|---|---|
| CUDA EP runtime loading | Added a dynamic cuDNN loader and cuDNN symbol trampolines so CUDA provider binaries can avoid a direct cuDNN dependency. |
| Provider options | Added `enable_cudnn`; removed `cudnn_path` from CUDA EP and CUDA Plugin EP provider configuration. |
| CUDA Plugin EP | Wired optional cuDNN behavior through plugin EP config, kernel adapters, stream handles, and plugin utilities. |
| Python preload behavior | Updated Python CUDA preload handling so cuDNN remains an optional dependency instead of an unconditional import/runtime requirement. |
| Tests | Added/updated provider option coverage and CUDA Plugin EP no-cuDNN mode using `ORT_TEST_CUDA_PLUGIN_NO_CUDNN=1`. |
| CI | Added Linux and Windows CUDA no-cuDNN workflows that build with cuDNN headers, exclude cuDNN from the runtime path, verify no direct cuDNN dependency, and run targeted tests. |
| Documentation | Added `docs/CUDA_cuDNN_Optional_Design.md` and updated CUDA Plugin EP docs for no-cuDNN behavior and validation. |

## Testing

Validated locally on Linux CUDA 13:

- Rebuilt CUDA EP / CUDA Plugin EP with cuDNN headers available at build time.
- Verified provider binaries have no direct cuDNN dependency:
  - `readelf -d ... | grep NEEDED | grep -i cudnn || echo "no cudnn DT_NEEDED"`
  - `ldd ... | grep -i cudnn || echo "no cudnn in ldd"`
- Ran CUDA Plugin EP no-cuDNN validation:
  - `bash .env/cuda_130_plugin_no_cudnn.sh --test_plugin`
  - Result: `Ran 87 tests`, `OK (skipped=17)`

Additional CI coverage is included for Linux and Windows no-cuDNN CUDA validation.

…osoft#28989)

### Description

Builds out the standalone `model_package/` C library so a single library covers the full lifecycle of an ONNX Runtime model package: inspection, authoring, content-addressed shared assets, commit, prune, and validation. The library remains free of any dependency on ONNX Runtime itself.

**Public C API (`model_package/include/model_package.h`)**

- Lifecycle: `ModelPackage_Open` / `ModelPackage_New` / `ModelPackage_Close`, with `ModelPackageOpenOptions` controlling external-path access, symlink following, and strict unknown-field handling.
- Inspection: a POD `ModelPackageInfo` tree (`ModelPackage_Info`) plus by-name lookups for components, variants, and per-namespace `executor_info` entries, and round-trip JSON getters that preserve fields unknown to the current build.
- Path resolution: `ModelPackage_ResolveStringRef` implements the package's resolution rules: relative paths anchored at a base directory, `sha256:<hex>[/sub/path]` for shared-asset content, and portable vs installed confinement (absolute paths and `..` segments only allowed under `layout: "installed"`).
- Shared assets: SHA-256 directory hashing, `ModelPackage_AddSharedAsset` (with reproducible-build URI check and an optional `copy_in` staging mode), `ModelPackage_RemoveSharedAsset`, and `ModelPackage_ResolveAssetUri`. Assets under `<package_root>/shared_assets/` are auto-discovered at `Open`.
- Authoring: inline/external component setters, variant upsert/remove, per-namespace executor_info setters (inline and external), and package-level metadata / layout / `additional_metadata` setters. Mutations invalidate cached pointers in the mutated scope and its descendants.
- Commit / prune / validate: `ModelPackage_Commit` writes the in-memory model to disk either in place or to a fresh `dest_root` ("save as"), with `PRESERVE` or `DENSE` write modes; `ModelPackage_Prune` reclaims unreferenced files under `shared_assets/` and tracked orphan variant/component directories left behind by removals; and `ModelPackage_Validate` runs schema, path-reachability, asset-rehash, and unknown-field checks and returns a JSON report.
- Errors: opaque `ModelPackageStatus*` with a stable additive `ModelPackageErrorCode` enum (IO, schema, version, path confinement, asset missing, asset hash mismatch, not found, invalid arg, state).

**Internal layout**

The implementation is split into focused translation units: `manifest_parser`, `model_package_impl`, `authoring`, `commit_prune_validate`, `path_resolver`, `asset_hasher`, and an in-tree `sha256`. Shared error plumbing lives in `status_impl.h`.

**ONNX Runtime integration (`onnxruntime/core/session/model_package/`)**

The ORT-side glue is wired onto the library through the C inspection and path-resolution entry points. `model_package_context` now translates the library's info tree into ORT-internal structs and parses the `executor_info["ort"]` payload (`model_file`, `external_data`, `session_options`, `provider_options`). When a variant declares `external_data`, `CreateSessionForModelPackage` loads the model from a memory mapping and passes the resolved folder to the session via `kOrtSessionOptionsModelExternalInitializersFileFolderPath`, so external initializers (including those backed by a shared asset) are picked up at `Initialize` time. The experimental `OrtModelPackageApi_*_SinceV28` C entries introduced in microsoft#28990 are unchanged.

**Documentation and tests**

- `model_package/README.md` documents the on-disk layout, manifest and component schemas, shared-asset rules, path resolution, the authoring flow, and commit / prune / validate semantics.
- `onnxruntime/core/session/model_package/README.md` documents the ORT consumer-side glue: the `executor_info["ort"]` schema, the variant selection algorithm, the session-creation contract, and the registered experimental C entries.
- New library tests cover inspection, authoring, asset hashing, and commit (`model_package/tests/`). The ORT integration tests in `onnxruntime/test/autoep/test_model_package.cc` are reworked against the current C API surface.

### Motivation and Context

ORT needs a single library that owns the model package format end to end: not just reading it, but producing it, validating it, and maintaining it on disk with content-addressed shared assets. Consolidating this behind one C API lets ORT, publisher tooling, and third-party consumers share the same parser, path-resolution rules, and on-disk invariants without each reimplementing them, and keeps the library independent of the ORT session runtime.

---------

Co-authored-by: jambayk <jambayk@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: jambayk <jambay@github.com>
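The path-resolution rules listed under the public C API can be sketched as a small classifier. Everything named here (`RefKind`, `ParsedRef`, `ParseStringRef`, `HasDotDotSegment`) is hypothetical and is not the library's `ModelPackage_ResolveStringRef`. The sketch assumes POSIX-style paths, does not validate the hex digest or the asset sub-path, and does not anchor the result at a base directory.

```cpp
#include <cstddef>
#include <string>

enum class RefKind { kPath, kSharedAsset, kRejected };

struct ParsedRef {
  RefKind kind;
  std::string asset_hex;  // set for kSharedAsset
  std::string path;       // file path, or sub-path inside the asset
};

// True when any '/'-separated segment of `path` is exactly "..".
inline bool HasDotDotSegment(const std::string& path) {
  std::size_t start = 0;
  while (start <= path.size()) {
    std::size_t end = path.find('/', start);
    if (end == std::string::npos) end = path.size();
    if (path.compare(start, end - start, "..") == 0) return true;
    start = end + 1;
  }
  return false;
}

// Hypothetical classifier; it is not ModelPackage_ResolveStringRef.
inline ParsedRef ParseStringRef(const std::string& ref, bool installed_layout) {
  const std::string prefix = "sha256:";
  if (ref.compare(0, prefix.size(), prefix) == 0) {
    // sha256:<hex>[/sub/path]
    const std::size_t slash = ref.find('/', prefix.size());
    ParsedRef out{RefKind::kSharedAsset, "", ""};
    if (slash == std::string::npos) {
      out.asset_hex = ref.substr(prefix.size());
    } else {
      out.asset_hex = ref.substr(prefix.size(), slash - prefix.size());
      out.path = ref.substr(slash + 1);
    }
    return out;
  }
  // Portable layout confines references to the package: no absolute paths
  // and no ".." segments. The installed layout allows both.
  const bool absolute = !ref.empty() && ref[0] == '/';
  if (!installed_layout && (absolute || HasDotDotSegment(ref))) {
    return {RefKind::kRejected, "", ref};
  }
  return {RefKind::kPath, "", ref};
}
```

A real resolver would go on to join `kPath` results with the base directory and map `kSharedAsset` results to the asset's location under `shared_assets/`; those steps are left out here.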

…cks (microsoft#29446)

This pull request strengthens validation and error handling for the ConvTranspose operator in ONNX Runtime, particularly when using explicit `output_shape` attributes. It adds checks to prevent inconsistent or invalid configurations, improves arithmetic safety to guard against integer overflows, and introduces targeted unit tests to verify these behaviors.

**Validation and Error Handling Improvements:**

* Added stricter input validation in `ConvTransposeAttributes::ComputePadAndOutputShape` to ensure that all relevant parameters (`output_shape`, input size, stride, kernel, dilation, and output padding) are within valid ranges and to provide clear error messages when they are not. This includes checks that all values are positive and that output padding is non-negative.
* Added a consistency check to verify that the explicit `output_shape` is compatible with the input dimensions and convolution parameters, preventing buffer overruns and logical inconsistencies.

**Arithmetic Safety:**

* Updated `ComputeTotalPad` to use `SafeInt` for all intermediate arithmetic, ensuring that integer overflows are detected and handled safely instead of producing undefined behavior.

**Testing Enhancements:**

* Added unit tests for `ConvTranspose` with explicit `output_shape`, including cases for invalid, inconsistent, and overflow-prone configurations, as well as valid edge cases (e.g., 1D, 2D, and 3D, large batch sizes, group > 1, and cases requiring padding). These tests verify that invalid configurations are rejected and that valid ones work as expected.

Together, these changes make the ConvTranspose operator's handling of explicit output shapes more robust.
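The range checks and overflow-safe arithmetic described above can be sketched per spatial axis. The formula is the ONNX ConvTranspose definition of total padding for an explicit `output_shape`. `CheckedMul`, `CheckedAdd`, and `ComputeTotalPadChecked` are hypothetical stand-ins, not the `SafeInt` API or ORT's `ComputeTotalPad`. Rejecting an `out_size` larger than the unpadded extent is this sketch's own choice of consistency check and may differ from what the PR implements.

```cpp
#include <cstdint>
#include <limits>

// Overflow-checked helpers for non-negative operands. Illustrative
// stand-ins for what a SafeInt-style wrapper provides.
inline bool CheckedMul(int64_t a, int64_t b, int64_t& out) {
  if (a != 0 && b > std::numeric_limits<int64_t>::max() / a) return false;
  out = a * b;
  return true;
}

inline bool CheckedAdd(int64_t a, int64_t b, int64_t& out) {
  if (b > std::numeric_limits<int64_t>::max() - a) return false;
  out = a + b;
  return true;
}

// total_pad = stride * (in_size - 1) + output_padding
//           + (kernel - 1) * dilation + 1 - out_size
// Returns false when an argument is out of range, the arithmetic overflows,
// or the explicit out_size exceeds the unpadded output extent.
bool ComputeTotalPadChecked(int64_t in_size, int64_t stride, int64_t kernel,
                            int64_t dilation, int64_t output_padding,
                            int64_t out_size, int64_t& total_pad) {
  if (in_size <= 0 || stride <= 0 || kernel <= 0 || dilation <= 0 ||
      out_size <= 0 || output_padding < 0) {
    return false;
  }
  int64_t strided = 0, dilated = 0, extent = 0;
  if (!CheckedMul(stride, in_size - 1, strided)) return false;
  if (!CheckedMul(kernel - 1, dilation, dilated)) return false;
  if (!CheckedAdd(strided, output_padding, extent)) return false;
  if (!CheckedAdd(extent, dilated, extent)) return false;
  if (!CheckedAdd(extent, 1, extent)) return false;
  if (out_size > extent) return false;  // would need negative padding
  total_pad = extent - out_size;
  return true;
}
```

For example, with `in_size = 3`, `stride = 2`, `kernel = 3`, `dilation = 1`, and no output padding, the unpadded extent is 7, so an explicit `out_size` of 5 yields a total padding of 2.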

…osoft#29450)

Newer Homebrew versions refuse to load formulae from untrusted third-party taps. This adds `brew trust wix/brew` after tapping to allow the subsequent `brew install applesimutils` to succeed in React Native CI.

**Changes:**

- `.github/workflows/react_native.yml`
- `tools/ci_build/github/azure-pipelines/templates/react-native-ci.yml`

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

### Description

This PR introduces the following minor development patches:

- Fix OVIR EpCtx model export following the changes introduced by microsoft#28725.
- Add a test for OVIR EpCtx export on a sample model.
- Update the OV Toolkit version to 2026.2.
- Add a test for the OVEP workload type.

---------

Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com> Signed-off-by: bfilipek <bartlomiej.filipek@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com> Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com> Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com> Co-authored-by: derdeljan-msft <derdeljan@microsoft.com> Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com> Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com> Co-authored-by: Christopher Warrington <chwarr@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Ishwar Raut <iraut@nvidia.com> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Xinpeng Dou <15529241576@163.com> Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com> Co-authored-by: adrastogi <aditya.rastogi@microsoft.com> Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com> Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com> Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com> Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: Garth Long <garth.long@intel.com> Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: Vishnudas Thaniel S
<vishnudas.thaniel.s@intel.com> Co-authored-by: Javier Martinez <javier.e.martinez@intel.com> Co-authored-by: Adam Pocock <adam.pocock@oracle.com> Co-authored-by: Changming Sun <chasun@microsoft.com> Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com> Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com> Co-authored-by: Kotomi-Du <yaru.du@intel.com> Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com> Co-authored-by: liang <gxgaoliang@126.com> Co-authored-by: n1harika <niharika.sathish@intel.com> Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com> Co-authored-by: Mikhail Dvoretckii <mikhail.dvoretckii@intel.com> Co-authored-by: bopeng1234 <bo.peng@intel.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Wenqin Yang <wenqin.yang@intel.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: xieofxie <xieofxie@126.com> Co-authored-by: hualxie <hualxie@microsoft.com> Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com> Co-authored-by: Joshua Lochner <admin@xenova.com> Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com> Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com> Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com> Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com> Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Jiawei Shao <jiawei.shao@intel.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: czekun <chen.zekun@intel.com> Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com> Co-authored-by: Jaskaran Singh Nagi 
<jaskaran.singh.nagi@intel.com> Co-authored-by: ai-fw-intg <sys_ai_fw_intg@intel.com> Co-authored-by: Rajeev Sekar <rajeev.sekar@intel.com> Co-authored-by: RajeevSekar <117911837+RajeevSekar@users.noreply.github.com> Co-authored-by: Nazanin Beheshti <nazanin.beheshti@intel.com>

…29451)

### Description

Add a batched GEMV for the small-M range (M = 2..16) of CUDA `MatMulNBits` (4-bit and 8-bit), using the standard `[N, blocks, blob]` weight layout with **no prepacking**.

Previously, `MatMulNBits` had a fast single-row GEMV for M = 1 (decode), but for M > 1 it fell back to full weight dequantization + cuBLAS, which dequantizes the entire weight matrix regardless of M. That fallback is therefore flat across small M and dominates latency at the row counts that matter for multi-row decode.

The new half/bf16 path uses a `CtaM x CtaN` register-tiled kernel that streams the quantized weight once per block row and reuses each activation load across columns, so latency scales with M. M = 1 decode is unchanged. No prepacking is used, so there is no extra resident weight memory and no GEMM tactic profiling at session init.

Also adds `profile_matmul_nbits.py` (same style as `profile_qmoe_gemv.py`, parseable with `parse_nsys.py`) and a docs experiment log.

**Before (main: M > 1 dequant + cuBLAS) vs After (batched small-M GEMV)**: A100, block_size 32, fp16, average op latency in microseconds (lower is better).

#### 4-bit (M = 2..16)

Before:

| matrix  | K     | N      | M=1   | M=2    | M=4    | M=8    | M=16   |
|---------|-------|--------|-------|--------|--------|--------|--------|
| qkv     | 4096  | 4096   | 25.9  | 77.2   | 69.3   | 69.5   | 69.9   |
| o_proj  | 4096  | 4096   | 23.2  | 69.2   | 69.1   | 69.3   | 69.7   |
| gate_up | 4096  | 12288  | 39.3  | 172.0  | 172.1  | 172.1  | 172.4  |
| down    | 12288 | 4096   | 38.8  | 174.8  | 175.1  | 175.4  | 175.6  |
| lm_head | 4096  | 151936 | 301.6 | 1868.0 | 1871.1 | 1877.8 | 1885.1 |

After:

| matrix  | K     | N      | M=1   | M=2   | M=4   | M=8   | M=16   |
|---------|-------|--------|-------|-------|-------|-------|--------|
| qkv     | 4096  | 4096   | 25.4  | 30.1  | 32.3  | 43.2  | 70.0   |
| o_proj  | 4096  | 4096   | 22.6  | 26.4  | 28.1  | 36.7  | 58.7   |
| gate_up | 4096  | 12288  | 37.5  | 43.5  | 49.5  | 71.9  | 125.9  |
| down    | 12288 | 4096   | 37.9  | 47.4  | 52.9  | 76.8  | 150.1  |
| lm_head | 4096  | 151936 | 300.7 | 329.9 | 424.1 | 635.3 | 1226.9 |

Speedup (before / after):

| matrix  | M=2   | M=4   | M=8   | M=16  |
|---------|-------|-------|-------|-------|
| qkv     | 2.56x | 2.15x | 1.61x | 1.00x |
| o_proj  | 2.62x | 2.46x | 1.89x | 1.19x |
| gate_up | 3.95x | 3.48x | 2.39x | 1.37x |
| down    | 3.69x | 3.31x | 2.28x | 1.17x |
| lm_head | 5.66x | 4.41x | 2.96x | 1.54x |

#### 8-bit (M = 2..5)

8-bit weights are twice the bytes of 4-bit and the GEMV runs on CUDA cores, so it crosses over to the dequantize + cuBLAS (tensor-core) fallback at a lower M; the batched path covers M = 2..5 and M >= 6 keeps the fallback.

Before:

| matrix  | K     | N      | M=1   | M=2    | M=3    | M=4    | M=5    |
|---------|-------|--------|-------|--------|--------|--------|--------|
| qkv     | 4096  | 4096   | 36.2  | 80.6   | 72.9   | 72.8   | 73.3   |
| o_proj  | 4096  | 4096   | 31.1  | 73.2   | 72.7   | 73.6   | 73.4   |
| gate_up | 4096  | 12288  | 63.5  | 184.4  | 184.2  | 184.1  | 184.3  |
| down    | 12288 | 4096   | 67.8  | 187.3  | 187.4  | 187.8  | 188.0  |
| lm_head | 4096  | 151936 | 535.9 | 2025.0 | 2025.9 | 2028.1 | 2029.8 |

After:

| matrix  | K     | N      | M=1   | M=2   | M=3   | M=4    | M=5    |
|---------|-------|--------|-------|-------|-------|--------|--------|
| qkv     | 4096  | 4096   | 36.2  | 46.1  | 48.7  | 57.9   | 67.0   |
| o_proj  | 4096  | 4096   | 31.6  | 39.5  | 48.6  | 57.9   | 67.1   |
| gate_up | 4096  | 12288  | 63.4  | 80.9  | 104.6 | 128.9  | 152.6  |
| down    | 12288 | 4096   | 68.0  | 96.2  | 119.3 | 146.4  | 172.0  |
| lm_head | 4096  | 151936 | 536.3 | 647.0 | 896.9 | 1157.1 | 1420.1 |

Speedup (before / after):

| matrix  | M=2   | M=3   | M=4   | M=5   |
|---------|-------|-------|-------|-------|
| qkv     | 1.75x | 1.50x | 1.26x | 1.09x |
| o_proj  | 1.85x | 1.50x | 1.27x | 1.09x |
| gate_up | 2.28x | 1.76x | 1.43x | 1.21x |
| down    | 1.95x | 1.57x | 1.28x | 1.09x |
| lm_head | 3.13x | 2.26x | 1.75x | 1.43x |

### Motivation and Context

The small-M (M = 2..16) regime is exactly what speculative decoding hits when verifying a block of draft tokens: each step runs the target model on a handful of rows rather than a single token. With the previous dispatch, that step paid the full-dequant + cuBLAS cost (flat ~172 us for an MLP projection, ~1.9 ms for the lm_head even at M = 2), which erased much of the speculative-decoding speedup over greedy decoding. Routing these row counts through a batched GEMV that scales with M (2.6-5.7x faster at M = 2 for 4-bit, at or above parity through M = 16) restores the benefit, without the resident-memory and session-init costs of a prepacked weight layout.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
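To make the "dequantize on the fly and reuse across rows" idea concrete, here is a CPU reference sketch in plain C++. It is not the CUDA kernel from this PR and does not model its `CtaM x CtaN` tiling. `SmallMGemv4Bit` is a hypothetical name, and the packing details (two 4-bit values per byte with the low nibble first, one float scale per block, a fixed zero point of 8, `K` divisible by `block_size`) are assumptions of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference sketch of a multi-row GEMV over 4-bit block-quantized
// weights stored as [N, blocks, blob]. Computes y[M, N] = a[M, K] * W^T.
std::vector<float> SmallMGemv4Bit(const std::vector<float>& a,         // [M, K]
                                  const std::vector<uint8_t>& packed,  // [N, K/block, block/2]
                                  const std::vector<float>& scales,    // [N, K/block]
                                  std::size_t M, std::size_t N,
                                  std::size_t K, std::size_t block_size) {
  const std::size_t blocks = K / block_size;  // K assumed divisible
  const std::size_t blob = block_size / 2;    // bytes per 4-bit block
  std::vector<float> y(M * N, 0.0f);
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t b = 0; b < blocks; ++b) {
      const float scale = scales[n * blocks + b];
      const uint8_t* blob_ptr = &packed[(n * blocks + b) * blob];
      for (std::size_t i = 0; i < block_size; ++i) {
        const uint8_t byte = blob_ptr[i / 2];
        // Assumed packing: even element in the low nibble, odd in the high.
        const int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        const float w = static_cast<float>(q - 8) * scale;  // dequantize once
        const std::size_t k = b * block_size + i;
        // Reuse the dequantized weight for every activation row, so the
        // quantized weights are read once regardless of M.
        for (std::size_t m = 0; m < M; ++m) {
          y[m * N + n] += a[m * K + k] * w;
        }
      }
    }
  }
  return y;
}
```

The point of the structure is that the weight traffic is independent of M while the multiply-accumulate work grows with M, which is the scaling behavior the "After" tables show; the fallback instead pays a full dequantization of the weight matrix before any row is processed.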

…pport Gemma4 (microsoft#29236)

## Summary

This adds indirect dispatch support for WebGPU flash attention to enable graph capture for Gemma4. When graph capture is on, `total_seqlen` is GPU-resident and cannot be read on the CPU, so dispatch group sizes must be computed on GPU. CopyKVCache normally prepares the indirect dispatch buffer as a side effect. For Gemma4 kv_empty layers (shared KV layers that skip CopyKVCache), a dedicated `PrepareIndirectDispatchProgram` fills the indirect buffer instead, avoiding `dispatch(0)` crashes.

- Adds `PrepareIndirectDispatchProgram`: a single-thread GPU kernel that reads `total_sequence_length_input[0]` and writes 3 x uint32 dispatch dims to `indirect_buffer`, matching the logic in `CopyKVCacheProgram`
- `use_indirect_dispatch` is gated on `context.IsGraphCaptureEnabled()`: indirect dispatch is only needed when `total_seqlen` is GPU-resident; when graph capture is off, CPU-side dispatch sizing works correctly even for kv_empty layers
- `PrepareIndirectDispatchProgram` takes `total_sequence_length_input` (the global max, batch-safe) rather than `seqlen_k` (per-batch index 0, unsafe for batch > 1)

## Changes

- `flash_attention.cc`:
  - New `PrepareIndirectDispatchProgram::GenerateShaderCode`: reads `total_sequence_length_input[0]`, computes tile count, calls `populate_indirect_dispatch_buffer` (identical logic to `CopyKVCacheProgram`'s indirect dispatch block)
  - `use_indirect_dispatch` gated on `seqlen_k != nullptr && total_seqlen != nullptr && context.IsGraphCaptureEnabled()`
  - kv_empty path calls `PrepareIndirectDispatchProgram` when `use_indirect_dispatch` is true (i.e., under graph capture)
- `flash_attention.h`: Added `PrepareIndirectDispatchProgram` class declaration

## Test plan

- [x] Verified with Gemma4 INT4 model: ~95-105 tok/s with GC=ON vs ~75 tok/s with GC=OFF
- [x] Tested multiple prompts (short/long) with graph capture enabled: no crashes, correct output
- [x] Verified warmup (capture run) followed by replay produces consistent results
- [x] Unit test: `WebGPU_SharedKV_IndirectDispatchForGraphCapture` exercises the full ORT graph capture simulation path end-to-end:
  - Model built via the Graph API (opset 17) with all GQA inputs declared as proper graph inputs
  - All inputs allocated as GPU-resident tensors via `InferenceSession::GetAllocator` + `IOBinding`; `total_seqlen` is a real `WGPUBuffer` so `PrepareIndirectDispatchProgram` reads it from GPU
  - WebGPU EP registered with `kEnableGraphCapture=ON`; first `Run` captures, second `Run` replays
  - Replay output copied GPU→CPU and compared against a CPU reference to verify correctness
  - Covers the kv_empty branch specifically: `key`/`value` have `sequence_length=0`, triggering the `PrepareIndirectDispatchProgram` code path (CopyKVCache is skipped for kv_empty layers)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

…dation (microsoft#29462)

### Description

Two copy-paste typos in `DynamicQuantizeLSTM::Compute` (`dynamic_quantize_lstm.cc`) cause the recurrence-weight zero-point and scale to be validated against the wrong tensor's shape:

1. Line 181: `R_zp_shape = w_zp->Shape()` → `r_zp->Shape()`. With the wrong shape, `ZeroPointCheck` iterates `w_zp`'s element count over the smaller `r_zp` tensor, reading past it (OOB read).
2. Line 188: `WeightCheck(W_scale_shape, R_scale)` → `WeightCheck(R_scale_shape, R_scale)`. With the wrong argument, the recurrence scale is validated against the input scale's shape instead of its own.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

### Description

Fixes an out-of-bounds (OOB) read in the CPU `GroupQueryAttention` (GQA) operator. During token generation (decode), `ConcatStateChunkGQA` copies `seqlens_k + 1 - sequence_length` rows out of the past key/value buffers. The existing validation only bounded the *present* buffer write (`seqlens_k < present_kv_seqlen`), but the present buffer can be larger than the past buffer when `total_sequence_length` exceeds the past sequence length. A large `seqlens_k` combined with a small past buffer could therefore read past the end of the past key/value tensors.

### Motivation and Context

`present_kv_seqlen = max(total_sequence_length, past_sequence_length)`, so the pre-existing `seqlens_k < present_kv_seqlen` check does not bound the past-side read when `total_sequence_length > past_sequence_length`. With a crafted `seqlens_k`, the decode path in `ConcatStateChunkGQA` reads `seqlens_k + 1 - sequence_length` rows from the smaller past buffer, causing an OOB read.

### Key Changes

| File | Change |
|---|---|
| `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc` | Add a per-batch bound in the `seqlens_k` validation block: for the decode case (`past_kv_seqlen > 0 && kv_sequence_length != 0 && !is_first_prompt`), reject when `seqlens_k + 1 - sequence_length > past_kv_seqlen`. |
| `onnxruntime/test/contrib_ops/group_query_attention_op_test.cc` | Add regression test `SeqlensKExceedsPastBuffer_OOBRead` exercising a large `seqlens_k` against a small past buffer. |

Shared KV (`kv_sequence_length == 0`) appends no new KV and its past read is already bounded by the present-buffer check together with the `total_sequence_length <= seqlen_past_kv_cache` enforcement in the apply-attention paths, so it needs no additional check.
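
The added bound can be sketched as a standalone predicate. The function name and parameter list below are hypothetical and do not reflect the signature used in `group_query_attention.cc`; only the decode condition and the inequality are taken from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the decode-path read of past KV rows stays inside the
// past buffer. Hypothetical helper; variable names mirror the PR description.
bool PastReadIsInBounds(int64_t seqlens_k, int64_t sequence_length,
                        int64_t kv_sequence_length, int64_t past_kv_seqlen,
                        bool is_first_prompt) {
  assert(sequence_length >= 0 && past_kv_seqlen >= 0);
  const bool is_decode =
      past_kv_seqlen > 0 && kv_sequence_length != 0 && !is_first_prompt;
  if (!is_decode) {
    return true;  // the new bound applies only to the decode case
  }
  // Number of rows the decode path copies out of the past buffers.
  const int64_t past_rows_read = seqlens_k + 1 - sequence_length;
  return past_rows_read <= past_kv_seqlen;
}
```

For example, with `sequence_length = 1` and `past_kv_seqlen = 4`, `seqlens_k = 4` reads exactly four past rows and is accepted, while `seqlens_k = 100` would read 100 rows and is rejected. 64-bit arithmetic is used here so the subtraction cannot overflow for 32-bit inputs.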
### Testing Notes

- Build and run the GQA tests:
  ```
  cmake --build build/Linux/Debug --target onnxruntime_provider_test -j$(nproc)
  ./build/Linux/Debug/onnxruntime_provider_test --gtest_filter="*GroupQueryAttention*"
  ```
- All 41 CPU GQA tests pass, including the new `SeqlensKExceedsPastBuffer_OOBRead` regression and the existing shared-KV CPU cases.
- `lintrunner` reports no issues on the changed files.

…9448)

# PR: Fix negative-axis handling in ExpandDims shape inference

## Description

The `com.microsoft.ExpandDims` type/shape-inference function mishandled negative `axis` values, which could lead to an out-of-bounds read during graph resolution (`Graph::Resolve`). This PR corrects the axis normalization, makes the scalar `axis` read robust to both tensor encodings, and adds a regression test that exercises the shape-inference path.

## Summary of Changes

### Fix

| File | Change |
|------|--------|
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Normalize a negative `axis` against the output rank (`rank + axis + 1`) instead of the off-by-two `rank + axis - 1`, so the insertion index stays within `[0, rank]`. Read the scalar `axis` via the existing `ParseScalar` helper, which handles both `raw_data` and `int32_data` encodings and validates the element count. |

### Test

| File | Change |
|------|--------|
| `onnxruntime/test/contrib_ops/expand_dims_test.cc` | Add `ExpandDimsTest.NegativeAxisConstInitializerShapeInference` plus a `RunExpandDimsConstAxisTest` helper that supplies `axis` as a constant initializer so the operator's shape-inference function is exercised (the existing tests pass `axis` as a runtime input, which skips that path). |

## Details

- For an output of `rank + 1` dimensions, a negative `axis` must be normalized as `axis + (rank + 1)`. The previous `rank + axis - 1` formula produced a negative insertion index for the most-negative valid axes, which was then used to index the protobuf dimension list out of bounds.
- The axis value was previously read with `int32_data()[0]`. When the value is stored as `raw_data` (the common encoding for serialized models and the one produced by the test harness), `int32_data()` is empty and the access is out of bounds. `ParseScalar` decodes either encoding and validates the count.
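
The corrected normalization can be checked with a small sketch. `TryNormalizeExpandDimsAxis` is a hypothetical helper, not the code in `contrib_defs.cc`, and the accepted range `[-(rank + 1), rank]` is an assumption inferred from the statement that the insertion index must stay within `[0, rank]`.

```cpp
#include <cassert>
#include <cstdint>

// Maps an ExpandDims axis to an insertion index in [0, input_rank].
// Returns false when the axis falls outside the assumed valid range.
bool TryNormalizeExpandDimsAxis(int64_t axis, int64_t input_rank,
                                int64_t& normalized) {
  assert(input_rank >= 0);
  const int64_t output_rank = input_rank + 1;  // one dimension is inserted
  if (axis < -output_rank || axis > input_rank) {
    return false;
  }
  // Negative axes count back from the end of the output shape.
  normalized = axis < 0 ? axis + output_rank : axis;
  return true;
}
```

For an input of rank 2, `axis = -3` normalizes to `-3 + 3 = 0`, whereas the previous formula gives `2 + (-3) - 1 = -2`, a negative index. Likewise `axis = -1` normalizes to 2 (append at the end) rather than 0.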
## Testing

- Built `onnxruntime_provider_test` and ran the ExpandDims suite: `./onnxruntime_provider_test --gtest_filter="ExpandDimsTest.*"`; all 6 tests pass.
- Confirmed the new regression test fails (process aborts) without the fix and passes with it.
- Existing positive/negative out-of-range and kernel tests are unchanged.

## Checklist

- [x] Tests added/updated
- [ ] Documentation updated (if applicable)
- [x] No breaking changes
- [ ] CI passes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

… from candidate strings (microsoft#28387)

### Description

This PR adds a new EP API, `SelectBestModelCandidate`, that selects the best model variant from a set of candidates described by key-value metadata. The EP evaluates each candidate's metadata against the given hardware device and optional session options, and returns the index of the best match.

**Key design points:**

- Each candidate is an `OrtKeyValuePairs` representing one model variant. This future-proofs the API: additional metadata keys can be added over time without changing the function signature.
- **Single-model variants (simple case):** The KVP contains a single `ep_compatibility_info` key with the compatibility string from the ONNX model metadata.
- **Multi-model variants:** When a variant bundles multiple sub-models (e.g., prefill + decode in a GenAI scenario), the KVP uses indexed keys so the EP can inspect each sub-model independently:
  - `num_models`: number of sub-models (e.g., "3")
  - `<i>.ep_compatibility_info`: compatibility string for sub-model *i* (required per sub-model)
  - `<i>.role`: role of sub-model *i*, e.g., "prefill", "decode" (optional)
  - `<i>.future_meaningful_info`: additional EP-meaningful metadata for sub-model *i* (optional)

  A basic EP implementation validates all `<i>.ep_compatibility_info` entries. An advanced implementation can also consider `role` or other metadata for smarter ranking.
- This approach delegates variant selection entirely to the EP, which has the domain knowledge to handle structurally mismatched variants (different sub-model counts, different roles, etc.) without ORT needing to understand model roles or compute aggregated scores.

### Motivation and Context

The existing `ValidateCompiledModelCompatibilityInfo()` alone is not sufficient for some EPs to determine the best compatible model when there are multiple candidates. For example, an EP may support multiple compilation modes (e.g., "speed optimized" vs "memory optimized") that produce different compatibility strings. The EP can implement this function to evaluate the candidate metadata and select the best compatible variant based on its own criteria and the target device.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
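
The "basic EP implementation" mentioned above could look roughly like the following sketch. It does not use the real ORT API: `KeyValuePairs` is a `std::map` standing in for `OrtKeyValuePairs`, `CompatFn` is a hypothetical EP-specific predicate, sub-model indices are assumed to start at 0, and returning -1 for "no compatible candidate" is an assumption. Only the key names come from the description.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <map>
#include <string>
#include <vector>

using KeyValuePairs = std::map<std::string, std::string>;  // stand-in for OrtKeyValuePairs
using CompatFn = std::function<bool(const std::string&)>;  // hypothetical EP check

// True when every compatibility string in the candidate passes the EP check.
bool CandidateIsCompatible(const KeyValuePairs& kvp, const CompatFn& is_compatible) {
  assert(is_compatible && "a compatibility predicate is required");
  const auto num_it = kvp.find("num_models");
  if (num_it == kvp.end()) {
    // Single-model variant: one un-indexed key.
    const auto it = kvp.find("ep_compatibility_info");
    return it != kvp.end() && is_compatible(it->second);
  }
  int num_models = 0;
  try {
    num_models = std::stoi(num_it->second);
  } catch (const std::exception&) {
    return false;  // malformed count
  }
  if (num_models <= 0) return false;
  // Multi-model variant: each sub-model must carry a passing compatibility string.
  for (int i = 0; i < num_models; ++i) {
    const auto it = kvp.find(std::to_string(i) + ".ep_compatibility_info");
    if (it == kvp.end() || !is_compatible(it->second)) return false;
  }
  return true;
}

// Returns the index of the first compatible candidate, or -1 if none qualifies.
int SelectFirstCompatibleCandidate(const std::vector<KeyValuePairs>& candidates,
                                   const CompatFn& is_compatible) {
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    if (CandidateIsCompatible(candidates[i], is_compatible)) return static_cast<int>(i);
  }
  return -1;
}
```

Picking the first compatible candidate is the simplest possible policy. An EP that ranks variants, for instance by reading `<i>.role`, would replace the loop in `SelectFirstCompatibleCandidate` with its own scoring.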

### Description

Use GoogleTest sharding (`GTEST_TOTAL_SHARDS` and `GTEST_SHARD_INDEX`) to split the unit tests into multiple runs in the Windows ASan CI build.

### Motivation and Context

AddressSanitizer runs out of memory. Splitting the tests into multiple runs seems to mitigate it.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
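
Sharding works because each process runs only a disjoint subset of the tests, so memory is spread over several shorter runs. The sketch below illustrates that partitioning with a round-robin rule. The `TestsForShard` helper is hypothetical, and round-robin by index is an assumption about the assignment; the property that matters is that every test lands on exactly one shard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the tests one shard would run under round-robin assignment.
// total_shards and shard_index play the roles of GTEST_TOTAL_SHARDS and
// GTEST_SHARD_INDEX; the helper itself is illustrative only.
std::vector<std::string> TestsForShard(const std::vector<std::string>& tests,
                                       std::size_t total_shards,
                                       std::size_t shard_index) {
  assert(total_shards > 0 && shard_index < total_shards);
  std::vector<std::string> selected;
  for (std::size_t i = 0; i < tests.size(); ++i) {
    if (i % total_shards == shard_index) {
      selected.push_back(tests[i]);
    }
  }
  return selected;
}
```

With five tests and two shards, shard 0 receives three tests and shard 1 the remaining two. Together the shards cover the full list with no overlap.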

titaiwangms and others added 18 commits June 30, 2026 20:40

Merge remote-tracking branch 'origin/master' into sync_msft_04072026

86a927b

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel July 3, 2026 20:34

ai-fw-intg wants to merge 18 commits into ovep-develop from sync_msft_04072026

ai-fw-intg commented Jul 3, 2026


11 participants
