Sync with Microsoft ONNX Runtime - 24062026 by ai-fw-intg · Pull Request #1159 · intel/onnxruntime

ai-fw-intg · 2026-06-23T20:34:15Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

### Description

- Split out C++ code into a separate C++ experimental header, `onnxruntime_experimental_cxx_api.h`. This header can also contain other auxiliary experimental API-related C++ code.
- Add throwing C++ experimental function accessors. The `Get_X_FnOrThrow()` variant throws an exception if the experimental API is unavailable in the build.

### Motivation and Context

Improve experimental API ergonomics for C++.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
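The throwing-accessor idea described above can be sketched as follows. This is a minimal illustration, not the code in `onnxruntime_experimental_cxx_api.h`: the `ExperimentalApi` struct, the `add_one` entry, both accessor names, and the `available` flag are hypothetical stand-ins chosen only to show the pattern.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a table of experimental API function pointers.
struct ExperimentalApi {
  int (*add_one)(int);  // illustrative entry, not a real ONNX Runtime function
};

// Non-throwing accessor: returns nullptr when the API is not part of the build.
// The `available` flag simulates a build-time switch for this sketch.
inline const ExperimentalApi* GetExperimentalApiOrNull(bool available) {
  static const ExperimentalApi api{[](int x) { return x + 1; }};
  return available ? &api : nullptr;
}

// Throwing accessor: callers skip the null check and handle one exception type.
inline const ExperimentalApi& GetExperimentalApiOrThrow(bool available) {
  const ExperimentalApi* api = GetExperimentalApiOrNull(available);
  if (api == nullptr) {
    throw std::runtime_error("experimental API is unavailable in this build");
  }
  assert(api->add_one != nullptr);
  return *api;
}
```

With an accessor of this shape, a call such as `GetExperimentalApiOrThrow(true).add_one(41)` needs no pointer check at the call site, which is the ergonomic gain the description refers to.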

…oft#29218)

### Description

The `GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA` test (added in microsoft#29002) fed **fp32** inputs via `AddInput<float>`. The CUDA (and WebGPU) GroupQueryAttention kernels only register for `MLFloat16`/`BFloat16`, so the fp32 node silently fell back to the **CPU EP**; the `_CUDA` test never actually exercised the CUDA kernel it is named for. This surfaced as a CI failure on the CUDA test leg after microsoft#29002 and microsoft#29046 merged.

This PR makes `RunGQAPackedQKVRotaryPrefill` feed **fp16** tensors when targeting CUDA EP, matching the existing `RunGQASharedKVFp16` convention and the test's own "loose enough for fp16 rounding" tolerance. The CPU code path is unchanged.

### Key Changes

- `RunGQAPackedQKVRotaryPrefill` now branches on the target EP:
  - CUDA EP: inputs/outputs use `MLFloat16` (converted via `ToFloat16`), so the node is placed on the real GPU kernel.
  - WebGPU/CPU EP: unchanged (`float`).
- Output is converted back to `float` for the existing comparison logic.

### Testing

- `onnxruntime_provider_test --gtest_filter='GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA'` → **PASSED** (now runs on the CUDA fp16 kernel).
- Full `GroupQueryAttentionTest.*` suite → 47 passed, WebGPU-only tests skipped locally (no WebGPU EP), no regressions.

### Motivation and Context

Restores genuine CUDA kernel coverage for the right-padded rotary prefill scenario and fixes the CI failure. Related: microsoft#29002, microsoft#29046.

### Description

PR that introduced the issue: microsoft#29064

Fixes in this PR:

1. Add the relevant platform guard that was previously missing in some tests.
2. Move the new AVX512 headers that host the kernels to their correct location within the CMake file grouping. They were previously placed in the AVX2 grouping. The TU that included those headers was ultimately compiled with AVX512 flags, so no harm was done; this fix is more pedantic than a fix for a real issue. The lone .cpp file in that list does not include any intrinsics manually, but the compiler might now use AVX512 for auto-vectorization with the shuffling. Since that file contains only the pre-packing functions that are used in production, it is safe. The "scalar" kernel implementation in that file is mostly a test oracle and nothing else.

Sample failed run (before PR): https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1268723&view=results

Sample successful run (with PR): https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1273196&view=results

### Motivation and Context

Fix the Python packaging pipeline.

## Description

Adds CUDA support for GroupQueryAttention QK-Norm by applying per-head Q/K RMSNorm before RoPE in the fused preprocess path. It also enables the pre-norm graph fusion for CUDA and allows non-quantized QK-Norm decode to use XQA, restoring the fast global decode path for GPT-OSS/Qwen-style shapes while keeping quantized-cache QK-Norm on the existing fallback path until scale handling is validated.

## Summary of Changes

### CUDA GroupQueryAttention

- Threads q_norm_weight / k_norm_weight and qk_norm_epsilon through CUDA GQA data/parameters.
- Applies FP32 per-head RMSNorm to Q/K in UnpackRoPEAppend before RoPE and KV append.
- Adds shared-KV Q-only normalization support.
- Enables non-quantized QK-Norm decode to route through XQA after the fused preprocess normalizes Q/K.
- Keeps quantized-cache QK-Norm decode gated off XQA pending normalized-K scale validation.

### Fusion and Schemas

- Enables GroupQueryAttentionPreNormFusion for CUDA and native WebGPU.
- Updates contrib operator schema text and generated ContribOperators.md to document CUDA/native WebGPU QK-Norm support.
- Updates CPU/JSEP rejection text for unsupported providers.

### Tests, Docs, and Profiling

- Adds CUDA optimizer coverage for the pre-norm fusion.
- Adds Python GQA QK-Norm parity coverage, including explicit FP16/BF16 XQA decode tests.
- Extends GQA profiling helpers with QK-Norm options and documents CUDA GQA behavior in docs/contrib_ops/cuda/gqa.md.

## Testing

- Built: `ninja onnxruntime_providers_cuda onnxruntime_test_all` in `build/cu130/Release`.
- Ran: `./onnxruntime_test_all --gtest_filter="GraphTransformationTests.GroupQueryAttentionPreNormFusion*"` (11 passed, 2 WebGPU skips).
- Ran: `python -m pytest test_gqa.py::TestGQAQKNorm::test_gqa_qk_norm_past_xqa test_gqa.py::TestGQAQKNorm::test_gqa_qk_norm_past_xqa_bf16 -q` (2 passed).
- Ran: `python -m pytest test_gqa.py -k "QKNorm" -q` (38 passed).
- Ran: `git diff --check`.
- Verified routing with `ORT_ENABLE_XQA=1 ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1`: FP16 and BF16 QK-Norm decode report `SdpaKernel=XQA`.
- Profiled GPT-OSS-like packed FP16 shape (`B=1,S=1,past=2048,N=64,Nkv=8,H=64,head_sink,QK-Norm`) with nsys: `H64::grp8_fp16::kernel_mha` averaged ~8.21 us and `UnpackRoPEAppend<half, half, 16, 64>` averaged ~2.94 us.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [x] No breaking changes
- [x] CI passes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
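For readers unfamiliar with the per-head RMSNorm step mentioned above, here is a minimal CPU sketch of the arithmetic for one head. It assumes the common RMSNorm form `x / sqrt(mean(x^2) + epsilon) * weight`; the function name `RmsNormHead` is hypothetical, and the actual CUDA code in `UnpackRoPEAppend` is fused with RoPE and KV append and is not reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalizes one attention head in place, accumulating in FP32:
//   head[i] = head[i] / sqrt(mean(head^2) + epsilon) * weight[i]
// `weight` plays the role of q_norm_weight or k_norm_weight for that head.
inline void RmsNormHead(std::vector<float>& head, const std::vector<float>& weight, float epsilon) {
  assert(!head.empty() && head.size() == weight.size());
  float sum_sq = 0.0f;
  for (float v : head) {
    sum_sq += v * v;
  }
  const float mean_sq = sum_sq / static_cast<float>(head.size());
  const float inv_rms = 1.0f / std::sqrt(mean_sq + epsilon);
  for (std::size_t i = 0; i < head.size(); ++i) {
    head[i] = head[i] * inv_rms * weight[i];
  }
}
```

In this sketch the normalization runs before any rotary embedding is applied, which mirrors the ordering the description states (RMSNorm, then RoPE, then KV append).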

…mbeddings (microsoft#29069)

### Description

Fixes NaN output in the CPU GQA kernel when running batched right-padded prefill. For padding token positions where `seq_causal_length > total_seqlen`, the softmax loop was reading beyond the GEMM-filled region of the attention probs buffer into uninitialized memory, producing NaN values that propagated through the V GEMM to the output.

**Root cause:** In `ComputeAttentionProbs`, `seq_causal_length = causal_past_seqlen + seq + 1` grows with each query position. For right-padded batches, a batch item with `real_len < sequence_length` has `total_seqlen = real_len`, but padding positions still iterate up to `sequence_length`, giving `seq_causal_length > total_seqlen`. The QK GEMM only fills columns `[0, total_seqlen)`; positions beyond that are uninitialized.

**Fix:** Cap the effective causal length at `total_seqlen` before computing the softmax window:

```cpp
// gqa_attention_base.h - both float and quantized paths
const size_t effective_causal_length = std::min(seq_causal_length, total_seqlen);
// use effective_causal_length for: local window check, start_offset, window_size, masking loops
```

Applied to both the non-quantized float path (~line 1097) and the quantized MLAS path (~line 436).

### Motivation and Context

The new test `GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CPU` (added in this PR) exercises batched GQA with heterogeneous real sequence lengths `{4, 2, 6}` padded to `sequence_length=6`. Batch item 1 (`real_len=2`) has padding tokens at positions 2–5; position 3 triggered the NaN via uninitialized attention probs memory.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
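To make the effect of the cap concrete, the following is a simplified single-row sketch, not the kernel in `gqa_attention_base.h`. The function name `CausalSoftmaxRow` is hypothetical, local-window handling is omitted, and zeroing the tail of the row is a choice made for this illustration. NaN entries stand in for the uninitialized columns beyond `total_seqlen`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over one row of attention scores. Only columns [0, total_seqlen) were
// filled by the QK GEMM, so the window is capped there even when the causal
// length for this query position is larger.
inline void CausalSoftmaxRow(std::vector<float>& row, std::size_t seq_causal_length,
                             std::size_t total_seqlen) {
  assert(total_seqlen <= row.size());
  const std::size_t effective = std::min(seq_causal_length, total_seqlen);
  const auto window_end = row.begin() + static_cast<std::ptrdiff_t>(effective);
  if (effective > 0) {
    const float max_val = *std::max_element(row.begin(), window_end);
    float sum = 0.0f;
    for (std::size_t i = 0; i < effective; ++i) {
      row[i] = std::exp(row[i] - max_val);
      sum += row[i];
    }
    for (std::size_t i = 0; i < effective; ++i) {
      row[i] /= sum;
    }
  }
  // Columns outside the window never contribute to the output.
  std::fill(window_end, row.end(), 0.0f);
}
```

Without the `std::min`, a padding position with `seq_causal_length = 4` and `total_seqlen = 2` would pull the two NaN columns into the maximum and the sum, which is the failure mode the description reports.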

### Description

ONNX added a `MaxPool-22` schema, but the DropQDQ selectors still matched only `MaxPool-12`, so the `DequantizeLinear` / `QuantizeLinear` around an opset-22 `MaxPool` were no longer dropped. This updates both DropQDQ selector op-version maps to `{12, 22}`, matching the QDQ propagation pass. (`MaxPool` stays pinned to integer-capable versions ≥ 12.)

### Motivation and Context

Fixes microsoft#28770. Models that optimized to full integer at opset ≤ 21 regressed at opset 22+, blocking users from upgrading their opset.
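The regression can be illustrated with a small version-map lookup. This is a sketch under assumptions: the `OpVersionsMap` alias and `SelectorMatches` function below are simplified, hypothetical stand-ins, and the real selector registration in ONNX Runtime carries more state than an op type and a version list.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified map from op type to the schema versions a selector accepts.
using OpVersionsMap = std::map<std::string, std::vector<int>>;

// Returns true when the node's op type and schema version are both listed.
inline bool SelectorMatches(const OpVersionsMap& supported, const std::string& op_type,
                            int since_version) {
  assert(since_version > 0);
  const auto it = supported.find(op_type);
  if (it == supported.end()) {
    return false;
  }
  const std::vector<int>& versions = it->second;
  return std::find(versions.begin(), versions.end(), since_version) != versions.end();
}
```

With a map of `{"MaxPool", {12}}`, a node resolved to the version-22 schema is rejected, so its surrounding Q/DQ pair is left in place; extending the entry to `{12, 22}` lets the same lookup succeed.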

…x output in data propagation (microsoft#29084)

## Summary

A spec-valid `Shape → Gather(1-D index [-1]) → TopK` model fails to load since ORT 1.25.0 with:

```
K input must be a one-dimensional tensor of size 1.
```

The model is valid: a rank-1 (single-element) Gather index produces a rank-1 Gather output, so the value feeding TopK's `K` input is a 1-D size-1 tensor, exactly what TopK requires. The failure was an **ORT rank-preservation bug in shape-inference data propagation**, not a problem with the model.

**Root cause.** `GatherOpDataPropagation::infer()` routed by element **count** rather than index **rank**: it guarded on `indices.size() == 1`, which is true for *both* a 0-D scalar index and a 1-D single-element index, and then unconditionally called `SetInferredShapeScalarValue()`. That dropped the rank of the spec-valid 1-D size-1 case, so `Graph::getInputData()` emitted a 0-D (dimensionless) propagated value. ONNX TopK shape inference then correctly rejected the 0-D `K`. This path was introduced by microsoft#26269 (partial data propagation to enhance shape inference).

This reproduces even at `GraphOptimizationLevel.ORT_DISABLE_ALL`, where constant folding never runs, confirming that the cause is data propagation in shape inference, **not** constant folding (microsoft#26345 was an earlier mis-attribution; see the corrected analysis).

Fixes the regression reported in microsoft#29072. Corrected root-cause analysis: microsoft#29072 (comment)

## The fix

- **Gather: rank-based routing.** Distinguish the index rank instead of its element count. A genuine 0-D scalar index still stores a scalar value; a rank-1 single-element index now stores a **rank-1** value, so `getInputData()` emits a `TensorProto` with `dims=[1]` and downstream TopK sees a valid 1-D size-1 `K`. The index rank is taken from the same constant initializer the index value comes from (via `get_initialized_input_values` now reporting the initializer rank), rather than a second, independently-resolved `NodeArg` shape, removing a potential source-of-truth drift (EDGE #2).
- **Rank-tolerant elementwise companion (Add/Sub/Mul/Div).** These ops were scalar-only and would silently stop propagating once an operand became a rank-1 value (e.g. a `Shape → Gather(1-D idx) → Mul → TopK` chain), because the custom-propagation result replaces ONNX's rank-correct fallback. They now accept a single element carried as either a rank-0 scalar or a rank-1 `[1]` value and keep the output rank consistent with ONNX broadcasting (rank-1 if any operand is rank-1, else scalar), so such chains keep propagating end-to-end. Div additionally guards against division by zero.
- **Shared helper (`data_propagation_value_utils.h`).** Centralizes reading/writing a single-element shape value while preserving its rank, used by both the Gather producer and the elementwise consumers so they cannot disagree on rank. The reader **declines** a rank-1 multi-element value (it must never collapse to `element[0]`), so a multi-element value can never be mistaken for a single one.

## Testing

Five `ShapeInferenceV2Test` cases (with fixtures + generators), all loading the model at **every** optimization level (including `ORT_DISABLE_ALL`):

- `GatherToTopKRankPreservationTest`: the core `Shape → Gather([-1]) → TopK` regression; asserts the rank-1 `K` is preserved.
- `GatherMulToTopKRankPreservationTest`: the `… → Gather(1-D idx) → Mul → TopK` chain; asserts propagation survives the elementwise op.
- `SinglePropagatedShapeValueGuardTest`: a direct unit test pinning the shared reader's behavior on each channel (scalar, rank-1 single-element, rank-1 multi-element, symbolic, empty). **Mutation-proven**: relaxing the `dim_size()==1` guard makes this test fail and restoring it makes it pass, so the guard the whole fix hinges on is test-locked.
- `ShapeMulMultiElementNoScalarCollapseTest`: end-to-end check that a multi-element `Shape → Mul → ConstantOfShape` chain still resolves to its full rank-2 shape (no bogus scalar collapse).
- `PartialDataPropagationTest`: pre-existing scalar-index coverage, unchanged.

Full `onnxruntime_test_all` suite passes (0 failures) on top of the current `main` (opset-27 / ONNX 1.22.0 integration). The constant-folding memory path (microsoft#26345) is untouched; the diff is confined to `data_propagation/`, a small `graph.cc` change, and tests.

## Follow-ups (intentionally out of scope for this PR)

- Hardening for a rank ≥ 2 single-element index (e.g. shape `[1,1]`) to *decline* rather than route as rank-1. This needs its own discriminating unit test; the case is pathological/non-exporter, and the worst case is degraded inference rather than a crash.
- Explicit end-to-end coverage for Add/Sub/Div rank-1 chains (the shared-reader unit test already covers the read path for all four ops; only Mul is currently exercised end-to-end).
- Minor readability nits.

## DCO

Commit is DCO signed-off.

---------

Signed-off-by: titaiwangms <titaiwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
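The rank-based routing and the guarded reader described above can be sketched as follows. All names here (`PropagatedValue`, `GatherSingle`, `ReadSingleElement`) are hypothetical and do not reproduce `data_propagation_value_utils.h`. Declining a rank ≥ 2 index is a choice made for this sketch only; the description lists that hardening as a follow-up, not as part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical propagated value: empty `dims` means a rank-0 scalar,
// dims == {1} means a rank-1 tensor holding a single element.
struct PropagatedValue {
  std::vector<int64_t> dims;
  std::vector<int64_t> data;
};

// Gathers one element of a propagated shape and keeps the rank of the index.
inline std::optional<PropagatedValue> GatherSingle(const std::vector<int64_t>& shape_values,
                                                   int64_t index, std::size_t index_rank) {
  const int64_t n = static_cast<int64_t>(shape_values.size());
  if (index < 0) {
    index += n;  // ONNX Gather accepts negative indices
  }
  if (index < 0 || index >= n) {
    return std::nullopt;
  }
  const int64_t value = shape_values[static_cast<std::size_t>(index)];
  if (index_rank == 0) {
    return PropagatedValue{{}, {value}};  // scalar index -> scalar output
  }
  if (index_rank == 1) {
    return PropagatedValue{{1}, {value}};  // 1-D size-1 index -> dims=[1]
  }
  return std::nullopt;  // rank >= 2: declined in this sketch
}

// Reads a single element only when the value really holds exactly one.
inline std::optional<int64_t> ReadSingleElement(const PropagatedValue& v) {
  const bool scalar = v.dims.empty();
  const bool rank1_single = v.dims.size() == 1 && v.dims[0] == 1;
  if ((scalar || rank1_single) && v.data.size() == 1) {
    return v.data[0];
  }
  return std::nullopt;  // never collapse a multi-element value to element[0]
}
```

Routing on `index_rank` rather than on the element count is what separates the two cases that the original `indices.size() == 1` guard conflated.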

… empty (microsoft#28733)

### Description

In `PadNodeGroupSelector::Check`, instead of calling `CheckQDQNodes` with:

```cpp
int num_dq_inputs = static_cast<int>(dq_nodes.size());
// ...
CheckQDQNodes(graph_viewer, node, redundant_clip_node, dq_nodes, q_nodes, num_dq_inputs)
```

it should be called with:

```cpp
CheckQDQNodes(graph_viewer, node, redundant_clip_node, dq_nodes, q_nodes)
```

because otherwise nothing checks for the case where `dq_nodes` is empty.

### Motivation and Context

See issue microsoft#28717 for reference. When checking a Pad node for fusion with Q-DQ nodes in `PadNodeGroupSelector::Check`, if there is no upstream DQ node, `onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc`:900 will provoke a segmentation fault:

```cpp
const int32_t dt_input_1 = dq_nodes[0]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
```

because nothing checks `dq_nodes` for emptiness beforehand.
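Why passing `dq_nodes.size()` as the expected count hides the empty case can be shown with a reduced model. Everything below is hypothetical: `CheckDqCount` and `FirstDqElemType` are illustrative helpers, and how the real `CheckQDQNodes` derives its default expected count is an assumption, since the source does not show it.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical simplified count check. A negative `expected_dq_inputs` means
// "use the count derived from the node itself" (`expected_from_node`).
inline bool CheckDqCount(int expected_from_node, std::size_t dq_count,
                         int expected_dq_inputs = -1) {
  if (expected_dq_inputs < 0) {
    expected_dq_inputs = expected_from_node;
  }
  return static_cast<int>(dq_count) == expected_dq_inputs;
}

// Returns the element type of the first DQ input only after the count check,
// so an empty list is rejected before index 0 is touched.
inline std::optional<int> FirstDqElemType(const std::vector<int>& dq_elem_types,
                                          int expected_from_node) {
  assert(expected_from_node >= 1);
  if (!CheckDqCount(expected_from_node, dq_elem_types.size())) {
    return std::nullopt;
  }
  return dq_elem_types[0];
}
```

In this model, `CheckDqCount(1, 0, 0)` succeeds because the expected count was taken from the empty list itself, while `CheckDqCount(1, 0)` fails and stops the selector before the `dq_nodes[0]` dereference.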

This pull request refactors the `CreateGetVectorOfMapsStringFloat` test in `test_nontensor_types.cc` to improve type safety and clarity by switching from `std::vector` to `std::array` for fixed-size data and updating tensor creation logic. The changes also fix the string comparison to use the correct array.

**Test improvements and type safety:**

* Replaced `std::vector` with `std::array` for `dims` and `values` variables, making the code safer and more efficient for fixed-size data.
* Updated the creation of the string tensor by using `Ort::AllocatorWithDefaultOptions()` and filling the tensor with `FillStringTensor`, aligning with best practices for string tensor creation.

**Test correctness:**

* Fixed the assertion to compare against `keys_arr` instead of the undefined `keys` vector, ensuring the test checks the correct set of keys.

**Header inclusion:**

* Added `#include <array>` at the top of the file to support the use of `std::array`.

### Description

The `coreml_proto` target must be properly exported to `onnxruntimeTargets`. Its include directories must not contain any paths from the build tree during the install phase.

### Motivation and Context

This change fixes a build failure on macOS when attempting to create a static library of ONNX Runtime with the CoreML Execution Provider (EP) using the following command:

```sh
./build.sh --use_coreml --cmake_extra_defines CMAKE_POLICY_VERSION_MINIMUM=3.5
```

**Note**: `CMAKE_POLICY_VERSION_MINIMUM=3.5` is required in a modern macOS environment. The current version of CMake available on Homebrew is `4.3.1`, and it does not allow `cmake_minimum_required(3.5)`. Ignoring `cmake_minimum_required(3.5)` in the ONNX Runtime tree does not seem to be harmful.

The build fails because `coreml_proto` has no export set specified, which violates CMake's requirement that exported targets must not reference paths from the build tree when installed. Currently, `coreml_proto` depends on `${CMAKE_CURRENT_BINARY_DIR}` to locate generated Protobuf definitions. I suppose these definitions are required only during the build phase and are not needed for installation.

This change set resolves it by:

- Exporting `coreml_proto` to `${PROJECT_NAME}Targets`.
- Replacing `${CMAKE_CURRENT_BINARY_DIR}` with `$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}>`.

Signed-off-by: Kaito Udagawa <umireon@kaito.tokyo>

LGTM

hdharpure9922

LGTM

Sync msft 24062026

edgchen1 and others added 11 commits June 22, 2026 13:56

Merge remote-tracking branch 'origin/master' into sync_msft_24062026

8a61094

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel June 23, 2026 20:34

hdharpure9922 self-requested a review June 24, 2026 04:56

Merge branch 'master' into sync_msft_24062026

0710f53

LGTM

hdharpure9922 approved these changes Jun 24, 2026


Merge pull request #1160 from hdharpure9922/sync_msft_24062026

91d3f4b

Sync msft 24062026

hdharpure9922 merged commit 3fde1cf into ovep-develop Jun 24, 2026
7 of 9 checks passed

hdharpure9922 deleted the sync_msft_24062026 branch June 25, 2026 04:18

hdharpure9922 merged 13 commits into ovep-develop from sync_msft_24062026
