Sync with Microsoft ONNX Runtime - 24062026 #1159

Merged
hdharpure9922 merged 13 commits into ovep-develop from sync_msft_24062026 on Jun 24, 2026
@ai-fw-intg
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

edgchen1 and others added 11 commits June 22, 2026 13:56
### Description
<!-- Describe your changes. -->

- Split out C++ code into separate C++ experimental header
`onnxruntime_experimental_cxx_api.h`. This can also contain other
auxiliary experimental API-related C++ code.
- Add throwing C++ experimental function accessors. The
`Get_X_FnOrThrow()` variant throws an exception if the experimental API
is unavailable in the build.
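
The throwing-accessor idea can be sketched as follows. This is a minimal illustration of the pattern only: `ExampleFn`, `ExampleImpl`, `GetExampleFn`, and `GetExampleFnOrThrow` are hypothetical names chosen for this sketch and are not taken from `onnxruntime_experimental_cxx_api.h`.

```cpp
#include <stdexcept>

// Hypothetical experimental function signature (not an actual ONNX Runtime API).
using ExampleFn = int (*)(int);

// Hypothetical implementation that exists only when the experimental API is built in.
inline int ExampleImpl(int x) { return x + 1; }

// Non-throwing accessor: returns nullptr when the API is unavailable.
// The `available` flag stands in for whatever build-time check a real accessor performs.
inline ExampleFn GetExampleFn(bool available) {
  return available ? &ExampleImpl : nullptr;
}

// Throwing variant: callers get a usable pointer or an exception, never nullptr.
inline ExampleFn GetExampleFnOrThrow(bool available) {
  ExampleFn fn = GetExampleFn(available);
  if (fn == nullptr) {
    throw std::runtime_error("experimental API 'Example' is unavailable in this build");
  }
  return fn;
}
```

With a throwing accessor, call sites can skip the null check and let the exception surface the missing feature.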

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Improve experimental API ergonomics for C++.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…oft#29218)

### Description

The `GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA` test
(added in microsoft#29002) fed **fp32** inputs via `AddInput<float>`. The CUDA
(and WebGPU) GroupQueryAttention kernels only register for
`MLFloat16`/`BFloat16`, so the fp32 node silently fell back to the **CPU
EP** — the `_CUDA` test never actually exercised the CUDA kernel it is
named for. This surfaced as a CI failure on the CUDA test leg after
microsoft#29002 and microsoft#29046 merged.

This PR makes `RunGQAPackedQKVRotaryPrefill` feed **fp16** tensors when
targeting CUDA EP, matching the existing `RunGQASharedKVFp16` convention
and the test's own "loose enough for fp16 rounding" tolerance. The CPU
code path is unchanged.

### Key Changes

- `RunGQAPackedQKVRotaryPrefill` now branches on the target EP:
  - CUDA EP: inputs/outputs use `MLFloat16` (converted via `ToFloat16`),
    so the node is placed on the real GPU kernel.
  - WebGPU/CPU EP: unchanged (`float`).
- Output is converted back to `float` for the existing comparison logic.

### Testing

- `onnxruntime_provider_test
--gtest_filter='GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA'`
→ **PASSED** (now runs on the CUDA fp16 kernel).
- Full `GroupQueryAttentionTest.*` suite → 47 passed, WebGPU-only tests
skipped locally (no WebGPU EP), no regressions.

### Motivation and Context

Restores genuine CUDA kernel coverage for the right-padded rotary
prefill scenario and fixes the CI failure. Related: microsoft#29002, microsoft#29046.
### Description

PR that introduced issue:
microsoft#29064

Fixes in this PR:

1) Add the relevant platform guards that were previously missing from
some tests.

2) Move the new AVX512 headers that host the kernels to their correct
group in the cmake file. They were previously placed in the AVX2 group.
The TUs that included those headers were ultimately compiled with AVX512
flags, so no harm was done, and this fix is more pedantic than a fix for
a real issue. The lone .cpp file in that list doesn't use any intrinsics
manually, but the compiler might now use AVX512 for auto-vectorization
with the shuffling. Since that file contains only the pre-packing
functions that are used in production, it is safe. The "scalar" kernel
implementation in that file is mostly a test oracle and nothing else.

Sample failed run (before PR):
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1268723&view=results
Sample successful run (with PR):
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1273196&view=results

### Motivation and Context
Fix Python packaging pipeline
## Description

Adds CUDA support for GroupQueryAttention QK-Norm by applying per-head
Q/K RMSNorm before RoPE in the fused preprocess path. It also enables
the pre-norm graph fusion for CUDA and allows non-quantized QK-Norm
decode to use XQA, restoring the fast global decode path for
GPT-OSS/Qwen-style shapes while keeping quantized-cache QK-Norm on the
existing fallback path until scale handling is validated.

## Summary of Changes

### CUDA GroupQueryAttention

- Threads q_norm_weight / k_norm_weight and qk_norm_epsilon through CUDA
GQA data/parameters.
- Applies FP32 per-head RMSNorm to Q/K in UnpackRoPEAppend before RoPE
and KV append.
- Adds shared-KV Q-only normalization support.
- Enables non-quantized QK-Norm decode to route through XQA after the
fused preprocess normalizes Q/K.
- Keeps quantized-cache QK-Norm decode gated off XQA pending
normalized-K scale validation.

### Fusion and Schemas

- Enables GroupQueryAttentionPreNormFusion for CUDA and native WebGPU.
- Updates contrib operator schema text and generated ContribOperators.md
to document CUDA/native WebGPU QK-Norm support.
- Updates CPU/JSEP rejection text for unsupported providers.

### Tests, Docs, and Profiling

- Adds CUDA optimizer coverage for the pre-norm fusion.
- Adds Python GQA QK-Norm parity coverage, including explicit FP16/BF16
XQA decode tests.
- Extends GQA profiling helpers with QK-Norm options and documents CUDA
GQA behavior in docs/contrib_ops/cuda/gqa.md.

## Testing

- Built: `ninja onnxruntime_providers_cuda onnxruntime_test_all` in
`build/cu130/Release`.
- Ran: `./onnxruntime_test_all
--gtest_filter="GraphTransformationTests.GroupQueryAttentionPreNormFusion*"`
(11 passed, 2 WebGPU skips).
- Ran: `python -m pytest
test_gqa.py::TestGQAQKNorm::test_gqa_qk_norm_past_xqa
test_gqa.py::TestGQAQKNorm::test_gqa_qk_norm_past_xqa_bf16 -q` (2
passed).
- Ran: `python -m pytest test_gqa.py -k "QKNorm" -q` (38 passed).
- Ran: `git diff --check`.
- Verified routing with `ORT_ENABLE_XQA=1
ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1`: FP16 and BF16 QK-Norm decode
report `SdpaKernel=XQA`.
- Profiled GPT-OSS-like packed FP16 shape
(`B=1,S=1,past=2048,N=64,Nkv=8,H=64,head_sink,QK-Norm`) with nsys:
`H64::grp8_fp16::kernel_mha` averaged ~8.21 us and
`UnpackRoPEAppend<half, half, 16, 64>` averaged ~2.94 us.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [x] No breaking changes
- [x] CI passes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…mbeddings (microsoft#29069)

### Description

Fixes NaN output in the CPU GQA kernel when running batched right-padded
prefill. For padding token positions where `seq_causal_length >
total_seqlen`, the softmax loop was reading beyond the GEMM-filled
region of the attention probs buffer into uninitialized memory,
producing NaN values that propagated through the V GEMM to the output.

**Root cause:** In `ComputeAttentionProbs`, `seq_causal_length =
causal_past_seqlen + seq + 1` grows with each query position. For
right-padded batches, a batch item with `real_len < sequence_length` has
`total_seqlen = real_len`, but padding positions still iterate up to
`sequence_length`, giving `seq_causal_length > total_seqlen`. The QK
GEMM only fills columns `[0, total_seqlen)` — positions beyond that are
uninitialized.

**Fix:** Cap the effective causal length at `total_seqlen` before
computing the softmax window:

```cpp
// gqa_attention_base.h - both float and quantized paths
const size_t effective_causal_length = std::min(seq_causal_length, total_seqlen);
// use effective_causal_length for: local window check, start_offset, window_size, masking loops
```

Applied to both the non-quantized float path (~line 1097) and the
quantized MLAS path (~line 436).
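
To make the effect of the cap concrete, here is a small standalone sketch. It is not code from `gqa_attention_base.h`: `EffectiveCausalLength` and `WindowLengths` are hypothetical helpers, and the sketch assumes a pure prefill with `causal_past_seqlen = 0`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: length of the softmax window for query position `seq`,
// capped at the number of columns the QK GEMM actually filled.
inline std::size_t EffectiveCausalLength(std::size_t causal_past_seqlen,
                                         std::size_t seq,
                                         std::size_t total_seqlen) {
  const std::size_t seq_causal_length = causal_past_seqlen + seq + 1;
  return std::min(seq_causal_length, total_seqlen);
}

// Window length for every query position of one batch item
// (assumes causal_past_seqlen == 0, i.e. prefill with no past KV).
inline std::vector<std::size_t> WindowLengths(std::size_t sequence_length,
                                              std::size_t total_seqlen) {
  std::vector<std::size_t> lengths;
  lengths.reserve(sequence_length);
  for (std::size_t seq = 0; seq < sequence_length; ++seq) {
    lengths.push_back(EffectiveCausalLength(0, seq, total_seqlen));
  }
  return lengths;
}
```

For a batch item with `real_len = 2` padded to `sequence_length = 6`, the uncapped lengths would be 1 through 6, while the capped lengths stay at 2 from position 1 onward, so the softmax never reads past the two filled columns.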

### Motivation and Context

The new test
`GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CPU` (added in
this PR) exercises batched GQA with heterogeneous real sequence lengths
`{4, 2, 6}` padded to `sequence_length=6`. Batch item 1 (`real_len=2`)
has padding tokens at positions 2–5; position 3 triggered the NaN via
uninitialized attention probs memory.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
### Description

ONNX added a `MaxPool-22` schema, but the DropQDQ selectors still
matched only `MaxPool-12`, so the `DequantizeLinear` / `QuantizeLinear`
around an opset-22 `MaxPool` were no longer dropped. This updates both
DropQDQ selector op-version maps to `{12, 22}`, matching the QDQ
propagation pass. (`MaxPool` stays pinned to integer-capable versions ≥
12.)
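
The version-matching behavior described above can be illustrated with a simplified model. The `OpVersions` alias and `SelectorAccepts` function below are hypothetical stand-ins written for this sketch, not the actual ONNX Runtime selector types.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a selector's op-version map:
// op type -> list of accepted "since" versions.
using OpVersions = std::unordered_map<std::string, std::vector<int>>;

// A node is selected only if its op type is listed and its since-version matches.
inline bool SelectorAccepts(const OpVersions& versions,
                            const std::string& op_type,
                            int since_version) {
  const auto it = versions.find(op_type);
  if (it == versions.end()) return false;
  return std::find(it->second.begin(), it->second.end(), since_version) != it->second.end();
}

// Map contents before and after the change described above.
inline OpVersions MaxPoolBeforeFix() { return {{"MaxPool", {12}}}; }
inline OpVersions MaxPoolAfterFix() { return {{"MaxPool", {12, 22}}}; }
```

Under this model, an opset-22 `MaxPool` is rejected by the old map and accepted by the new one, while versions below 12 stay excluded in both.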

### Motivation and Context

Fixes microsoft#28770. Models that optimized to full integer at opset ≤ 21
regressed at opset 22+, blocking users from upgrading their opset.
…x output in data propagation (microsoft#29084)

## Summary

A spec-valid `Shape → Gather(1-D index [-1]) → TopK` model fails to load
since ORT 1.25.0 with:

```
K input must be a one-dimensional tensor of size 1.
```

The model is valid: a rank-1 (single-element) Gather index produces a
rank-1 Gather output, so the value feeding TopK's `K` input is a 1-D
size-1 tensor — exactly what TopK requires. The failure was an **ORT
rank-preservation bug in shape-inference data propagation**, not a
problem with the model.

**Root cause.** `GatherOpDataPropagation::infer()` routed by element
**count** rather than index **rank**: it guarded on `indices.size() ==
1`, which is true for *both* a 0-D scalar index and a 1-D single-element
index, and then unconditionally called `SetInferredShapeScalarValue()`.
That dropped the rank of the spec-valid 1-D size-1 case, so
`Graph::getInputData()` emitted a 0-D (dimensionless) propagated value.
ONNX TopK shape inference then correctly rejected the 0-D `K`. This path
was introduced by microsoft#26269 (partial data propagation to enhance shape
inference).

This reproduces even at `GraphOptimizationLevel.ORT_DISABLE_ALL`, where
constant folding never runs — confirming the cause is data propagation
in shape inference, **not** constant folding (microsoft#26345 was an earlier
mis-attribution; see the corrected analysis).

Fixes the regression reported in microsoft#29072. Corrected root-cause analysis:
microsoft#29072 (comment)

## The fix

- **Gather — rank-based routing.** Distinguish the index rank instead of
its element count. A genuine 0-D scalar index still stores a scalar
value; a rank-1 single-element index now stores a **rank-1** value, so
`getInputData()` emits a `TensorProto` with `dims=[1]` and downstream
TopK sees a valid 1-D size-1 `K`. The index rank is taken from the same
constant initializer the index value comes from (via
`get_initialized_input_values` now reporting the initializer rank),
rather than a second, independently-resolved `NodeArg` shape — removing
a potential source-of-truth drift (EDGE #2).
- **Rank-tolerant elementwise companion (Add/Sub/Mul/Div).** These ops
were scalar-only and would silently stop propagating once an operand
became a rank-1 value (e.g. a `Shape → Gather(1-D idx) → Mul → TopK`
chain), because the custom-propagation result replaces ONNX's
rank-correct fallback. They now accept a single element carried as
either a rank-0 scalar or a rank-1 `[1]` value and keep the output rank
consistent with ONNX broadcasting (rank-1 if any operand is rank-1, else
scalar), so such chains keep propagating end-to-end. Div additionally
guards against division by zero.
- **Shared helper (`data_propagation_value_utils.h`).** Centralizes
reading/writing a single-element shape value while preserving its rank,
used by both the Gather producer and the elementwise consumers so they
cannot disagree on rank. The reader **declines** a rank-1 multi-element
value (it must never collapse to `element[0]`), so a multi-element value
can never be mistaken for a single one.
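
The rank rules listed above can be sketched in a few lines of C++17. This is an illustrative model only: `PropagatedValue`, `ReadSingleElement`, and `MulSingle` are hypothetical names and do not reproduce the contents of `data_propagation_value_utils.h`.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a propagated shape value.
// dims empty => rank-0 scalar; dims == {n} => rank-1 tensor with n elements.
struct PropagatedValue {
  std::vector<int64_t> dims;
  std::vector<int64_t> values;
};

// Reader: yields the element for a rank-0 scalar or a rank-1 [1] value,
// and declines everything else (including rank-1 multi-element values).
inline std::optional<int64_t> ReadSingleElement(const PropagatedValue& v) {
  const bool is_scalar = v.dims.empty() && v.values.size() == 1;
  const bool is_rank1_single =
      v.dims.size() == 1 && v.dims[0] == 1 && v.values.size() == 1;
  if (is_scalar || is_rank1_single) return v.values[0];
  return std::nullopt;
}

// Elementwise Mul on single-element operands. The output is rank-1 if any
// operand is rank-1, else scalar, mirroring the broadcasting rule above.
inline std::optional<PropagatedValue> MulSingle(const PropagatedValue& a,
                                                const PropagatedValue& b) {
  const auto x = ReadSingleElement(a);
  const auto y = ReadSingleElement(b);
  if (!x || !y) return std::nullopt;  // stop propagating rather than guess
  PropagatedValue out;
  if (!a.dims.empty() || !b.dims.empty()) out.dims = {1};
  out.values = {*x * *y};
  return out;
}
```

Because the producer and the consumers go through the same reader, a rank-1 `[1]` value keeps its rank through a `Mul`, and a multi-element value is never reduced to its first element.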

## Testing

Five `ShapeInferenceV2Test` cases (with fixtures + generators), all
loading the model at **every** optimization level (including
`ORT_DISABLE_ALL`):

- `GatherToTopKRankPreservationTest` — the core `Shape → Gather([-1]) →
TopK` regression; asserts the rank-1 `K` is preserved.
- `GatherMulToTopKRankPreservationTest` — the `… → Gather(1-D idx) → Mul
→ TopK` chain; asserts propagation survives the elementwise op.
- `SinglePropagatedShapeValueGuardTest` — a direct unit test pinning the
shared reader's behavior on each channel (scalar, rank-1 single-element,
rank-1 multi-element, symbolic, empty). **Mutation-proven**: relaxing
the `dim_size()==1` guard makes this test fail, restoring it makes it
pass — so the guard the whole fix hinges on is test-locked.
- `ShapeMulMultiElementNoScalarCollapseTest` — end-to-end check that a
multi-element `Shape → Mul → ConstantOfShape` chain still resolves to
its full rank-2 shape (no bogus scalar collapse).
- `PartialDataPropagationTest` — pre-existing scalar-index coverage,
unchanged.

Full `onnxruntime_test_all` suite passes (0 failures) on top of the
current `main` (opset-27 / ONNX 1.22.0 integration). The
constant-folding memory path (microsoft#26345) is untouched — the diff is
confined to `data_propagation/`, a small `graph.cc` change, and tests.

## Follow-ups (intentionally out of scope for this PR)

- Hardening for a rank ≥ 2 single-element index (e.g. shape `[1,1]`) to
*decline* rather than route as rank-1 — needs its own discriminating
unit test; pathological/non-exporter, worst case is degraded inference
rather than a crash.
- Explicit end-to-end coverage for Add/Sub/Div rank-1 chains (the
shared-reader unit test already covers the read path for all four ops;
only Mul is currently exercised end-to-end).
- Minor readability nits.

## DCO

Commit is DCO signed-off.

---------

Signed-off-by: titaiwangms <titaiwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… empty (microsoft#28733)

### Description
In `PadNodeGroupSelector::Check`, instead of calling `CheckQDQNodes`
with:
```cpp
int num_dq_inputs = static_cast<int>(dq_nodes.size());
// ...
CheckQDQNodes(graph_viewer, node, redundant_clip_node, dq_nodes, q_nodes, num_dq_inputs)
```
it should be called with:
```cpp
CheckQDQNodes(graph_viewer, node, redundant_clip_node, dq_nodes, q_nodes)
```
because otherwise nothing checks for the case where `dq_nodes` is
empty.

### Motivation and Context
See issue microsoft#28717 for reference.
When checking a Pad node for fusion with Q-DQ nodes in
`PadNodeGroupSelector::Check`, if there is no upstream DQ node,
`onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc`:900
causes a segmentation fault:
```cpp
const int32_t dt_input_1 = dq_nodes[0]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
```
because nothing checks whether `dq_nodes` is empty
beforehand.
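
The failure mode is ordinary unchecked indexing of an empty vector. The sketch below shows the general guard in isolation; `FakeDqNode` and `FirstDqInputType` are hypothetical names for illustration, and the PR itself fixes the issue by changing the `CheckQDQNodes` call rather than by adding a helper like this.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a DequantizeLinear node; only the field needed here.
struct FakeDqNode {
  int32_t input_elem_type;
};

// Returns the element type of the first DQ node's input, or std::nullopt
// when there is no upstream DQ node, instead of dereferencing dq_nodes[0].
inline std::optional<int32_t> FirstDqInputType(
    const std::vector<const FakeDqNode*>& dq_nodes) {
  if (dq_nodes.empty() || dq_nodes[0] == nullptr) {
    return std::nullopt;  // nothing to fuse with; the caller should reject the group
  }
  return dq_nodes[0]->input_elem_type;
}
```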
This pull request refactors the `CreateGetVectorOfMapsStringFloat` test
in `test_nontensor_types.cc` to improve type safety and clarity by
switching from `std::vector` to `std::array` for fixed-size data and
updating tensor creation logic. The changes also fix the string
comparison to use the correct array.

**Test improvements and type safety:**

* Replaced `std::vector` with `std::array` for `dims` and `values`
variables, making the code safer and more efficient for fixed-size data.
* Updated the creation of the string tensor by using
`Ort::AllocatorWithDefaultOptions()` and filling the tensor with
`FillStringTensor`, aligning with best practices for string tensor
creation.

**Test correctness:**

* Fixed the assertion to compare against `keys_arr` instead of the
undefined `keys` vector, ensuring the test checks the correct set of
keys.

**Header inclusion:**

* Added `#include <array>` at the top of the file to support the use of
`std::array`.
### Description

The `coreml_proto` target must be properly exported to
`onnxruntimeTargets`. Its include directories must not contain any paths
from the build tree during the install phase.

### Motivation and Context

This change fixes a build failure on macOS when attempting to create a
static library of ONNX Runtime with the CoreML Execution Provider (EP)
using the following command:

```sh
./build.sh --use_coreml --cmake_extra_defines CMAKE_POLICY_VERSION_MINIMUM=3.5
```

**Note**: `CMAKE_POLICY_VERSION_MINIMUM=3.5` is required in modern
macOS environments. The current version of CMake available on Homebrew
is `4.3.1`, and it won't allow `cmake_minimum_required(3.5)`. It seems
that ignoring `cmake_minimum_required(3.5)` in the ONNX Runtime tree is
not harmful.

The build fails because `coreml_proto` has no export set specified and
because it violates CMake's requirement that exported targets must not
reference paths from the build tree when installed. Currently,
`coreml_proto` depends on `${CMAKE_CURRENT_BINARY_DIR}` to locate the
generated Protobuf definitions. I suppose these definitions are required
only during the build phase and are not needed for installation.

This change set resolves it by:

- Exporting `coreml_proto` to `${PROJECT_NAME}Targets`.
- Replacing `${CMAKE_CURRENT_BINARY_DIR}` with
`$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}>`.

Signed-off-by: Kaito Udagawa <umireon@kaito.tokyo>

@hdharpure9922 hdharpure9922 left a comment

LGTM

@hdharpure9922 hdharpure9922 merged commit 3fde1cf into ovep-develop Jun 24, 2026
7 of 9 checks passed
@hdharpure9922 hdharpure9922 deleted the sync_msft_24062026 branch June 25, 2026 04:18