Sync with Microsoft ONNX Runtime - 24062026 #1159
Merged
### Description

- Split out C++ code into a separate C++ experimental header, `onnxruntime_experimental_cxx_api.h`. This header can also contain other auxiliary experimental API-related C++ code.
- Add throwing C++ experimental function accessors. The `Get_X_FnOrThrow()` variant throws an exception if the experimental API is unavailable in the build.

### Motivation and Context

Improve experimental API ergonomics for C++.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
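The throwing-accessor idea described above can be sketched as follows. This is only an illustration of the pattern, not the contents of `onnxruntime_experimental_cxx_api.h`: the names `ExperimentalFn`, `GetExampleFnOrNull`, and `GetExampleFnOrThrow`, and the `available` flag standing in for "compiled into this build", are all hypothetical.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical experimental entry point type (not an ORT type).
using ExperimentalFn = int (*)(int);

// Hypothetical implementation that exists only when the feature is built in.
inline int ExampleExperimentalImpl(int x) { return x + 1; }

// Hypothetical non-throwing accessor: returns nullptr when the experimental
// function is not available. `available` stands in for a build-time condition.
inline ExperimentalFn GetExampleFnOrNull(bool available) {
  return available ? &ExampleExperimentalImpl : nullptr;
}

// Hypothetical throwing accessor: same lookup, but converts "unavailable"
// into an exception so callers do not have to null-check the pointer.
inline ExperimentalFn GetExampleFnOrThrow(bool available) {
  ExperimentalFn fn = GetExampleFnOrNull(available);
  if (fn == nullptr) {
    throw std::runtime_error("experimental API is unavailable in this build");
  }
  return fn;
}
```

With this shape, a caller that requires the feature can write `GetExampleFnOrThrow(true)(1)` directly, while a caller that wants to degrade gracefully keeps using the null-returning variant.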
…oft#29218)

### Description

The `GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA` test (added in microsoft#29002) fed **fp32** inputs via `AddInput<float>`. The CUDA (and WebGPU) GroupQueryAttention kernels only register for `MLFloat16`/`BFloat16`, so the fp32 node silently fell back to the **CPU EP**, and the `_CUDA` test never actually exercised the CUDA kernel it is named for. This surfaced as a CI failure on the CUDA test leg after microsoft#29002 and microsoft#29046 merged.

This PR makes `RunGQAPackedQKVRotaryPrefill` feed **fp16** tensors when targeting CUDA EP, matching the existing `RunGQASharedKVFp16` convention and the test's own "loose enough for fp16 rounding" tolerance. The CPU code path is unchanged.

### Key Changes

- `RunGQAPackedQKVRotaryPrefill` now branches on the target EP:
  - CUDA EP: inputs/outputs use `MLFloat16` (converted via `ToFloat16`), so the node is placed on the real GPU kernel.
  - WebGPU/CPU EP: unchanged (`float`).
- Output is converted back to `float` for the existing comparison logic.

### Testing

- `onnxruntime_provider_test --gtest_filter='GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CUDA'` → **PASSED** (now runs on the CUDA fp16 kernel).
- Full `GroupQueryAttentionTest.*` suite → 47 passed, WebGPU-only tests skipped locally (no WebGPU EP), no regressions.

### Motivation and Context

Restores genuine CUDA kernel coverage for the right-padded rotary prefill scenario and fixes the CI failure. Related: microsoft#29002, microsoft#29046.
### Description

PR that introduced the issue: microsoft#29064

Fixes in this PR:

1) Add the relevant platform guard, which was previously missing, to some tests.
2) Move the new AVX512 headers that host the kernels to their correct location within the cmake file grouping. Previously they were placed in the AVX2 grouping (ultimately the TU that included those headers was compiled with AVX512 flags, so no harm was done). This fix is more pedantic than a fix for a real issue. The lone .cpp file in that list doesn't include any intrinsics manually, but the compiler might now use AVX512 for auto-vectorization with the shuffling. Since that file contains only the pre-packing functions that are used in production, this is safe. The "scalar" kernel implementation in that file is mostly a test oracle and nothing else.

Sample failed run (before PR): https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1268723&view=results

Sample successful run (with PR): https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1273196&view=results

### Motivation and Context

Fix Python packaging pipeline
## Description

Adds CUDA support for GroupQueryAttention QK-Norm by applying per-head Q/K RMSNorm before RoPE in the fused preprocess path. It also enables the pre-norm graph fusion for CUDA and allows non-quantized QK-Norm decode to use XQA, restoring the fast global decode path for GPT-OSS/Qwen-style shapes while keeping quantized-cache QK-Norm on the existing fallback path until scale handling is validated.

## Summary of Changes

### CUDA GroupQueryAttention

- Threads q_norm_weight / k_norm_weight and qk_norm_epsilon through CUDA GQA data/parameters.
- Applies FP32 per-head RMSNorm to Q/K in UnpackRoPEAppend before RoPE and KV append.
- Adds shared-KV Q-only normalization support.
- Enables non-quantized QK-Norm decode to route through XQA after the fused preprocess normalizes Q/K.
- Keeps quantized-cache QK-Norm decode gated off XQA pending normalized-K scale validation.

### Fusion and Schemas

- Enables GroupQueryAttentionPreNormFusion for CUDA and native WebGPU.
- Updates contrib operator schema text and generated ContribOperators.md to document CUDA/native WebGPU QK-Norm support.
- Updates CPU/JSEP rejection text for unsupported providers.

### Tests, Docs, and Profiling

- Adds CUDA optimizer coverage for the pre-norm fusion.
- Adds Python GQA QK-Norm parity coverage, including explicit FP16/BF16 XQA decode tests.
- Extends GQA profiling helpers with QK-Norm options and documents CUDA GQA behavior in docs/contrib_ops/cuda/gqa.md.

## Testing

- Built: `ninja onnxruntime_providers_cuda onnxruntime_test_all` in `build/cu130/Release`.
- Ran: `./onnxruntime_test_all --gtest_filter="GraphTransformationTests.GroupQueryAttentionPreNormFusion*"` (11 passed, 2 WebGPU skips).
- Ran: `python -m pytest test_gqa.py::TestGQAQKNorm::test_gqa_qk_norm_past_xqa test_gqa.py::TestGQAQKNorm::test_gqa_qk_norm_past_xqa_bf16 -q` (2 passed).
- Ran: `python -m pytest test_gqa.py -k "QKNorm" -q` (38 passed).
- Ran: `git diff --check`.
- Verified routing with `ORT_ENABLE_XQA=1 ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1`: FP16 and BF16 QK-Norm decode report `SdpaKernel=XQA`.
- Profiled GPT-OSS-like packed FP16 shape (`B=1,S=1,past=2048,N=64,Nkv=8,H=64,head_sink,QK-Norm`) with nsys: `H64::grp8_fp16::kernel_mha` averaged ~8.21 us and `UnpackRoPEAppend<half, half, 16, 64>` averaged ~2.94 us.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [x] No breaking changes
- [x] CI passes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…mbeddings (microsoft#29069)

### Description

Fixes NaN output in the CPU GQA kernel when running batched right-padded prefill. For padding token positions where `seq_causal_length > total_seqlen`, the softmax loop was reading beyond the GEMM-filled region of the attention probs buffer into uninitialized memory, producing NaN values that propagated through the V GEMM to the output.

**Root cause:** In `ComputeAttentionProbs`, `seq_causal_length = causal_past_seqlen + seq + 1` grows with each query position. For right-padded batches, a batch item with `real_len < sequence_length` has `total_seqlen = real_len`, but padding positions still iterate up to `sequence_length`, giving `seq_causal_length > total_seqlen`. The QK GEMM only fills columns `[0, total_seqlen)`; positions beyond that are uninitialized.

**Fix:** Cap the effective causal length at `total_seqlen` before computing the softmax window:

```cpp
// gqa_attention_base.h - both float and quantized paths
const size_t effective_causal_length = std::min(seq_causal_length, total_seqlen);
// use effective_causal_length for: local window check, start_offset, window_size, masking loops
```

Applied to both the non-quantized float path (~line 1097) and the quantized MLAS path (~line 436).

### Motivation and Context

The new test `GroupQueryAttentionTest.BatchedRightPaddedRotaryPrefill_CPU` (added in this PR) exercises batched GQA with heterogeneous real sequence lengths `{4, 2, 6}` padded to `sequence_length=6`. Batch item 1 (`real_len=2`) has padding tokens at positions 2–5; position 3 triggered the NaN via uninitialized attention probs memory.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
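To make the arithmetic behind the clamp in the right-padded prefill fix concrete, here is a minimal standalone sketch. `EffectiveCausalLength` is a hypothetical helper written for illustration, not a function in `gqa_attention_base.h`, and the example values assume a pure prefill with `causal_past_seqlen = 0`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (illustration only): number of attention-prob columns
// that are safe to read for query position `seq`.
//   seq_causal_length = causal_past_seqlen + seq + 1   (grows with seq)
//   total_seqlen      = columns actually filled by the QK GEMM
// Clamping keeps the softmax window inside the filled range [0, total_seqlen).
inline std::size_t EffectiveCausalLength(std::size_t causal_past_seqlen,
                                         std::size_t seq,
                                         std::size_t total_seqlen) {
  const std::size_t seq_causal_length = causal_past_seqlen + seq + 1;
  return std::min(seq_causal_length, total_seqlen);
}
```

For the batch item with `real_len=2` (so `total_seqlen=2`), the padding position `seq=3` has an unclamped causal length of 4, two columns past the filled region; the clamped value is 2. Real token positions are unaffected because their causal length never exceeds `total_seqlen`.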
### Description
ONNX added a `MaxPool-22` schema, but the DropQDQ selectors still
matched only `MaxPool-12`, so the `DequantizeLinear` / `QuantizeLinear`
around an opset-22 `MaxPool` were no longer dropped. This updates both
DropQDQ selector op-version maps to `{12, 22}`, matching the QDQ
propagation pass. (`MaxPool` stays pinned to integer-capable versions ≥
12.)
### Motivation and Context
Fixes microsoft#28770. Models that optimized to full integer at opset ≤ 21
regressed at opset 22+, blocking users from upgrading their opset.
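The effect of widening the `MaxPool` entry in the DropQDQ fix can be sketched with a simplified op-version map. `OpVersionsMap` and `MatchesSelector` below are hypothetical stand-ins written for illustration, not the ORT selector code; the convention that an empty version list accepts any version is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in: op type -> accepted "since" versions.
// Assumption of this sketch: an empty list means "any version".
using OpVersionsMap = std::unordered_map<std::string, std::vector<int>>;

// Returns true when a node of `op_type` at `since_version` would be picked up
// by a selector configured with `versions_by_op`.
inline bool MatchesSelector(const OpVersionsMap& versions_by_op,
                            const std::string& op_type, int since_version) {
  const auto it = versions_by_op.find(op_type);
  if (it == versions_by_op.end()) return false;
  const std::vector<int>& accepted = it->second;
  return accepted.empty() ||
         std::find(accepted.begin(), accepted.end(), since_version) !=
             accepted.end();
}
```

With a map holding only `{12}` for `MaxPool`, an opset-22 node is not matched, so the surrounding `DequantizeLinear` / `QuantizeLinear` pair stays in the graph; with `{12, 22}` both versions match.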
…x output in data propagation (microsoft#29084)

## Summary

A spec-valid `Shape → Gather(1-D index [-1]) → TopK` model fails to load since ORT 1.25.0 with:

```
K input must be a one-dimensional tensor of size 1.
```

The model is valid: a rank-1 (single-element) Gather index produces a rank-1 Gather output, so the value feeding TopK's `K` input is a 1-D size-1 tensor, which is exactly what TopK requires. The failure was an **ORT rank-preservation bug in shape-inference data propagation**, not a problem with the model.

**Root cause.** `GatherOpDataPropagation::infer()` routed by element **count** rather than index **rank**: it guarded on `indices.size() == 1`, which is true for *both* a 0-D scalar index and a 1-D single-element index, and then unconditionally called `SetInferredShapeScalarValue()`. That dropped the rank of the spec-valid 1-D size-1 case, so `Graph::getInputData()` emitted a 0-D (dimensionless) propagated value. ONNX TopK shape inference then correctly rejected the 0-D `K`. This path was introduced by microsoft#26269 (partial data propagation to enhance shape inference).

This reproduces even at `GraphOptimizationLevel.ORT_DISABLE_ALL`, where constant folding never runs, confirming that the cause is data propagation in shape inference, **not** constant folding (microsoft#26345 was an earlier mis-attribution; see the corrected analysis).

Fixes the regression reported in microsoft#29072. Corrected root-cause analysis: microsoft#29072 (comment)

## The fix

- **Gather: rank-based routing.** Distinguish the index rank instead of its element count. A genuine 0-D scalar index still stores a scalar value; a rank-1 single-element index now stores a **rank-1** value, so `getInputData()` emits a `TensorProto` with `dims=[1]` and downstream TopK sees a valid 1-D size-1 `K`. The index rank is taken from the same constant initializer the index value comes from (via `get_initialized_input_values` now reporting the initializer rank), rather than a second, independently-resolved `NodeArg` shape, which removes a potential source-of-truth drift (EDGE #2).
- **Rank-tolerant elementwise companion (Add/Sub/Mul/Div).** These ops were scalar-only and would silently stop propagating once an operand became a rank-1 value (e.g. a `Shape → Gather(1-D idx) → Mul → TopK` chain), because the custom-propagation result replaces ONNX's rank-correct fallback. They now accept a single element carried as either a rank-0 scalar or a rank-1 `[1]` value and keep the output rank consistent with ONNX broadcasting (rank-1 if any operand is rank-1, else scalar), so such chains keep propagating end-to-end. Div additionally guards against division by zero.
- **Shared helper (`data_propagation_value_utils.h`).** Centralizes reading/writing a single-element shape value while preserving its rank, used by both the Gather producer and the elementwise consumers so they cannot disagree on rank. The reader **declines** a rank-1 multi-element value (it must never collapse to `element[0]`), so a multi-element value can never be mistaken for a single one.

## Testing

Five `ShapeInferenceV2Test` cases (with fixtures + generators), all loading the model at **every** optimization level (including `ORT_DISABLE_ALL`):

- `GatherToTopKRankPreservationTest`: the core `Shape → Gather([-1]) → TopK` regression; asserts the rank-1 `K` is preserved.
- `GatherMulToTopKRankPreservationTest`: the `… → Gather(1-D idx) → Mul → TopK` chain; asserts propagation survives the elementwise op.
- `SinglePropagatedShapeValueGuardTest`: a direct unit test pinning the shared reader's behavior on each channel (scalar, rank-1 single-element, rank-1 multi-element, symbolic, empty). **Mutation-proven**: relaxing the `dim_size()==1` guard makes this test fail, restoring it makes it pass, so the guard the whole fix hinges on is test-locked.
- `ShapeMulMultiElementNoScalarCollapseTest`: end-to-end check that a multi-element `Shape → Mul → ConstantOfShape` chain still resolves to its full rank-2 shape (no bogus scalar collapse).
- `PartialDataPropagationTest`: pre-existing scalar-index coverage, unchanged.

Full `onnxruntime_test_all` suite passes (0 failures) on top of the current `main` (opset-27 / ONNX 1.22.0 integration). The constant-folding memory path (microsoft#26345) is untouched; the diff is confined to `data_propagation/`, a small `graph.cc` change, and tests.

## Follow-ups (intentionally out of scope for this PR)

- Hardening for a rank ≥ 2 single-element index (e.g. shape `[1,1]`) to *decline* rather than route as rank-1. This needs its own discriminating unit test; it is pathological/non-exporter, and the worst case is degraded inference rather than a crash.
- Explicit end-to-end coverage for Add/Sub/Div rank-1 chains (the shared-reader unit test already covers the read path for all four ops; only Mul is currently exercised end-to-end).
- Minor readability nits.

## DCO

Commit is DCO signed-off.

---------

Signed-off-by: titaiwangms <titaiwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
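The rank-preserving read and the "rank-1 if any operand is rank-1, else scalar" rule from the data-propagation fix can be sketched as below. Everything here is a hypothetical simplification for illustration: `PropagatedValue`, `SingleElement`, `ReadSingleElement`, and `MulSingleElements` are not the types or functions in `data_propagation_value_utils.h`. As a further assumption, this sketch already declines a rank ≥ 2 single-element value, which the PR lists as a follow-up rather than current behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a propagated shape value.
struct PropagatedValue {
  std::vector<std::int64_t> dims;    // empty => rank-0 scalar
  std::vector<std::int64_t> values;  // flattened elements
};

// A single element together with the rank it was carried in.
struct SingleElement {
  std::int64_t value;
  bool is_rank1;
};

// Reads a single element while remembering its rank. Declines (nullopt) for
// empty or multi-element values, so a multi-element value is never collapsed
// to element[0].
inline std::optional<SingleElement> ReadSingleElement(const PropagatedValue& v) {
  if (v.values.size() != 1) return std::nullopt;
  if (v.dims.empty()) return SingleElement{v.values[0], false};
  if (v.dims.size() == 1 && v.dims[0] == 1) return SingleElement{v.values[0], true};
  return std::nullopt;  // rank >= 2: declined in this sketch
}

// Elementwise Mul over two single-element values. The output is rank-1 if
// either operand is rank-1, otherwise a scalar.
inline std::optional<PropagatedValue> MulSingleElements(const PropagatedValue& a,
                                                        const PropagatedValue& b) {
  const auto x = ReadSingleElement(a);
  const auto y = ReadSingleElement(b);
  if (!x || !y) return std::nullopt;
  PropagatedValue out;
  if (x->is_rank1 || y->is_rank1) out.dims = {1};
  out.values = {x->value * y->value};
  return out;
}
```

Returning `std::optional` makes "decline to propagate" an explicit outcome that callers must handle, rather than a silently wrong scalar.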
… empty (microsoft#28733)

### Description

In `PadNodeGroupSelector::Check`, instead of calling `CheckQDQNodes` with:

```cpp
int num_dq_inputs = static_cast<int>(dq_nodes.size());
// ...
CheckQDQNodes(graph_viewer, node, redundant_clip_node, dq_nodes, q_nodes, num_dq_inputs)
```

it should be called with:

```cpp
CheckQDQNodes(graph_viewer, node, redundant_clip_node, dq_nodes, q_nodes)
```

because otherwise nothing checks for the case where `dq_nodes` is empty.

### Motivation and Context

See issue microsoft#28717 for reference. When checking a Pad node for fusion with Q-DQ nodes in `PadNodeGroupSelector::Check`, if there is no upstream DQ node, `onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc`:900 will provoke a segmentation fault:

```cpp
const int32_t dt_input_1 = dq_nodes[0]->InputDefs()[0]->TypeAsProto()->tensor_type().elem_type();
```

because nothing checks whether `dq_nodes` is empty beforehand.
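The failure mode in the Pad selector fix is indexing element 0 of an empty vector. The sketch below shows the general guard in isolation; it is not the ORT fix itself (which relies on `CheckQDQNodes` performing the check), and `FakeDqNode` and `FirstDqInputTypeIs` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a DequantizeLinear node; only the input element
// type matters for this sketch.
struct FakeDqNode {
  std::int32_t input_elem_type;
};

// Returns true only if there is at least one DQ node and its input element
// type equals `expected`. The emptiness check must come first: reading
// dq_nodes[0] on an empty vector is undefined behavior.
inline bool FirstDqInputTypeIs(const std::vector<const FakeDqNode*>& dq_nodes,
                               std::int32_t expected) {
  if (dq_nodes.empty()) return false;
  return dq_nodes[0]->input_elem_type == expected;
}
```

A selector built this way rejects a Pad node with no upstream DQ node instead of dereferencing a nonexistent element.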
This pull request refactors the `CreateGetVectorOfMapsStringFloat` test in `test_nontensor_types.cc` to improve type safety and clarity by switching from `std::vector` to `std::array` for fixed-size data and updating tensor creation logic. The changes also fix the string comparison to use the correct array.

**Test improvements and type safety:**

* Replaced `std::vector` with `std::array` for `dims` and `values` variables, making the code safer and more efficient for fixed-size data.
* Updated the creation of the string tensor by using `Ort::AllocatorWithDefaultOptions()` and filling the tensor with `FillStringTensor`, aligning with best practices for string tensor creation.

**Test correctness:**

* Fixed the assertion to compare against `keys_arr` instead of the undefined `keys` vector, ensuring the test checks the correct set of keys.

**Header inclusion:**

* Added `#include <array>` at the top of the file to support the use of `std::array`.
### Description
The `coreml_proto` target must be properly exported to
`onnxruntimeTargets`. Its include directories must not contain any paths
from the build tree during the install phase.
### Motivation and Context
This change fixes a build failure on macOS when attempting to create a
static library of ONNX Runtime with the CoreML Execution Provider (EP)
using the following command:
```sh
./build.sh --use_coreml --cmake_extra_defines CMAKE_POLICY_VERSION_MINIMUM=3.5
```
**Note**: `CMAKE_POLICY_VERSION_MINIMUM=3.5` is required in a modern
macOS environment. The current version of CMake available on Homebrew is
`4.3.1`, and it does not allow `cmake_minimum_required(3.5)`. Ignoring
`cmake_minimum_required(3.5)` in the ONNX Runtime tree does not appear
to be harmful.
The build fails because `coreml_proto` has no export sets specified and
it violates CMake's requirement that exported targets must not reference
paths from the build tree when installed. Currently, `coreml_proto`
depends on `${CMAKE_CURRENT_BINARY_DIR}` to locate generated Protobuf
definitions. I suppose these definitions are required only during the
build phase and not needed for the installation.
This change set resolves it by:
- Exporting `coreml_proto` to `${PROJECT_NAME}Targets`.
- Replacing `${CMAKE_CURRENT_BINARY_DIR}` with
`$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}>`.
Signed-off-by: Kaito Udagawa <umireon@kaito.tokyo>
Sync msft 24062026
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.