Sync with Microsoft ONNX Runtime - 25062026 by ai-fw-intg · Pull Request #1161 · intel/onnxruntime

ai-fw-intg · 2026-06-24T20:34:03Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

## Description

The XQA decode kernel previously fell back to FlashDecode whenever a local (sliding) attention window was configured, so GPT-OSS / Mistral / Gemma2 style models could not use the faster XQA path on their sliding-window layers. This PR wires `local_window_size` through the fp16/bf16 XQA kernels so they serve both global and sliding-window attention, and adds parity tests that confirm the new path is exercised.

## Summary of Changes

### Sliding-window XQA kernel

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Drop the `local_window_size == -1` gate for XQA path; keep INT8/FP8 variants global-only via a new `is_global_attention` guard. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu` | Pass `parameters.local_window_size` into `ExtremeDecoding`. |
| `onnxruntime/contrib_ops/cuda/bert/xqa/xqa_impl_gen.cuh` | Map ORT `local_window_size` (`-1` → `max_seq_len`, else the value) to XQA `slidingWinSize`, guarded by `#if SLIDING_WINDOW`. |
| `onnxruntime/contrib_ops/cuda/bert/xqa/xqa_loader.h`, `xqa_loader_fp16*.{cu,cuh}`, `xqa_loader_bf16*.{cu,cuh}` | Thread a new `local_window_size` parameter through the launch path; enable `#define SLIDING_WINDOW 1` in the fp16/bf16 impl headers. |

Global attention (`local_window_size == -1`) maps to a window `>= max_seq_len`, so the kernel's runtime masking guard is never taken; the result is numerically identical to the previous global-only behavior with zero added overhead.

### Tests and profiling

- `onnxruntime/test/python/transformers/test_gqa.py`: new `TestXQASlidingWindowParity` class and `gqa_xqa_sliding_window_test_cases()` generator (fp16/bf16 × head_size {64, 128} × group {4, 8} × past/window relationships × with/without head_sink), forcing `ORT_ENABLE_XQA=1` and checking parity against the reference.
- `onnxruntime/test/python/transformers/profile_gqa.sh`: add a `--gpt-oss` preset and a `--compare-xqa` mode that profiles XQA vs FlashDecode for the same shape.

### Documentation

- `docs/contrib_ops/cuda/gqa.md` (new) replaces `docs/contrib_ops/gqa.md`, documenting the CUDA GroupQueryAttention backends and dispatch.

## Testing

- `cd onnxruntime/test/python/transformers && PYTHONPATH=<build_dir> python test_gqa.py TestXQASlidingWindowParity`: all 32 cases pass on H200 (SM90).
- Kernel selection verified via `ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1` (`SdpaKernel=XQA`) and an `nsys` trace showing `H64::grp4_fp16::kernel_mha` launches instead of `flash_fwd_splitkv_kernel`.

## Motivation and Context

GPT-OSS-20B has 12 sliding-window layers (`local_window_size=128`, head_sink, fp16, 64 q / 8 kv heads, head_size 64). On H200 single-token decode the XQA kernel is ~2.2× faster than FlashDecode on these shapes, so enabling XQA for the sliding-window layers improves end-to-end decode latency.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [x] No breaking changes (global-only behavior preserved; quantized paths unchanged)
- [x] CI passes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
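The window mapping described in the XQA change can be sketched in a few lines of Python. This is an illustration only, not the kernel code: the helper names `effective_window` and `visible_keys` are hypothetical, and the off-by-one convention (the window counts the current token) is an assumption that should be checked against the operator documentation.

```python
def effective_window(local_window_size: int, max_seq_len: int) -> int:
    """Map ORT's local_window_size to a window length.

    -1 means global attention, which is modelled as a window that
    covers the whole sequence (assumption: window == max_seq_len).
    """
    return max_seq_len if local_window_size == -1 else local_window_size


def visible_keys(total_seq_len: int, window: int) -> range:
    """Key positions a single decode token may attend to.

    The decode token sits at position total_seq_len - 1. Assumption:
    the window includes the current token, so the last `window`
    positions are visible.
    """
    start = max(0, total_seq_len - window)
    return range(start, total_seq_len)


# Global attention: the window never clips, so every past key is visible.
global_window = effective_window(-1, max_seq_len=4096)
all_keys = visible_keys(10, global_window)

# Sliding window: only the most recent positions are visible.
recent_keys = visible_keys(10, effective_window(4, max_seq_len=4096))
```

When the window is at least as long as the sequence, `start` is always 0, which mirrors the statement that the masking guard is never taken for global attention.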

### Description

The webgpu-local-testing skill is failing to load because of invalid YAML in its frontmatter. The unquoted `description:` value contained colon-space sequences (`SCOPE: lavapipe`, `e.g.: `), and according to Copilot, a plain (unquoted) YAML scalar can't contain ": ". The parser reads it as a nested mapping key and aborts with:

ScannerError: mapping values are not allowed here

It was the only one of the 8 skills with this pattern, which is why every other skill loaded fine. The fix is to wrap the description value in double quotes and change `SCOPE:` to `SCOPE -` so the colons are treated as literal text. The frontmatter now parses, with both required keys (name, description) intact.

### Motivation and Context

The Copilot CLI was flagging this skill as failing to load, so this change attempts to resolve that error.

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
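A minimal sketch of the quoting rule behind this fix, assuming a single-line frontmatter value. The helper `quote_if_needed` is hypothetical and covers only a few of the cases where YAML plain style is unsafe; a real tool should rely on a YAML library instead.

```python
def quote_if_needed(value: str) -> str:
    """Return `value` as a YAML scalar, double-quoting it when plain
    style would be misread.

    Only three patterns are checked (assumption: single-line values):
    ": " (mapping indicator), a trailing ":", and " #" (comment start).
    """
    if ": " in value or value.endswith(":") or " #" in value:
        # Escape backslashes first, then double quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return value


# The colon-space sequence forces quoting; ordinary text is left alone.
frontmatter_line = "description: " + quote_if_needed("Local tests. SCOPE: lavapipe")
```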

### Description

In the ONNX 1.22 release, `onnx_opset_version()` returns 27, which is the latest "in development" opset in ONNX rather than the released opset version. This breaks tests in ORT because there is a validation check. The tests are adjusted so that the test models are stamped with the latest released opset version.

### Motivation and Context

Fix packaging pipeline break.

Successful run - https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1277057&view=results

### Description

Fuse the MoE router `MatMulNBits + Add([32] bias)` pattern into the CUDA `MatMulNBits` router GEMV path.

This PR keeps the public surface conservative:

- no QMoE op schema change;
- no new router/top-k QMoE inputs;
- the optimized path is exact-shape gated to the GPT-OSS router projection: `M=1`, `N=32`, `K=2880`, 4-bit weights, `block_size=32`, no zero points;
- all other `MatMulNBits` shapes continue to use the existing generic path;
- `ORT_DISABLE_QMOE_ROUTER_GEMV_SPECIALIZATION=1` disables the exact router GEMV specialization;
- `ORT_DISABLE_QMOE_ROUTER_BIAS_FUSION=1` disables only the graph rewrite that folds the router bias into `MatMulNBits`.

### Motivation and Context

GPT-OSS-20B decode runs a tiny router projection before each QMoE node. The router projection is an exact-shape int4 `MatMulNBits`, followed by a `[32]` bias add before `QMoE` consumes the router logits. The existing generic int4 GEMV works, but this router shape is small enough that specializing it reduces router GEMV overhead. Once that specialization is active, folding the `[32]` bias into the same kernel removes the remaining router-side `Add` launch without changing the QMoE op contract.

### Key Changes

- Adds an exact-shape CUDA router GEMV specialization in `MatMulFloatInt4RouterKernel`.
- Extends the CUDA `MatMulNBits` path to pass an optional bias pointer to the router specialization.
- Extends `MatMulNBitsFusion` to rewrite the exact GPT-OSS router `MatMulNBits + Add` chain into biased `MatMulNBits`.
- Keeps the transformer registration compatible with the current `origin/main` WebGPU kernel-gated MatMulNBits fusion logic.
- Adds graph transformer and MatMul4Bits coverage for the specialization, fallback, and bias-fusion opt-out behavior.
- Records the router GEMV and router bias fusion measurements in the QMoE GEMV experiment log.

### Validation

Completed locally on the clean PR branch:

- `lintrunner -a docs/contrib_ops/cuda/qmoe_gemv_experiments.md onnxruntime/contrib_ops/cuda/quantization/matmul_4bits.cu onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cuh onnxruntime/core/optimizer/graph_transformer_utils.cc onnxruntime/core/optimizer/matmul_nbits_fusion.cc onnxruntime/test/contrib_ops/cuda_kernels/fpA_intB_gemm_kernel_test.cc onnxruntime/test/contrib_ops/matmul_4bits_test.cc onnxruntime/test/optimizer/graph_transform_test.cc`
- `git diff --check`
- `git diff --cached --check`

Previously collected on the experiment branch before preparing this PR branch:

- Graph transformer tests for router GEMV/bias fusion passed.
- MatMul4Bits provider coverage for router GEMV specialization/fallback passed.
- Nsight confirmed the exact router specialization dispatches for GPT-OSS decode router projections.
- CUDA-graph GPT-OSS decode A/B showed the router GEMV specialization at about `+1.6%` to `+1.8%` throughput.
- Router bias fusion removed all 24 real GPT-OSS router bias `Add` nodes and measured about `+0.2%` throughput after the router GEMV specialization.

Compiled C++ tests were not rerun from this new worktree because it does not have a configured build directory; CI should provide the full compiled validation matrix.
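The exact-shape gate listed in the description can be written as a small predicate. The sketch below is a Python illustration rather than the CUDA dispatch code: `MatMulNBitsShape` and `use_router_gemv` are hypothetical names, and the way the environment variable is read (a string comparison with `"1"`) is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MatMulNBitsShape:
    """Hypothetical container for the properties the gate inspects."""
    m: int
    n: int
    k: int
    bits: int
    block_size: int
    has_zero_points: bool


def use_router_gemv(shape: MatMulNBitsShape,
                    env: Mapping[str, str] = os.environ) -> bool:
    """Return True only for the GPT-OSS router projection shape."""
    # Opt-out switch named in the PR description.
    if env.get("ORT_DISABLE_QMOE_ROUTER_GEMV_SPECIALIZATION") == "1":
        return False
    return (
        shape.m == 1
        and shape.n == 32
        and shape.k == 2880
        and shape.bits == 4
        and shape.block_size == 32
        and not shape.has_zero_points
    )


router = MatMulNBitsShape(m=1, n=32, k=2880, bits=4, block_size=32,
                          has_zero_points=False)
other = MatMulNBitsShape(m=1, n=64, k=2880, bits=4, block_size=32,
                         has_zero_points=False)
```

Any shape that fails a single condition, such as `other` above, stays on the generic path, which matches the conservative surface the PR describes.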

…ft#29021)

### Description

Add CPU time offset to WebGPU GPU profiling timestamps so they align with the ORT profiler's time base (microseconds since profiling start). Previously GPU events started from 0, causing misalignment in trace viewers.

### Motivation and Context

See above.
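The alignment amounts to shifting each GPU timestamp by a CPU-side offset. A minimal sketch, assuming the GPU timestamps are already in microseconds relative to the first GPU event and that the offset is the profiler time at which GPU timing began; `align_gpu_timestamps` is a hypothetical helper, not an ORT API.

```python
def align_gpu_timestamps(gpu_relative_us: list[int], cpu_offset_us: int) -> list[int]:
    """Shift GPU timestamps into the profiler's time base.

    gpu_relative_us: GPU event times in microseconds, starting near 0.
    cpu_offset_us:   microseconds since profiling start at which GPU
                     timing began (assumption about how it is measured).
    """
    return [ts + cpu_offset_us for ts in gpu_relative_us]


# GPU events that started from 0 now sit after 1500 us of CPU-side work.
aligned = align_gpu_timestamps([0, 40, 95], cpu_offset_us=1500)
```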

…ft#29017)

### Description

The native WebGPU EP already supports the buffer cache mode options (`ep.webgpuexecutionprovider.storageBufferCacheMode` and friends), but onnxruntime-web never forwarded them from `executionProviders`, so they were unreachable from JS. This adds `storageBufferCacheMode`, `uniformBufferCacheMode`, `queryResolveBufferCacheMode` and `defaultBufferCacheMode` to `WebGpuExecutionProviderOption` and forwards them to the EP the same way `validationMode` is forwarded today, with the values validated against the set the native side accepts. The options ride the existing `SessionOptionsAppendExecutionProvider` path, which prefixes each key into exactly the config entry the EP reads, so no native changes are needed.

### Motivation and Context

Fixes microsoft#29016. For static shape models, `storageBufferCacheMode: 'simple'` reuses exact size buffers across runs instead of allocating new bucket sized ones, which the issue's repro shows cutting peak WebGPU memory by about 27 percent.

Verified locally with tsc builds of js/common and js/web, prettier and eslint, the js/common unit tests, and type level checks that the new options compile and invalid values are rejected.

---------

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
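The forwarding logic (validate, then prefix each key) can be illustrated outside of TypeScript. The Python sketch below is not the onnxruntime-web code: `to_config_entries` is a hypothetical helper, and the accepted value set is an assumption, since the description only confirms `'simple'` and mentions bucket-sized buffers.

```python
CONFIG_PREFIX = "ep.webgpuexecutionprovider."

CACHE_MODE_KEYS = (
    "storageBufferCacheMode",
    "uniformBufferCacheMode",
    "queryResolveBufferCacheMode",
    "defaultBufferCacheMode",
)

# Assumption: the full set the native side accepts is not listed in the PR text.
ACCEPTED_MODES = frozenset({"disabled", "lazyRelease", "simple", "bucket"})


def to_config_entries(options: dict) -> dict:
    """Turn EP options into prefixed config entries, rejecting bad values."""
    entries = {}
    for key in CACHE_MODE_KEYS:
        if key not in options:
            continue  # unset options are simply not forwarded
        value = options[key]
        if value not in ACCEPTED_MODES:
            raise ValueError(f"invalid {key}: {value!r}")
        entries[CONFIG_PREFIX + key] = value
    return entries


entries = to_config_entries({"storageBufferCacheMode": "simple"})
```

Validating before forwarding means a typo surfaces as an error in the caller rather than being silently ignored by the EP.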

### Description

Fixes two NVCC 13.3 (`cudafe++` / EDG front-end) parse regressions that break the Linux CUDA build of ONNX Runtime. Both are host-side parser bugs in the CUDA 13.3 toolkit that reject valid C++ which compiles fine on CUDA 13.2 and earlier.

1. **Abseil member alias template.** NVCC 13.3 mis-parses the qualified-id `IfRRef<...>::AddPtr<Other>` used inside abseil's `insert_or_assign` / `try_emplace` macros, failing with `using template type parameter ... after 'typename'`. A new patch introduces a top-level alias template `IfRRefAddPtr<T, Other>` and routes the macros through it. Because it stays an alias template, substitution remains in the immediate context, so forming a pointer-to-reference is still a soft (SFINAE) failure rather than a hard error, and the original behavior is preserved.
2. **CCCL global-qualified partial specializations.** `<cub/device/device_transform.cuh>` and `<cub/device/dispatch/tuning/tuning_transform.cuh>` declare `struct ::cuda::proclaims_copyable_arguments<...> : ::cuda::std::true_type {};` at global scope, which NVCC 13.3 rejects with `global qualification of class name is invalid before ':' token`. Since the affected headers ship inside the (often read-only) CUDA toolkit, the build now generates corrected copies, with the specializations rewritten into namespace-reopened form (`_CCCL_BEGIN_NAMESPACE_CUDA ... _CCCL_END_NAMESPACE_CUDA`), into the build tree and places that directory ahead of the toolkit CCCL include path. The transform is a no-op on toolkits that do not contain the offending pattern, so it is safe to keep enabled across CUDA versions.

### Summary of changes

| File | Change |
|------|--------|
| `cmake/patches/abseil/absl_cuda13_member_template.patch` | New patch adding the `IfRRefAddPtr` alias template and rewriting the abseil container macros to use it. |
| `cmake/vcpkg-ports/abseil/absl_cuda13_member_template.patch` | Same patch copied into the vcpkg overlay port (vcpkg looks for patches in the port directory). |
| `cmake/vcpkg-ports/abseil/portfile.cmake` | Add the new patch to the abseil overlay port `PATCHES` list. |
| `cmake/external/abseil-cpp.cmake` | Apply the new patch in the non-vcpkg FetchContent path (both Windows and non-Windows branches). |
| `cmake/onnxruntime_providers_cuda.cmake` | Add `ort_cuda13_patch_cccl_header()` and, for CUDA >= 13.0, generate fixed CCCL headers into the build tree and prepend that directory to the CUDA include path. |

### Motivation and Context

The CUDA 13.3 toolkit introduced `cudafe++` parser regressions that reject valid template code accepted by CUDA 13.2 and earlier, so the Linux CUDA build fails before producing any libraries. These workarounds restore the build on CUDA 13.3 while remaining no-ops on toolkits without the regressions, so existing CUDA versions are unaffected.

- Related upstream issue: abseil/abseil-cpp#2075

### How was this tested?

- Full Linux build with CUDA 13.3 + cuDNN 9.23 (`CMAKE_CUDA_ARCHITECTURES="89;90"`, Release) completes successfully and produces the `onnxruntime_gpu` wheel; the two previously-failing translation units (`bias_softmax_impl.cu` and `moe_kernel.cu`) now compile.
- The CMake-generated CCCL headers were verified byte-identical to a manually-fixed reference that compiles the affected files with `exit 0`.
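The CCCL header rewrite described above is a text transform, so its shape can be shown in Python even though the PR implements it in CMake. This is a rough sketch under stated assumptions: `reopen_cuda_namespace` and the regular expression are hypothetical, they handle only simple `template <...>` headers, and they are not the logic of `ort_cuda13_patch_cccl_header()`.

```python
import re

# Matches: template <...> struct ::cuda::proclaims_copyable_arguments<...>
#          : ::cuda::std::true_type {};
_GLOBAL_QUALIFIED_SPEC = re.compile(
    r"(?P<tmpl>template\s*<[^;{}]*?>\s*)"
    r"struct\s+::cuda::(?P<spec>proclaims_copyable_arguments\s*<[^;{}]*?>)"
    r"\s*:\s*::cuda::std::true_type\s*\{\s*\}\s*;"
)


def _reopen(match: re.Match) -> str:
    # Drop the leading ::cuda:: and wrap the declaration in the namespace macros.
    return (
        "_CCCL_BEGIN_NAMESPACE_CUDA\n"
        + match.group("tmpl")
        + "struct " + match.group("spec") + " : ::cuda::std::true_type {};\n"
        + "_CCCL_END_NAMESPACE_CUDA"
    )


def reopen_cuda_namespace(header_text: str) -> str:
    """Rewrite global-qualified specializations; leave other text untouched."""
    return _GLOBAL_QUALIFIED_SPEC.sub(_reopen, header_text)


# Hypothetical header fragment used only to exercise the transform.
sample = (
    "template <class T>\n"
    "struct ::cuda::proclaims_copyable_arguments<Foo<T>> : ::cuda::std::true_type {};\n"
)
fixed = reopen_cuda_namespace(sample)
```

Because the substitution only fires when the pattern is present, running it over a header without the offending declaration returns the input unchanged, which is the no-op property the description relies on.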

tianleiwu and others added 8 commits June 23, 2026 21:21

Merge remote-tracking branch 'origin/master' into sync_msft_25062026

e34f53b

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel June 24, 2026 20:34

hdharpure9922 self-requested a review June 25, 2026 04:16

hdharpure9922 approved these changes Jun 25, 2026


hdharpure9922 merged commit 28b6f4c into ovep-develop Jun 25, 2026
7 of 9 checks passed

hdharpure9922 deleted the sync_msft_25062026 branch June 25, 2026 06:11

hdharpure9922 restored the sync_msft_25062026 branch June 25, 2026 06:13


8 participants
