Sync with Microsoft ONNX Runtime - 25062026 #1161

Merged
hdharpure9922 merged 8 commits into ovep-develop from sync_msft_25062026 on Jun 25, 2026

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

tianleiwu and others added 8 commits June 23, 2026 21:21
## Description

The XQA decode kernel previously fell back to FlashDecode whenever a
local (sliding) attention window was configured, so GPT-OSS / Mistral /
Gemma2 style models could not use the faster XQA path on their
sliding-window layers. This PR wires `local_window_size` through the
fp16/bf16 XQA kernels so they serve both global and sliding-window
attention, and adds parity tests that confirm the new path is exercised.

## Summary of Changes

### Sliding-window XQA kernel

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Drop the `local_window_size == -1` gate for the XQA path; keep INT8/FP8 variants global-only via a new `is_global_attention` guard. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu` | Pass `parameters.local_window_size` into `ExtremeDecoding`. |
| `onnxruntime/contrib_ops/cuda/bert/xqa/xqa_impl_gen.cuh` | Map ORT `local_window_size` (`-1` → `max_seq_len`, else the value) to XQA `slidingWinSize`, guarded by `#if SLIDING_WINDOW`. |
| `onnxruntime/contrib_ops/cuda/bert/xqa/xqa_loader.h`, `xqa_loader_fp16*.{cu,cuh}`, `xqa_loader_bf16*.{cu,cuh}` | Thread a new `local_window_size` parameter through the launch path; enable `#define SLIDING_WINDOW 1` in the fp16/bf16 impl headers. |

Global attention (`local_window_size == -1`) maps to a window `>=
max_seq_len`, so the kernel's runtime masking guard is never taken. The
result is numerically identical to the previous global-only behavior
with zero added overhead.
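
As a rough illustration of the mapping described above, the following Python sketch mirrors the `-1` → `max_seq_len` rule and shows why a window at least as large as the sequence masks nothing. The function names and the exact window boundary convention (the window counts the current token) are assumptions for illustration, not the kernel's actual code.

```python
def xqa_sliding_win_size(local_window_size: int, max_seq_len: int) -> int:
    """Map ORT's local_window_size to an XQA-style window (-1 means global)."""
    return max_seq_len if local_window_size == -1 else local_window_size


def visible_key_positions(seq_len: int, sliding_win_size: int) -> range:
    """Key positions a single decode token at position seq_len - 1 may attend to.

    Assumed convention: the window includes the current token.
    """
    start = max(0, seq_len - sliding_win_size)
    return range(start, seq_len)


# Global attention: the window covers the whole sequence, so nothing is masked.
global_window = xqa_sliding_win_size(-1, max_seq_len=4096)
all_visible = visible_key_positions(10, global_window)

# Sliding window of 4: only the last four positions remain visible.
local_visible = visible_key_positions(10, xqa_sliding_win_size(4, max_seq_len=4096))
```

With a global window the visible range always starts at position 0, which is the case where the masking branch has nothing to do.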

### Tests and profiling

- `onnxruntime/test/python/transformers/test_gqa.py`: new
`TestXQASlidingWindowParity` class and
`gqa_xqa_sliding_window_test_cases()` generator (fp16/bf16 × head_size
{64, 128} × group {4, 8} × past/window relationships × with/without
head_sink), forcing `ORT_ENABLE_XQA=1` and checking parity against the
reference.
- `onnxruntime/test/python/transformers/profile_gqa.sh`: add a
`--gpt-oss` preset and a `--compare-xqa` mode that profiles XQA vs
FlashDecode for the same shape.
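
The parity grid above can be pictured as a Cartesian product of the listed dimensions. This is only a sketch: the two "past/window relationship" labels are hypothetical (the description does not name them), and two values are assumed here because that makes the product match the 32 cases reported under Testing.

```python
import itertools

DTYPES = ("fp16", "bf16")
HEAD_SIZES = (64, 128)
GROUP_SIZES = (4, 8)
# Hypothetical labels: the description only says "past/window relationships".
PAST_VS_WINDOW = ("past_within_window", "past_exceeds_window")
HEAD_SINK = (False, True)


def sliding_window_cases():
    """Yield one tuple per combination of the dimensions listed above."""
    yield from itertools.product(
        DTYPES, HEAD_SIZES, GROUP_SIZES, PAST_VS_WINDOW, HEAD_SINK
    )


cases = list(sliding_window_cases())
```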

### Documentation

- `docs/contrib_ops/cuda/gqa.md` (new) replaces
`docs/contrib_ops/gqa.md`, documenting the CUDA GroupQueryAttention
backends and dispatch.

## Testing

- `cd onnxruntime/test/python/transformers && PYTHONPATH=<build_dir>
python test_gqa.py TestXQASlidingWindowParity` — all 32 cases pass on
H200 (SM90).
- Kernel selection verified via
`ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1` (`SdpaKernel=XQA`) and an
`nsys` trace showing `H64::grp4_fp16::kernel_mha` launches instead of
`flash_fwd_splitkv_kernel`.

## Motivation and Context

GPT-OSS-20B has 12 sliding-window layers (`local_window_size=128`,
head_sink, fp16, 64 q / 8 kv heads, head_size 64). On H200 single-token
decode the XQA kernel is ~2.2× faster than FlashDecode on these shapes,
so enabling XQA for the sliding-window layers improves end-to-end decode
latency.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [x] No breaking changes (global-only behavior preserved; quantized
paths unchanged)
- [x] CI passes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
The `webgpu-local-testing` skill is failing to load because of invalid
YAML in its frontmatter. The unquoted `description:` value contained
colon-space sequences (`SCOPE: lavapipe`, `e.g.:`), and according to
Copilot, a plain (unquoted) YAML scalar cannot contain `: `. The
parser reads it as a nested mapping key and aborts with:

 ScannerError: mapping values are not allowed here

It was the only one of the 8 skills with this pattern, which is why
every other skill loaded fine. The fix wraps the description value in
double quotes and changes `SCOPE:` to `SCOPE -` so the remaining colons
are treated as literal text. The frontmatter now parses, with both
required keys (`name`, `description`) intact.
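
A minimal sketch of the rule at play, using only the standard library: a value containing `: ` (or ` #`) cannot be written as a plain scalar and needs quoting. Both helper names are hypothetical, and the check is deliberately rough; it does not cover every YAML restriction on plain scalars.

```python
def needs_quoting(value: str) -> bool:
    """Rough check for text that is unsafe as a YAML plain (unquoted) scalar.

    Covers only the common cases: a colon followed by a space, a trailing
    colon, and a space followed by '#' (which would start a comment).
    """
    return ": " in value or value.endswith(":") or " #" in value


def quote_for_yaml(value: str) -> str:
    """Wrap the value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


description = "SCOPE: lavapipe only"
frontmatter_line = "description: " + (
    quote_for_yaml(description) if needs_quoting(description) else description
)
```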

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The Copilot CLI was flagging this skill as failing to load, so this
change attempts to resolve that error.

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
### Description
The ONNX 1.22 release returns 27 from the API `onnx_opset_version()`.
That value is the latest "in development" opset in ONNX, not the latest
released opset version. This breaks tests in ORT because of a validation
check, so the tests are adjusted to stamp the test models with the
latest released opset version.
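
One way to picture the adjustment is to cap whatever the API reports at a known released opset. This is a sketch only: the helper name is hypothetical, and the released value passed in below is a placeholder assumption, not a number taken from this change.

```python
def opset_for_test_models(reported_opset: int, latest_released_opset: int) -> int:
    """Pick the opset to stamp on test models.

    reported_opset may be an in-development opset, so it is capped at the
    latest released opset supplied by the caller.
    """
    return min(reported_opset, latest_released_opset)


# 27 is the value quoted above; 26 is an illustrative placeholder only.
stamped = opset_for_test_models(reported_opset=27, latest_released_opset=26)
```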


### Motivation and Context
Fix packaging pipeline break

Successful run -
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1277057&view=results
### Description

Fuse the MoE router `MatMulNBits + Add([32] bias)` pattern into the CUDA
`MatMulNBits` router GEMV path.

This PR keeps the public surface conservative:

- no QMoE op schema change;
- no new router/top-k QMoE inputs;
- the optimized path is exact-shape gated to the GPT-OSS router
projection: `M=1`, `N=32`, `K=2880`, 4-bit weights, `block_size=32`, no
zero points;
- all other `MatMulNBits` shapes continue to use the existing generic
path;
- `ORT_DISABLE_QMOE_ROUTER_GEMV_SPECIALIZATION=1` disables the exact
router GEMV specialization;
- `ORT_DISABLE_QMOE_ROUTER_BIAS_FUSION=1` disables only the graph
rewrite that folds the router bias into `MatMulNBits`.
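
The gating described in the list above can be sketched as two predicates. The shape values and environment variable names come from the description; the class, function names, and the way the flags are read are assumptions, and any interaction between the two opt-outs is not modeled here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MatMulNBitsShape:
    m: int
    n: int
    k: int
    bits: int
    block_size: int
    has_zero_points: bool


# Exact GPT-OSS router projection shape quoted in the description.
ROUTER_SHAPE = MatMulNBitsShape(
    m=1, n=32, k=2880, bits=4, block_size=32, has_zero_points=False
)


def _flag_set(name: str, env) -> bool:
    return env.get(name) == "1"


def use_router_gemv(shape: MatMulNBitsShape, env=os.environ) -> bool:
    """True only for the exact router shape, unless the opt-out is set."""
    if _flag_set("ORT_DISABLE_QMOE_ROUTER_GEMV_SPECIALIZATION", env):
        return False
    return shape == ROUTER_SHAPE


def fuse_router_bias(shape: MatMulNBitsShape, bias_len: int, env=os.environ) -> bool:
    """True when the MatMulNBits + Add([32]) rewrite would apply."""
    if _flag_set("ORT_DISABLE_QMOE_ROUTER_BIAS_FUSION", env):
        return False
    return shape == ROUTER_SHAPE and bias_len == shape.n
```

Any shape that differs in a single field falls through to the generic path, which is what keeps the specialization conservative.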

### Motivation and Context

GPT-OSS-20B decode runs a tiny router projection before each QMoE node.
The router projection is an exact-shape int4 `MatMulNBits`, followed by
a `[32]` bias add before `QMoE` consumes the router logits.

The existing generic int4 GEMV works, but this router shape is small
enough that specializing it reduces router GEMV overhead. Once that
specialization is active, folding the `[32]` bias into the same kernel
removes the remaining router-side `Add` launch without changing the QMoE
op contract.

### Key Changes

- Adds an exact-shape CUDA router GEMV specialization in
`MatMulFloatInt4RouterKernel`.
- Extends the CUDA `MatMulNBits` path to pass an optional bias pointer
to the router specialization.
- Extends `MatMulNBitsFusion` to rewrite the exact GPT-OSS router
`MatMulNBits + Add` chain into biased `MatMulNBits`.
- Keeps the transformer registration compatible with the current
`origin/main` WebGPU kernel-gated MatMulNBits fusion logic.
- Adds graph transformer and MatMul4Bits coverage for the
specialization, fallback, and bias-fusion opt-out behavior.
- Records the router GEMV and router bias fusion measurements in the
QMoE GEMV experiment log.

### Validation

Completed locally on the clean PR branch:

- `lintrunner -a docs/contrib_ops/cuda/qmoe_gemv_experiments.md
onnxruntime/contrib_ops/cuda/quantization/matmul_4bits.cu
onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc
onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cuh
onnxruntime/core/optimizer/graph_transformer_utils.cc
onnxruntime/core/optimizer/matmul_nbits_fusion.cc
onnxruntime/test/contrib_ops/cuda_kernels/fpA_intB_gemm_kernel_test.cc
onnxruntime/test/contrib_ops/matmul_4bits_test.cc
onnxruntime/test/optimizer/graph_transform_test.cc`
- `git diff --check`
- `git diff --cached --check`

Previously collected on the experiment branch before preparing this PR
branch:

- Graph transformer tests for router GEMV/bias fusion passed.
- MatMul4Bits provider coverage for router GEMV specialization/fallback
passed.
- Nsight confirmed the exact router specialization dispatches for
GPT-OSS decode router projections.
- CUDA-graph GPT-OSS decode A/B showed the router GEMV specialization at
about `+1.6%` to `+1.8%` throughput.
- Router bias fusion removed all 24 real GPT-OSS router bias `Add` nodes
and measured about `+0.2%` throughput after the router GEMV
specialization.

Compiled C++ tests were not rerun from this new worktree because it does
not have a configured build directory; CI should provide the full
compiled validation matrix.
…ft#29021)

### Description
Add CPU time offset to WebGPU GPU profiling timestamps so they align
with the ORT profiler's time base (microseconds since profiling start).
Previously GPU events started from 0, causing misalignment in trace
viewers.
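
The alignment amounts to adding one offset to every GPU event. The sketch below assumes events are `(name, start_us, duration_us)` tuples already expressed in microseconds; the tuple layout and function name are illustrative, not the profiler's actual data structures.

```python
def align_gpu_events(gpu_events, cpu_offset_us):
    """Shift GPU event start times onto the CPU profiler's time base.

    gpu_events: iterable of (name, start_us, duration_us) with starts near 0.
    cpu_offset_us: microseconds since profiling start at which GPU time 0 occurred.
    """
    return [
        (name, start_us + cpu_offset_us, duration_us)
        for name, start_us, duration_us in gpu_events
    ]


raw = [("MatMul", 0, 5), ("Softmax", 10, 2)]
aligned = align_gpu_events(raw, cpu_offset_us=1000)
```

Durations are left untouched; only the start times move.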


### Motivation and Context
See above.
…ft#29017)

### Description
The native WebGPU EP already supports the buffer cache mode options
(`ep.webgpuexecutionprovider.storageBufferCacheMode` and friends), but
onnxruntime-web never forwarded them from `executionProviders`, so they
were unreachable from JS. This adds `storageBufferCacheMode`,
`uniformBufferCacheMode`, `queryResolveBufferCacheMode` and
`defaultBufferCacheMode` to `WebGpuExecutionProviderOption` and forwards
them to the EP the same way `validationMode` is forwarded today, with
the values validated against the set the native side accepts. The
options ride the existing `SessionOptionsAppendExecutionProvider` path,
which prefixes each key into exactly the config entry the EP reads, so
no native changes are needed.
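
The forwarding logic can be sketched language-neutrally as "validate, then prefix". The option names and the `ep.webgpuexecutionprovider.` prefix come from the description above; the set of accepted values is an assumption about what the native side accepts, and the function name is hypothetical.

```python
CACHE_MODE_KEYS = (
    "storageBufferCacheMode",
    "uniformBufferCacheMode",
    "queryResolveBufferCacheMode",
    "defaultBufferCacheMode",
)
# Assumed set of values accepted by the native EP; verify against the EP source.
ACCEPTED_MODES = frozenset({"disabled", "lazyRelease", "simple", "bucket"})
CONFIG_PREFIX = "ep.webgpuexecutionprovider."


def webgpu_cache_config_entries(options: dict) -> dict:
    """Turn user-supplied cache mode options into prefixed config entries."""
    entries = {}
    for key in CACHE_MODE_KEYS:
        value = options.get(key)
        if value is None:
            continue  # option not set: leave the EP default in place
        if value not in ACCEPTED_MODES:
            raise ValueError(f"invalid {key}: {value!r}")
        entries[CONFIG_PREFIX + key] = value
    return entries


entries = webgpu_cache_config_entries({"storageBufferCacheMode": "simple"})
```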

### Motivation and Context
Fixes microsoft#29016. For static shape models, `storageBufferCacheMode:
'simple'` reuses exact size buffers across runs instead of allocating
new bucket sized ones, which the issue's repro shows cutting peak WebGPU
memory by about 27 percent. Verified locally with tsc builds of
js/common and js/web, prettier and eslint, the js/common unit tests, and
type level checks that the new options compile and invalid values are
rejected.

---------

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
### Description

Fixes two NVCC 13.3 (`cudafe++` / EDG front-end) parse regressions that
break the Linux CUDA build of ONNX Runtime. Both are host-side parser
bugs in the CUDA 13.3 toolkit that reject valid C++ which compiles fine
on CUDA 13.2 and earlier.

1. **Abseil member alias template.** NVCC 13.3 mis-parses the
qualified-id `IfRRef<...>::AddPtr<Other>` used inside abseil's
`insert_or_assign` / `try_emplace` macros, failing with `using template
type parameter ... after 'typename'`. A new patch introduces a top-level
alias template `IfRRefAddPtr<T, Other>` and routes the macros through
it. Because it stays an alias template, substitution remains in the
immediate context, so forming a pointer-to-reference is still a soft
(SFINAE) failure rather than a hard error — the original behavior is
preserved.

2. **CCCL global-qualified partial specializations.**
`<cub/device/device_transform.cuh>` and
`<cub/device/dispatch/tuning/tuning_transform.cuh>` declare `struct
::cuda::proclaims_copyable_arguments<...> : ::cuda::std::true_type {};`
at global scope, which NVCC 13.3 rejects with `global qualification of
class name is invalid before ':' token`. Since the affected headers ship
inside the (often read-only) CUDA toolkit, the build now generates
corrected copies — rewriting the specializations into namespace-reopened
form (`_CCCL_BEGIN_NAMESPACE_CUDA ... _CCCL_END_NAMESPACE_CUDA`) — into
the build tree and places that directory ahead of the toolkit CCCL
include path. The transform is a no-op on toolkits that do not contain
the offending pattern, so it is safe to keep enabled across CUDA
versions.
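
The header rewrite in item 2 can be illustrated with a small text transform. The actual change does this in CMake; the Python below is a simplified stand-in that assumes each specialization and its `template <...>` header fit on single lines, and the function and pattern names are hypothetical.

```python
import re

# Matches an optional one-line template header followed by a one-line,
# globally qualified specialization such as:
#   struct ::cuda::proclaims_copyable_arguments<...> : ::cuda::std::true_type {};
_GLOBAL_QUALIFIED_SPEC = re.compile(
    r"(?P<template>template\s*<[^\n]*>[ \t]*\n)?"
    r"struct ::cuda::(?P<rest>[^\n]*\{\};)"
)


def _reopen_namespace(match: re.Match) -> str:
    template = match.group("template") or ""
    return (
        "_CCCL_BEGIN_NAMESPACE_CUDA\n"
        + template
        + "struct " + match.group("rest") + "\n"
        + "_CCCL_END_NAMESPACE_CUDA"
    )


def patch_cccl_header(text: str) -> str:
    """Rewrite global-qualified specializations into namespace-reopened form.

    Returns the text unchanged when the pattern is absent.
    """
    return _GLOBAL_QUALIFIED_SPEC.sub(_reopen_namespace, text)
```

Because the substitution only fires on the offending pattern, running it over a header that lacks the pattern returns the input unchanged, which matches the no-op behavior described above.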

### Summary of changes

| File | Change |
|------|--------|
| `cmake/patches/abseil/absl_cuda13_member_template.patch` | New patch adding the `IfRRefAddPtr` alias template and rewriting the abseil container macros to use it. |
| `cmake/vcpkg-ports/abseil/absl_cuda13_member_template.patch` | Same patch copied into the vcpkg overlay port (vcpkg looks for patches in the port directory). |
| `cmake/vcpkg-ports/abseil/portfile.cmake` | Add the new patch to the abseil overlay port `PATCHES` list. |
| `cmake/external/abseil-cpp.cmake` | Apply the new patch in the non-vcpkg FetchContent path (both Windows and non-Windows branches). |
| `cmake/onnxruntime_providers_cuda.cmake` | Add `ort_cuda13_patch_cccl_header()` and, for CUDA >= 13.0, generate fixed CCCL headers into the build tree and prepend that directory to the CUDA include path. |

### Motivation and Context

The CUDA 13.3 toolkit introduced `cudafe++` parser regressions that
reject valid template code accepted by CUDA 13.2 and earlier, so the
Linux CUDA build fails before producing any libraries. These workarounds
restore the build on CUDA 13.3 while remaining no-ops on toolkits
without the regressions, so existing CUDA versions are unaffected.

- Related upstream issue:
abseil/abseil-cpp#2075

### How was this tested?

- Full Linux build with CUDA 13.3 + cuDNN 9.23
(`CMAKE_CUDA_ARCHITECTURES="89;90"`, Release) completes successfully and
produces the `onnxruntime_gpu` wheel; the two previously-failing
translation units (`bias_softmax_impl.cu` and `moe_kernel.cu`) now
compile.
- The CMake-generated CCCL headers were verified byte-identical to a
manually-fixed reference that compiles the affected files with `exit 0`.
hdharpure9922 approved these changes Jun 25, 2026
@hdharpure9922 hdharpure9922 merged commit 28b6f4c into ovep-develop Jun 25, 2026
7 of 9 checks passed
@hdharpure9922 hdharpure9922 deleted the sync_msft_25062026 branch June 25, 2026 06:11
@hdharpure9922 hdharpure9922 restored the sync_msft_25062026 branch June 25, 2026 06:13