Sync with Microsoft ONNX Runtime - 25062026 #1161
Merged
## Description
The XQA decode kernel previously fell back to FlashDecode whenever a local (sliding) attention window was configured, so GPT-OSS / Mistral / Gemma2 style models could not use the faster XQA path on their sliding-window layers. This PR wires `local_window_size` through the fp16/bf16 XQA kernels so they serve both global and sliding-window attention, and adds parity tests that confirm the new path is exercised.
## Summary of Changes
### Sliding-window XQA kernel
| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Drop the `local_window_size == -1` gate for the XQA path; keep INT8/FP8 variants global-only via a new `is_global_attention` guard. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu` | Pass `parameters.local_window_size` into `ExtremeDecoding`. |
| `onnxruntime/contrib_ops/cuda/bert/xqa/xqa_impl_gen.cuh` | Map ORT `local_window_size` (`-1` → `max_seq_len`, else the value) to XQA `slidingWinSize`, guarded by `#if SLIDING_WINDOW`. |
| `onnxruntime/contrib_ops/cuda/bert/xqa/xqa_loader.h`, `xqa_loader_fp16*.{cu,cuh}`, `xqa_loader_bf16*.{cu,cuh}` | Thread a new `local_window_size` parameter through the launch path; enable `#define SLIDING_WINDOW 1` in the fp16/bf16 impl headers. |

Global attention (`local_window_size == -1`) maps to a window `>= max_seq_len`, so the kernel's runtime masking guard is never taken. The result is numerically identical to the previous global-only behavior, with zero added overhead.
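
The mapping described above can be modeled in a few lines of Python. This is an illustrative sketch, not the kernel code: `to_sliding_win_size` and `visible_key_range` are hypothetical helper names, and the window boundary convention (the window counts the newest token) is an assumption.

```python
def to_sliding_win_size(local_window_size: int, max_seq_len: int) -> int:
    """Map ORT's local_window_size to an XQA-style window size.

    -1 means global attention and becomes max_seq_len; any other value
    is passed through unchanged.
    """
    return max_seq_len if local_window_size == -1 else local_window_size


def visible_key_range(seq_len: int, sliding_win_size: int) -> range:
    """Key positions the newest token may attend to during decode.

    Assumption: the window covers the last `sliding_win_size` positions,
    including the newest token itself.
    """
    start = max(0, seq_len - sliding_win_size)
    return range(start, seq_len)


max_seq_len = 4096

# Global attention: the window is never smaller than the sequence,
# so no key position is masked out.
global_win = to_sliding_win_size(-1, max_seq_len)
assert visible_key_range(10, global_win) == range(0, 10)

# Window of 128 with 300 cached tokens: only the last 128 are visible.
local_win = to_sliding_win_size(128, max_seq_len)
assert visible_key_range(300, local_win) == range(172, 300)
```

Because the global case yields a window at least as long as any sequence, the start index is always clamped to 0, which is the model-level counterpart of the masking guard never being taken.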
### Tests and profiling
- `onnxruntime/test/python/transformers/test_gqa.py`: new `TestXQASlidingWindowParity` class and `gqa_xqa_sliding_window_test_cases()` generator (fp16/bf16 × head_size {64, 128} × group {4, 8} × past/window relationships × with/without head_sink), forcing `ORT_ENABLE_XQA=1` and checking parity against the reference.
- `onnxruntime/test/python/transformers/profile_gqa.sh`: add a `--gpt-oss` preset and a `--compare-xqa` mode that profiles XQA vs FlashDecode for the same shape.
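
The test grid listed above can be pictured as a Cartesian product. The sketch below is not the generator from `test_gqa.py`; the function name, dictionary keys, and the two past/window relationship labels are assumptions chosen so that the grid matches the 32 cases reported under Testing.

```python
import itertools

DTYPES = ("fp16", "bf16")
HEAD_SIZES = (64, 128)
GROUP_SIZES = (4, 8)
# Assumption: two past/window relationships; the labels are hypothetical.
PAST_VS_WINDOW = ("past_shorter_than_window", "past_longer_than_window")
HEAD_SINK = (False, True)


def sliding_window_cases():
    """Yield one config dict per combination of the five axes."""
    axes = (DTYPES, HEAD_SIZES, GROUP_SIZES, PAST_VS_WINDOW, HEAD_SINK)
    for dtype, head_size, group, relation, sink in itertools.product(*axes):
        yield {
            "dtype": dtype,
            "head_size": head_size,
            "group": group,
            "past_vs_window": relation,
            "head_sink": sink,
        }


cases = list(sliding_window_cases())
print(len(cases))  # 32
```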
### Documentation
- `docs/contrib_ops/cuda/gqa.md` (new) replaces `docs/contrib_ops/gqa.md`, documenting the CUDA GroupQueryAttention backends and dispatch.
## Testing
- `cd onnxruntime/test/python/transformers && PYTHONPATH=<build_dir> python test_gqa.py TestXQASlidingWindowParity`: all 32 cases pass on H200 (SM90).
- Kernel selection verified via `ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1` (`SdpaKernel=XQA`) and an `nsys` trace showing `H64::grp4_fp16::kernel_mha` launches instead of `flash_fwd_splitkv_kernel`.
## Motivation and Context
GPT-OSS-20B has 12 sliding-window layers (`local_window_size=128`, head_sink, fp16, 64 q / 8 kv heads, head_size 64). On H200 single-token decode the XQA kernel is ~2.2× faster than FlashDecode on these shapes, so enabling XQA for the sliding-window layers improves end-to-end decode latency.
## Checklist
- [x] Tests added/updated
- [x] Documentation updated
- [x] No breaking changes (global-only behavior preserved; quantized paths unchanged)
- [x] CI passes
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
### Description

The webgpu-local-testing skill is failing to load because of invalid YAML in its frontmatter. The unquoted `description:` value contained colon-space sequences (`SCOPE: lavapipe`, `e.g.:`), and according to Copilot, a plain (unquoted) scalar in YAML can't contain `: `. The parser reads it as a nested mapping key and aborts with:

`ScannerError: mapping values are not allowed here`

It was the only one of the 8 skills with this pattern, which is why every other skill loaded fine.

The fix is to wrap the description value in double quotes and adjust `SCOPE:` to `SCOPE -` so the colons are treated as literal text. The frontmatter now parses, with both required keys (name, description) intact.

### Motivation and Context

The Copilot CLI was flagging this skill as failing to load, so this change attempts to resolve that error.

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
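
The failure mode described above can be caught with a small pre-check. The sketch below is a line-based heuristic, not a YAML parser: it only flags single-line `key: value` entries whose unquoted value contains a colon followed by a space. The function name and the sample description text are hypothetical.

```python
import re

# Matches a simple top-level "key: value" line.
_KEY_VALUE = re.compile(r"^([A-Za-z_][\w-]*):\s+(.*)$")


def risky_plain_scalars(frontmatter: str) -> list[str]:
    """Return keys whose unquoted value contains ': ' (colon + space)."""
    risky = []
    for line in frontmatter.splitlines():
        match = _KEY_VALUE.match(line)
        if not match:
            continue
        key, value = match.groups()
        quoted = value.startswith(('"', "'"))
        if not quoted and ": " in value:
            risky.append(key)
    return risky


# Hypothetical frontmatter, before and after the kind of fix described above.
broken = "name: webgpu-local-testing\ndescription: Run tests. SCOPE: lavapipe only"
fixed = 'name: webgpu-local-testing\ndescription: "Run tests. SCOPE - lavapipe only"'

print(risky_plain_scalars(broken))  # ['description']
print(risky_plain_scalars(fixed))   # []
```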
### Description

The ONNX 1.22 release returns 27 from the `onnx_opset_version()` API. That is the latest "in development" opset in ONNX, not the released opset version. This breaks tests in ORT because there is a validation check, so the tests are adjusted to stamp the test models with the latest released opset version.

### Motivation and Context

Fix packaging pipeline break.

Successful run - https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1277057&view=results
### Description

Fuse the MoE router `MatMulNBits + Add([32] bias)` pattern into the CUDA `MatMulNBits` router GEMV path.

This PR keeps the public surface conservative:

- no QMoE op schema change;
- no new router/top-k QMoE inputs;
- the optimized path is exact-shape gated to the GPT-OSS router projection: `M=1`, `N=32`, `K=2880`, 4-bit weights, `block_size=32`, no zero points;
- all other `MatMulNBits` shapes continue to use the existing generic path;
- `ORT_DISABLE_QMOE_ROUTER_GEMV_SPECIALIZATION=1` disables the exact router GEMV specialization;
- `ORT_DISABLE_QMOE_ROUTER_BIAS_FUSION=1` disables only the graph rewrite that folds the router bias into `MatMulNBits`.

### Motivation and Context

GPT-OSS-20B decode runs a tiny router projection before each QMoE node. The router projection is an exact-shape int4 `MatMulNBits`, followed by a `[32]` bias add before `QMoE` consumes the router logits.

The existing generic int4 GEMV works, but this router shape is small enough that specializing it reduces router GEMV overhead. Once that specialization is active, folding the `[32]` bias into the same kernel removes the remaining router-side `Add` launch without changing the QMoE op contract.

### Key Changes

- Adds an exact-shape CUDA router GEMV specialization in `MatMulFloatInt4RouterKernel`.
- Extends the CUDA `MatMulNBits` path to pass an optional bias pointer to the router specialization.
- Extends `MatMulNBitsFusion` to rewrite the exact GPT-OSS router `MatMulNBits + Add` chain into biased `MatMulNBits`.
- Keeps the transformer registration compatible with the current `origin/main` WebGPU kernel-gated MatMulNBits fusion logic.
- Adds graph transformer and MatMul4Bits coverage for the specialization, fallback, and bias-fusion opt-out behavior.
- Records the router GEMV and router bias fusion measurements in the QMoE GEMV experiment log.
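
The exact-shape gate described above can be written as a single predicate. The following is a Python model of that dispatch decision under the constraints listed in the description, not the C++ code in `matmul_nbits.cc`; the `MatMulNBitsShape` type and `use_router_gemv` function are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MatMulNBitsShape:
    m: int
    n: int
    k: int
    bits: int
    block_size: int
    has_zero_points: bool


def use_router_gemv(shape: MatMulNBitsShape, env=os.environ) -> bool:
    """Return True only for the exact GPT-OSS router projection shape."""
    # The opt-out variable wins over any shape match.
    if env.get("ORT_DISABLE_QMOE_ROUTER_GEMV_SPECIALIZATION") == "1":
        return False
    return (
        shape.m == 1
        and shape.n == 32
        and shape.k == 2880
        and shape.bits == 4
        and shape.block_size == 32
        and not shape.has_zero_points
    )


router = MatMulNBitsShape(m=1, n=32, k=2880, bits=4, block_size=32, has_zero_points=False)
other = MatMulNBitsShape(m=1, n=64, k=2880, bits=4, block_size=32, has_zero_points=False)

print(use_router_gemv(router, env={}))  # True
print(use_router_gemv(other, env={}))   # False: falls back to the generic path
```

Gating on every field, rather than on a size threshold, keeps all other `MatMulNBits` shapes on the existing generic path.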
### Validation

Completed locally on the clean PR branch:

- `lintrunner -a docs/contrib_ops/cuda/qmoe_gemv_experiments.md onnxruntime/contrib_ops/cuda/quantization/matmul_4bits.cu onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cuh onnxruntime/core/optimizer/graph_transformer_utils.cc onnxruntime/core/optimizer/matmul_nbits_fusion.cc onnxruntime/test/contrib_ops/cuda_kernels/fpA_intB_gemm_kernel_test.cc onnxruntime/test/contrib_ops/matmul_4bits_test.cc onnxruntime/test/optimizer/graph_transform_test.cc`
- `git diff --check`
- `git diff --cached --check`

Previously collected on the experiment branch before preparing this PR branch:

- Graph transformer tests for router GEMV/bias fusion passed.
- MatMul4Bits provider coverage for router GEMV specialization/fallback passed.
- Nsight confirmed the exact router specialization dispatches for GPT-OSS decode router projections.
- CUDA-graph GPT-OSS decode A/B showed the router GEMV specialization at about `+1.6%` to `+1.8%` throughput.
- Router bias fusion removed all 24 real GPT-OSS router bias `Add` nodes and measured about `+0.2%` throughput after the router GEMV specialization.

Compiled C++ tests were not rerun from this new worktree because it does not have a configured build directory; CI should provide the full compiled validation matrix.
…ft#29021)

### Description

Add CPU time offset to WebGPU GPU profiling timestamps so they align with the ORT profiler's time base (microseconds since profiling start). Previously GPU events started from 0, causing misalignment in trace viewers.

### Motivation and Context

See above.
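
The alignment described above amounts to adding one offset to every GPU event's start time. The sketch below is a Python model under an assumption: `cpu_offset_us` stands for the CPU time, in microseconds since profiling start, at which the GPU timeline's zero point was captured. The function and variable names are hypothetical.

```python
def align_gpu_events(gpu_events_us, cpu_offset_us):
    """Shift GPU event start times onto the CPU profiler's time base.

    Each event is (name, start_us, duration_us); durations are unchanged.
    """
    return [
        (name, start_us + cpu_offset_us, duration_us)
        for name, start_us, duration_us in gpu_events_us
    ]


# GPU timestamps that begin at 0, as in the behavior before the fix.
gpu_events = [("MatMul", 0, 40), ("Add", 55, 10)]

aligned = align_gpu_events(gpu_events, cpu_offset_us=1500)
print(aligned)  # [('MatMul', 1500, 40), ('Add', 1555, 10)]
```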
…ft#29017)

### Description

The native WebGPU EP already supports the buffer cache mode options (`ep.webgpuexecutionprovider.storageBufferCacheMode` and friends), but onnxruntime-web never forwarded them from `executionProviders`, so they were unreachable from JS.

This adds `storageBufferCacheMode`, `uniformBufferCacheMode`, `queryResolveBufferCacheMode` and `defaultBufferCacheMode` to `WebGpuExecutionProviderOption` and forwards them to the EP the same way `validationMode` is forwarded today, with the values validated against the set the native side accepts. The options ride the existing `SessionOptionsAppendExecutionProvider` path, which prefixes each key into exactly the config entry the EP reads, so no native changes are needed.

### Motivation and Context

Fixes microsoft#29016. For static shape models, `storageBufferCacheMode: 'simple'` reuses exact size buffers across runs instead of allocating new bucket sized ones, which the issue's repro shows cutting peak WebGPU memory by about 27 percent.

Verified locally with tsc builds of js/common and js/web, prettier and eslint, the js/common unit tests, and type level checks that the new options compile and invalid values are rejected.

---------

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
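
The forwarding described above (validate, then prefix each key) can be modeled as follows. This is a Python sketch of the idea, not the TypeScript in onnxruntime-web. The function name is hypothetical, the accepted value set is an assumption (only `'simple'` is named in the description), and the prefixed form of the three non-storage keys is assumed to follow the `storageBufferCacheMode` example.

```python
EP_PREFIX = "ep.webgpuexecutionprovider."

CACHE_MODE_KEYS = (
    "storageBufferCacheMode",
    "uniformBufferCacheMode",
    "queryResolveBufferCacheMode",
    "defaultBufferCacheMode",
)

# Assumption: illustrative set of values accepted by the native side.
ACCEPTED_MODES = {"disabled", "lazyRelease", "simple", "bucket"}


def to_config_entries(ep_options: dict) -> dict:
    """Validate cache mode options and map them to prefixed config entries."""
    entries = {}
    for key in CACHE_MODE_KEYS:
        if key not in ep_options:
            continue
        value = ep_options[key]
        if value not in ACCEPTED_MODES:
            raise ValueError(f"invalid {key}: {value!r}")
        entries[EP_PREFIX + key] = value
    return entries


entries = to_config_entries({"name": "webgpu", "storageBufferCacheMode": "simple"})
print(entries)  # {'ep.webgpuexecutionprovider.storageBufferCacheMode': 'simple'}
```

Rejecting unknown values before forwarding surfaces a typo as an error at session creation instead of letting an unrecognized string reach the EP.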
### Description
Fixes two NVCC 13.3 (`cudafe++` / EDG front-end) parse regressions that break the Linux CUDA build of ONNX Runtime. Both are host-side parser bugs in the CUDA 13.3 toolkit that reject valid C++ which compiles fine on CUDA 13.2 and earlier.

1. **Abseil member alias template.** NVCC 13.3 mis-parses the qualified-id `IfRRef<...>::AddPtr<Other>` used inside abseil's `insert_or_assign` / `try_emplace` macros, failing with `using template type parameter ... after 'typename'`. A new patch introduces a top-level alias template `IfRRefAddPtr<T, Other>` and routes the macros through it. Because it stays an alias template, substitution remains in the immediate context, so forming a pointer-to-reference is still a soft (SFINAE) failure rather than a hard error, and the original behavior is preserved.
2. **CCCL global-qualified partial specializations.** `<cub/device/device_transform.cuh>` and `<cub/device/dispatch/tuning/tuning_transform.cuh>` declare `struct ::cuda::proclaims_copyable_arguments<...> : ::cuda::std::true_type {};` at global scope, which NVCC 13.3 rejects with `global qualification of class name is invalid before ':' token`. Since the affected headers ship inside the (often read-only) CUDA toolkit, the build now generates corrected copies in the build tree, rewriting the specializations into namespace-reopened form (`_CCCL_BEGIN_NAMESPACE_CUDA ... _CCCL_END_NAMESPACE_CUDA`), and places that directory ahead of the toolkit CCCL include path. The transform is a no-op on toolkits that do not contain the offending pattern, so it is safe to keep enabled across CUDA versions.
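
The header rewrite in item 2 can be illustrated with a small text transform. The build does this in CMake via `ort_cuda13_patch_cccl_header()`; the Python below is only a model of the idea. The regex, the function name, and the sample specialization are assumptions, not the actual CCCL header contents.

```python
import re

# A template header followed by a globally qualified partial specialization.
_SPEC = re.compile(
    r"(?P<head>template\s*<[^;{}]*?>\s*)"
    r"struct\s+::cuda::"
    r"(?P<body>proclaims_copyable_arguments<[^;{}]*?>"
    r"\s*:\s*::cuda::std::true_type\s*\{\s*\}\s*;)"
)


def reopen_namespace(text: str) -> str:
    """Rewrite `struct ::cuda::X<...>` specializations into namespace-reopened form."""

    def _rewrite(match: re.Match) -> str:
        return (
            "_CCCL_BEGIN_NAMESPACE_CUDA\n"
            + match.group("head")
            + "struct "
            + match.group("body")
            + "\n_CCCL_END_NAMESPACE_CUDA"
        )

    return _SPEC.sub(_rewrite, text)


# Hypothetical input resembling the rejected declaration form.
sample = (
    "template <class T>\n"
    "struct ::cuda::proclaims_copyable_arguments<::cuda::std::plus<T>>"
    " : ::cuda::std::true_type {};\n"
)

patched = reopen_namespace(sample)
print(patched)

# Text without the pattern is returned unchanged, mirroring the no-op property.
untouched = "struct unrelated {};\n"
assert reopen_namespace(untouched) == untouched
```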
### Summary of changes
| File | Change |
|------|--------|
| `cmake/patches/abseil/absl_cuda13_member_template.patch` | New patch adding the `IfRRefAddPtr` alias template and rewriting the abseil container macros to use it. |
| `cmake/vcpkg-ports/abseil/absl_cuda13_member_template.patch` | Same patch copied into the vcpkg overlay port (vcpkg looks for patches in the port directory). |
| `cmake/vcpkg-ports/abseil/portfile.cmake` | Add the new patch to the abseil overlay port `PATCHES` list. |
| `cmake/external/abseil-cpp.cmake` | Apply the new patch in the non-vcpkg FetchContent path (both Windows and non-Windows branches). |
| `cmake/onnxruntime_providers_cuda.cmake` | Add `ort_cuda13_patch_cccl_header()` and, for CUDA >= 13.0, generate fixed CCCL headers into the build tree and prepend that directory to the CUDA include path. |
### Motivation and Context
The CUDA 13.3 toolkit introduced `cudafe++` parser regressions that reject valid template code accepted by CUDA 13.2 and earlier, so the Linux CUDA build fails before producing any libraries. These workarounds restore the build on CUDA 13.3 while remaining no-ops on toolkits without the regressions, so existing CUDA versions are unaffected.

- Related upstream issue: abseil/abseil-cpp#2075
### How was this tested?
- Full Linux build with CUDA 13.3 + cuDNN 9.23 (`CMAKE_CUDA_ARCHITECTURES="89;90"`, Release) completes successfully and produces the `onnxruntime_gpu` wheel; the two previously-failing translation units (`bias_softmax_impl.cu` and `moe_kernel.cu`) now compile.
- The CMake-generated CCCL headers were verified byte-identical to a manually-fixed reference that compiles the affected files with `exit 0`.
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.