Sync msft 25062026#1163

Closed
hdharpure9922 wants to merge 6 commits into intel:ovep-develop from hdharpure9922:master

@hdharpure9922

Backmerge master into ovep-develop to synchronize the latest upstream changes.

adrastogi and others added 6 commits June 24, 2026 14:47
…icrosoft#28771)

### Description
Relax the input-validation in OrtApi::CompileModel to accept OrtModel
instances with zero graph inputs. Previously,
ModelCompilationOptions::Check() rejected such models with "OrtModel
graph must have at least one input and one output defined." The check
now requires only at least one graph output; the zero-input case is
legal.
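
A minimal sketch of the relaxed rule, for illustration only. The function name `check_compile_model_io` and the error text are hypothetical stand-ins, not the actual `ModelCompilationOptions::Check()` code:

```python
def check_compile_model_io(num_inputs: int, num_outputs: int) -> None:
    """Hypothetical stand-in for the relaxed input/output validation."""
    if num_inputs < 0 or num_outputs < 0:
        raise ValueError("input/output counts must be non-negative")
    # Zero graph inputs is legal, e.g. a graph driven only by a RandomNormal node.
    # Only the output count is still required to be at least one.
    if num_outputs == 0:
        # Hypothetical message; the real wording may differ.
        raise ValueError("OrtModel graph must have at least one output defined.")


# Zero inputs with one output passes; one input with zero outputs is rejected.
check_compile_model_io(num_inputs=0, num_outputs=1)
```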

Tests in test_model_builder_api.cc are restructured:

- The old CompileFromModelWithEmptyInputsOutputs_Fails is renamed to
CompileFromModelWithEmptyOutputs_Fails and reshaped to provide 1 input +
0 outputs, isolating the output-only check.
- A new regression test CompileFromModelWithEmptyInputs_Succeeds builds
a 0-input model with a RandomNormal node and verifies compilation
succeeds.

### Motivation and Context
Fixes microsoft#28135 
The original check was too restrictive and affects real callers. For
example, WebNN/Chromium needs to call CompileModel on such models in a
separate compiler process and then load the compiled artifact via
CreateSessionFromArray in the GPU process.
…ttention (microsoft#29240)

### Description

The CUDA `GroupQueryAttention` kernel derives a KV-cache append offset
from the `seqlens_k` input (`past_seq_lens = (seqlens_k + 1) -
sequence_length`). On the CUDA EP `seqlens_k` is device-resident (only
`total_sequence_length` is a CPU input), so the host-side range
validation in the operator/helper is skipped. The device kernel
`UnpackRoPEAppend` then guarded the cache store with only a one-sided
upper bound (`cache_s < max_seqlen`), so an out-of-range `seqlens_k`
could produce a negative offset that is sign-extended into the
cache-index arithmetic.

The CPU operator already validates `seqlens_k` host-side; this change
brings the CUDA path to parity by guarding on the device.

### Changes
- `group_query_attention_impl.cu` (`GetSequenceLengths`): clamp the
negative case at the source so both `total_seq_lens` and the append
offset `past_seq_lens` stay non-negative for all downstream consumers.
- `group_query_attention_qkv.cuh` (`UnpackRoPEAppend`): make the
KV-cache store bound two-sided (`cache_s >= 0 && cache_s < max_seqlen`),
mirroring the existing position-index guard a few lines above. This also
covers the fast-decode path, where `past_seq_lens` points directly at
the raw input and bypasses `GetSequenceLengths`.
- Added `NegativeSeqlensK_CacheAppend_NoOOB_CUDA` regression test
exercising the KV-cache append path with an out-of-range `seqlens_k`
(CUDA-guarded; skips when CUDA EP is unavailable).
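
The two guards can be modeled on the host as a rough sketch. Both function names mirror the kernels only loosely, and the exact clamping arithmetic is an assumption rather than the code in `group_query_attention_impl.cu`:

```python
def get_sequence_lengths(seqlens_k: int, sequence_length: int) -> tuple:
    """Assumed clamp: keep total and past lengths non-negative."""
    total_seq_len = max(seqlens_k + 1, 0)
    past_seq_len = max(total_seq_len - sequence_length, 0)
    return total_seq_len, past_seq_len


def append_to_cache(cache: list, new_tokens: list, past_seq_len: int,
                    max_seqlen: int) -> int:
    """Store new tokens at past_seq_len + s, using a two-sided bound check."""
    written = 0
    for s, token in enumerate(new_tokens):
        cache_s = past_seq_len + s
        # A one-sided check (cache_s < max_seqlen) would let a negative
        # cache_s through; the lower bound rejects it as well.
        if 0 <= cache_s < max_seqlen:
            cache[cache_s] = token
            written += 1
    return written


cache = [0] * 4
# A raw, out-of-range offset (as on the fast-decode path) writes nothing.
print(append_to_cache(cache, [9], past_seq_len=-3, max_seqlen=4), cache)
```

In Python a negative index silently wraps to the end of the list, which makes a convenient stand-in for the unintended device-memory write that the lower bound prevents.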

### Notes
- The two-sided guard matches the pattern introduced for the rotary
position index in microsoft#27597.
- CPU is unaffected (already validated host-side); WebGPU relies on the
CPU-validated `total_sequence_length`. The CUDA implementation is shared
with ROCm via hipify.
- The regression is a device-memory write best observed under
`compute-sanitizer`; the test asserts the run completes with finite
outputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…oft#28962)

## Summary

Adds an FP32 flash attention path for the CPU
`com.microsoft.GroupQueryAttention` (GQA) contrib op, mirroring the
existing quantized-KV flash attention path. The new tiled,
online-softmax kernel avoids materializing the full `[S, T]` attention
score matrix. It is restricted to prefill / chunked-prefill
(`sequence_length > 1`); single-token decode falls back to the naive
path. With causal early-termination it is faster than the naive path
across all measured prefill lengths while using a fraction of the
memory.

## Key changes

- **New MLAS kernel** `onnxruntime/core/mlas/lib/flashattn_gqa.cpp`
  (`MlasFlashAttentionGQA`):
  - Tiled QK / softmax / SV with online-softmax (running max/sum
    rescaling).
  - GQA head grouping (`num_heads % kv_num_heads == 0`), causal masking,
    local window, additive attention bias, and packed-QKV input.
  - **Causal early-termination**: during prefill, KV blocks that fall
    entirely in the causally masked upper triangle are skipped (`break` once
    `ir >= past_seqlen + q_idx + row_size_q`), avoiding the wasted QK/SV
    GEMMs over roughly half of the square prefill attention matrix.
  - Per-batch invocation for ragged / shared-buffer `seqlens_k`.
- **MLAS API** `onnxruntime/core/mlas/inc/mlas.h`: new
`MlasFlashAttentionGQAArgs` struct and `MlasFlashAttentionGQA`
declaration.
- **Dispatch** `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h`:
new `ApplyAttentionFlash` that concatenates new K/V into the FP32
present cache and invokes the kernel. The per-thread scratch buffer size
is computed with `SafeInt<size_t>` to guard against `size_t` overflow on
large/malformed shapes before allocation.
- **Wiring**
`onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc`: float-only
flash dispatch, active only for prefill (`sequence_length > 1`) and when
`softcap == 0`, no smooth softmax, no head sink, no QK output; falls
back to the naive path otherwise. The existing
`ORT_GQA_DISABLE_FLASH_ATTENTION` env var disables it.
- **CMake** `cmake/onnxruntime_mlas.cmake`: register the new source
file.
- **Docs** `docs/contrib_ops/cpu/gqa.md`: document the non-quantized
flash attention path, activation conditions, causal early-termination,
file list, and FP32 flash-vs-naive benchmark results.
- **Benchmark**
`onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py`: add
an FP32 (non-quantized) mode (`--fp32`) for operator-level
flash-vs-naive comparison.
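
To make the tiling, online-softmax rescaling, and causal early-termination concrete, here is a single-head, pure-Python sketch. It is not the MLAS kernel: the block sizes, function names, and data layout are illustrative assumptions, and only the `break` condition is taken from the description above.

```python
import math


def naive_attention(q, k, v, past_seqlen=0):
    """Reference: one full causal score row per query, then softmax and SV."""
    out = []
    for i, qi in enumerate(q):
        limit = past_seqlen + i  # last key index visible to query i
        scores = [sum(a * b for a, b in zip(qi, k[j])) for j in range(limit + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][d] for j in range(limit + 1)) / z
                    for d in range(len(v[0]))])
    return out


def flash_attention(q, k, v, past_seqlen=0, block_q=2, block_kv=2):
    """Tiled attention with running max/sum; never builds the full score matrix."""
    head = len(v[0])
    out = []
    for q_idx in range(0, len(q), block_q):
        rows = q[q_idx:q_idx + block_q]
        row_size_q = len(rows)
        m = [-math.inf] * row_size_q            # running row maxima
        l = [0.0] * row_size_q                  # running softmax denominators
        acc = [[0.0] * head for _ in rows]      # running unnormalized outputs
        for ir in range(0, len(k), block_kv):
            # Causal early-termination: this KV block and all later ones are
            # fully masked for every row of the current Q block.
            if ir >= past_seqlen + q_idx + row_size_q:
                break
            for r, qi in enumerate(rows):
                limit = past_seqlen + q_idx + r
                cols = [j for j in range(ir, min(ir + block_kv, len(k))) if j <= limit]
                if not cols:
                    continue
                s = [sum(a * b for a, b in zip(qi, k[j])) for j in cols]
                m_new = max(m[r], max(s))
                scale = math.exp(m[r] - m_new)  # rescales earlier partial sums
                p = [math.exp(x - m_new) for x in s]
                l[r] = l[r] * scale + sum(p)
                acc[r] = [acc[r][d] * scale + sum(pj * v[j][d] for pj, j in zip(p, cols))
                          for d in range(head)]
                m[r] = m_new
        out.extend([a / l[r] for a in acc[r]] for r in range(row_size_q))
    return out
```

Both functions expect `k` and `v` to hold `past_seqlen + len(q)` rows; on small inputs their outputs should agree to within floating-point rounding.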

### Why prefill-only (`sequence_length > 1`)

Single-token decode (`sequence_length == 1`) produces only a `[1,
total_sequence_length]` score row per head, so there is nothing to tile
away and the extra online-softmax bookkeeping makes the flash kernel
slower and noisier than naive in practice. Restricting the flash path to
prefill keeps the consistent prefill win without regressing decode.
Because decode is excluded, the two-phase flash-decoding kernels are
unreachable and have been removed for a smaller, simpler implementation.

`float16` continues to use the naive path (the kernel is float-only,
matching the quantized flash constraint).

## Performance

Operator-level, AMD EPYC 7763 (16 physical cores), threads=8, FP32 KV
cache, `B=1, num_heads=16, kv_num_heads=8, head_size=128`. Flash is
faster than naive across all measured prefill lengths, and also when run
single-threaded (1.4-1.8x), which indicates the gain is algorithmic: the
causal early-termination removes the wasted upper-triangle work that
previously made flash slower than naive at short sequences.

| Prefill Seq Length | Naive (ms) | Flash (ms) | Speedup |
|---:|---:|---:|---:|
| 512  | 5.8-8.4 | 4.2-5.3 | 1.4-1.6x |
| 1024 | 25-29   | 13-18   | 1.6-2.0x |
| 2048 | 87-118  | 52-65   | 1.5-2.0x |
| 4096 | 365-380 | 213-234 | 1.6-1.7x |

The flash path's primary structural benefit is memory: it never
allocates the full O(N x S x T) attention matrix (~1 GB at S=4096, N=16)
and instead uses an O(S x Bc) per-thread tile.
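
The ~1 GB figure can be checked with a short calculation. The sketch assumes `B=1`, FP32 (4-byte) scores, and no past KV so that `T == S`; the helper name is hypothetical:

```python
def score_matrix_bytes(num_heads: int, seq_len: int, total_len: int,
                       bytes_per_elem: int = 4) -> int:
    """Bytes needed to materialize the full [N, S, T] score matrix."""
    return num_heads * seq_len * total_len * bytes_per_elem


full = score_matrix_bytes(num_heads=16, seq_len=4096, total_len=4096)
print(full, full / 2**30)  # 1073741824 bytes, i.e. 1.0 GiB
```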

## Testing

- **C++ op tests**: `onnxruntime_provider_test
--gtest_filter="GroupQueryAttentionTest.*"` - 38 passed (12 GPU/WebGPU
skipped) with flash on (default) and with
`ORT_GQA_DISABLE_FLASH_ATTENTION=1`.
- **Flash vs. naive parity** (FP32): output of the flash path matches
the naive path (max abs diff ~1e-7) across prefill (block-aligned and
non-aligned `S`), MHA and GQA head ratios, and local window. Decode now
uses the naive path on both sides (diff 0).
- **Python parity** (`test_gqa_cpu.py`, flash vs. naive reference):
focused FP32 sweep of 600 prompt configurations covering all head sizes
(32-256), GQA ratios `(6,6)/(6,3)/(9,9)/(9,3)`, batches `1/3/5`,
causal/local window, attention bias, position ids, packed QKV, and
with/without KV buffer - all passed. The official `test_gqa_cpu.py`
suite passes.

Two correctness bugs were found and fixed via the parity sweep while
developing this path:
1. Attention-bias batch stride ignored head broadcasting for `[batch, 1,
S, T]` bias.
2. Query batch stride was hardcoded to `num_heads * S * H`, which is
incorrect for packed-QKV input (correct stride is `(num_heads + 2 *
kv_num_heads) * S * H`).
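
The two stride fixes can be illustrated with small helpers. The function names are hypothetical, and the bias layout is assumed to be `[batch, bias_heads, S, T]` with `bias_heads` equal to either `1` (broadcast) or `num_heads`:

```python
def query_batch_stride(num_heads: int, kv_num_heads: int, seq_len: int,
                       head_size: int, packed_qkv: bool) -> int:
    """Elements between consecutive batches in the query buffer."""
    # With packed QKV, each token row holds Q, K and V heads back to back.
    heads_per_token = num_heads + 2 * kv_num_heads if packed_qkv else num_heads
    return heads_per_token * seq_len * head_size


def bias_batch_stride(bias_heads: int, seq_len: int, total_len: int) -> int:
    """Elements between consecutive batches in the attention bias."""
    # Use the bias tensor's own head dimension (1 when broadcast), not num_heads.
    return bias_heads * seq_len * total_len


print(query_batch_stride(16, 8, 512, 128, packed_qkv=False))  # 1048576
print(query_batch_stride(16, 8, 512, 128, packed_qkv=True))   # 2097152
```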
…, GQA underflow, and ep_weight_sharing_ctx_gen build (microsoft#28245)

### Description

This PR contains three commits:

**Commit 1: Miscellaneous fixes**
- Downgrade QNN ETW profiling mismatch logs from ERROR to VERBOSE to
reduce excessive telemetry noise (~1 billion events/week across Windows
devices)
- Add bounds checking in GQA attention to prevent `size_t` underflow
when `seqlens_k` contains invalid data (fixes microsoft#27170)
- Build `ep_weight_sharing_ctx_gen` for TensorRT, OpenVINO, and VitisAI
in addition to QNN
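
For readers unfamiliar with the `size_t` underflow, the following sketch emulates 64-bit unsigned subtraction and a bounds check in front of it. The helper names and the exact condition are assumptions; the actual check in the GQA code may differ:

```python
SIZE_MAX = 2**64 - 1  # assumes a 64-bit size_t


def unsigned_sub(a: int, b: int) -> int:
    """Emulate size_t subtraction, which wraps around instead of going negative."""
    return (a - b) & SIZE_MAX


def checked_past_len(seqlens_k: int, sequence_length: int) -> int:
    """Validate before subtracting so invalid seqlens_k cannot wrap around."""
    total_seq_len = seqlens_k + 1
    if seqlens_k < 0 or total_seq_len < sequence_length:
        raise ValueError("seqlens_k is out of range for this sequence_length")
    return total_seq_len - sequence_length


# Without a check, 0 - 1 becomes a huge length instead of an error.
print(unsigned_sub(0, 1) == SIZE_MAX)  # True
```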

**Commit 2: Bump cpuinfo and add `cpuinfo_deinitialize()` integration**

Applications that dynamically load and unload the onnxruntime DLL leave
orphaned heap allocations from cpuinfo when the library is unloaded
mid-process. These are flagged as memory leaks by App Verifier,
Valgrind, AddressSanitizer, and LeakSanitizer.

This commit bumps `pytorch/cpuinfo` to a version that implements
`cpuinfo_deinitialize()`
([pytorch/cpuinfo#387](pytorch/cpuinfo#387)) and
adds ORT integration:
- `CPUIDInfo::ShutDown()` calls `cpuinfo_deinitialize()` to free
heap-allocated globals
- `DllMain` calls `ShutdownCpuInfo()` on `DLL_PROCESS_DETACH`
- In memleak-check builds, shutdown also runs during process termination
- `InstanceCreated` atomic guard prevents singleton creation during DLL
unload

**Commit 3: Update to official cpuinfo merged fix**

After [pytorch/cpuinfo#387](pytorch/cpuinfo#387)
merged upstream, updated the dependency to point to `pytorch/cpuinfo`
main (`4628dc06`).

Patch changes:
- **Removed** `win_arm_fp16_detection_fallback.patch` — upstreamed via
[pytorch/cpuinfo#348](pytorch/cpuinfo#348)
- **Updated** `patch_vcpkg_arm64ec_support.patch` — regenerated for new
cpuinfo; still needed
([pytorch/cpuinfo#324](pytorch/cpuinfo#324) not
yet merged)
- **Updated** `patch_cpuinfo_h_for_arm64ec.patch` — retained, not yet
upstream
- **Regenerated** `fix_missing_sysfs_fallback.patch` — updated context
lines for new cpuinfo code

### Motivation and Context

- pytorch/cpuinfo#150
- microsoft#16117
- microsoft#23762
…icrosoft#29221)

## Description

The CUDA plugin EP previously rejected combining a user-provided compute
stream (`user_compute_stream`) with CUDA graph capture
(`enable_cuda_graph`), returning `ORT_INVALID_ARGUMENT`. This PR removes
that restriction so the two options can be used together: when both are
set, graph capture and replay run on the user-owned stream (the same
stream the kernels are issued to), matching the bundled (non-plugin) CUDA
EP behavior. Several supporting fixes make capture on a shared stream
stable and Memcpy-free.

## Summary of Changes

### Allow user stream + CUDA graph

| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc) | Remove the validation that rejected `user_compute_stream` + `enable_cuda_graph` together. |
| [onnxruntime/core/providers/cuda/plugin/cuda_ep.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep.cc) | `PerThreadContext` accepts an optional external graph stream. When both options are set it captures/replays on the user stream and does **not** create or destroy it (the user owns its lifetime); otherwise it owns a dedicated graph stream as before. |

### Stable, Memcpy-free CUDA graph capture

| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h](onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h) | Route kernel scratch/workspace allocations through the EP allocator (BFC arena) instead of raw `cudaMallocAsync`/`cudaMalloc`. After warmup the arena reaches steady state, so the capture run serves scratch from already-reserved chunks and the device free-memory footprint stays stable, which is required for correct capture. Matches the built-in CUDA EP. |
| [onnxruntime/core/providers/cuda/tensor/shape_op.cc](onnxruntime/core/providers/cuda/tensor/shape_op.cc) | Add an adapter-based `Shape` kernel under `#ifdef BUILD_CUDA_EP_AS_PLUGIN` with identical semantics to the CPU `Shape`. Registering `Shape` on the EP keeps it off the CPU EP and avoids the Memcpy nodes that would otherwise break CUDA graph capture. |
| [cmake/onnxruntime_providers_cuda_plugin.cmake](cmake/onnxruntime_providers_cuda_plugin.cmake) | Stop excluding `shape_op.cc` from the plugin build so the adapter-based `Shape` kernel is compiled in. |

### Null-allocator fallback in PrePack (plugin boundary)

In the plugin build the `AllocatorPtr` passed to `PrePack` can arrive
null across the library boundary. Each kernel now falls back to its own
default-memory allocator (`Info().GetAllocator(OrtMemTypeDefault)`),
which is always valid.

- [onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc](onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc)
- [onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc](onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc)
- [onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc](onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc)

### Misc

- [onnxruntime/core/framework/session_state.cc](onnxruntime/core/framework/session_state.cc):
  wrap a long line (no behavior change).

## Testing

- New test:
  [onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc](onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc)
  covering:
  1. Session creation succeeds with both `user_compute_stream` and
     `enable_cuda_graph` set (regression for the removed validation).
  2. Capture + replay on the user stream produce correct results.
  3. Replay after an in-place input update on the user stream is correct.
- Tests are gated on `ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP` and skip
gracefully when no CUDA device or plugin library is available.

## Motivation and Context

Users that drive ORT from their own CUDA stream (e.g. to interleave ORT
inference with their own kernels) previously could not also benefit from
CUDA graph capture on the plugin EP. This change brings the plugin EP to
parity with the bundled CUDA EP for that workflow.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (relaxes a previously rejected option
combination)
- [ ] Documentation updated (if applicable)
## Summary
- align CPU ONNX Attention causal masking with upper-left behavior for
q_len=1, kv_len>1, no past
- preserve the existing `nonpad_kv_seqlen` / TensorScatter single-query
causal behavior
- update Python attention reference causal mask to model ONNX upper-left
alignment with an explicit past offset
- add a regression test for issue microsoft#29020

Fixes microsoft#29020
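
As a rough illustration of upper-left causal alignment with an explicit past offset, the sketch below builds a boolean mask in which query `i` may attend to key `j` when `j <= i + past_len`. The function name and signature are hypothetical and are not taken from the test reference in `common.py`:

```python
def causal_mask(q_len: int, kv_len: int, past_len: int = 0) -> list:
    """True where query i may attend to key j (upper-left aligned)."""
    return [[j <= i + past_len for j in range(kv_len)] for i in range(q_len)]


# q_len=1, kv_len=4, no past: the single query sees only key 0.
print(causal_mask(1, 4))              # [[True, False, False, False]]
# With 3 past tokens the same query sees all 4 keys.
print(causal_mask(1, 4, past_len=3))  # [[True, True, True, True]]
```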

## Validation
- `python -m py_compile
onnxruntime/test/python/transformers/test_onnx_attention/common.py
onnxruntime/test/python/transformers/test_onnx_attention/test_mha.py
onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py
onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py`
- `git diff --check`

Notes:
- `pytest
onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py
-k "cpu_fp32 and causal" -q` could not run locally because this Python
environment does not have `onnx` / `onnxruntime` installed.
- After the latest follow-up commit, an incremental rebuild of
`onnxruntime_provider_test` was attempted but failed in MSBuild before
compiling this change due to a local environment issue: duplicate `Path`
/ `PATH` environment keys when launching `CL.exe`.
6 participants