Sync msft 25062026#1163
Closed
hdharpure9922 wants to merge 6 commits into
…icrosoft#28771)

### Description

Relax the input validation in OrtApi::CompileModel to accept OrtModel instances with zero graph inputs. Previously, ModelCompilationOptions::Check() rejected such models with "OrtModel graph must have at least one input and one output defined." The check now requires only at least one graph output; the zero-input case is legal.

Tests in test_model_builder_api.cc are restructured:

- The old CompileFromModelWithEmptyInputsOutputs_Fails is renamed to CompileFromModelWithEmptyOutputs_Fails and reshaped to provide 1 input + 0 outputs, isolating the output-only check.
- A new regression test CompileFromModelWithEmptyInputs_Succeeds builds a 0-input model with a RandomNormal node and verifies compilation succeeds.

### Motivation and Context

Fixes microsoft#28135

The original check was too restrictive and impacts callers. For example, WebNN/Chromium needs to call CompileModel on such models in a separate compiler process and then load the compiled artifact via CreateSessionFromArray in the GPU process.
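The relaxed rule can be pictured with a small Python model. This is only a sketch of the validation logic described above, not ORT code: `GraphSketch` and `check_compilable` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GraphSketch:
    """Hypothetical stand-in for an OrtModel graph; not an ORT type."""
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def check_compilable(graph: GraphSketch) -> None:
    """Model of the relaxed rule: outputs are required, inputs are optional."""
    if not graph.outputs:
        raise ValueError("graph must have at least one output defined")


# A zero-input graph (for example, a single RandomNormal node) passes.
check_compilable(GraphSketch(inputs=[], outputs=["noise"]))

# One input and zero outputs is still rejected.
try:
    check_compilable(GraphSketch(inputs=["x"], outputs=[]))
    rejected = False
except ValueError:
    rejected = True
```

The two calls correspond to the two restructured tests: the zero-input case succeeds and the zero-output case fails.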
…ttention (microsoft#29240)

### Description

The CUDA `GroupQueryAttention` kernel derives a KV-cache append offset from the `seqlens_k` input (`past_seq_lens = (seqlens_k + 1) - sequence_length`). On the CUDA EP `seqlens_k` is device-resident (only `total_sequence_length` is a CPU input), so the host-side range validation in the operator/helper is skipped. The device kernel `UnpackRoPEAppend` then guarded the cache store with only a one-sided upper bound (`cache_s < max_seqlen`), so an out-of-range `seqlens_k` could produce a negative offset that is sign-extended into the cache-index arithmetic.

The CPU operator already validates `seqlens_k` host-side; this change brings the CUDA path to parity by guarding on the device.

### Changes

- `group_query_attention_impl.cu` (`GetSequenceLengths`): clamp the negative case at the source so both `total_seq_lens` and the append offset `past_seq_lens` stay non-negative for all downstream consumers.
- `group_query_attention_qkv.cuh` (`UnpackRoPEAppend`): make the KV-cache store bound two-sided (`cache_s >= 0 && cache_s < max_seqlen`), mirroring the existing position-index guard a few lines above. This also covers the fast-decode path, where `past_seq_lens` points directly at the raw input and bypasses `GetSequenceLengths`.
- Added `NegativeSeqlensK_CacheAppend_NoOOB_CUDA` regression test exercising the KV-cache append path with an out-of-range `seqlens_k` (CUDA-guarded; skips when CUDA EP is unavailable).

### Notes

- The two-sided guard matches the pattern introduced for the rotary position index in microsoft#27597.
- CPU is unaffected (already validated host-side); WebGPU relies on the CPU-validated `total_sequence_length`. The CUDA implementation is shared with ROCm via hipify.
- The regression is a device-memory write best observed under `compute-sanitizer`; the test asserts the run completes with finite outputs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
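The difference between the one-sided and two-sided cache-store guards in the GQA change above can be sketched in plain Python. The function names are hypothetical, and the exact form of the clamp inside `GetSequenceLengths` is an assumption; only the formula `past_seq_lens = (seqlens_k + 1) - sequence_length` and the two guard conditions come from the description.

```python
def past_seq_len_unclamped(seqlens_k: int, sequence_length: int) -> int:
    """Append offset exactly as the description states it."""
    return (seqlens_k + 1) - sequence_length


def past_seq_len_clamped(seqlens_k: int, sequence_length: int) -> int:
    """Assumed shape of the clamp: keep both derived lengths non-negative."""
    total_seq_len = max(seqlens_k + 1, 0)
    return max(total_seq_len - sequence_length, 0)


def one_sided_guard(cache_s: int, max_seqlen: int) -> bool:
    """Old store condition: only the upper bound is checked."""
    return cache_s < max_seqlen


def two_sided_guard(cache_s: int, max_seqlen: int) -> bool:
    """New store condition: both bounds are checked."""
    return 0 <= cache_s < max_seqlen


# An out-of-range seqlens_k yields a negative offset.
bad_offset = past_seq_len_unclamped(seqlens_k=-5, sequence_length=1)

# The one-sided guard lets the negative index through; the two-sided one does not.
old_allows_store = one_sided_guard(bad_offset, max_seqlen=8)
new_allows_store = two_sided_guard(bad_offset, max_seqlen=8)

# Clamping at the source keeps the offset non-negative for every consumer.
safe_offset = past_seq_len_clamped(seqlens_k=-5, sequence_length=1)
```

The clamp and the guard are complementary: the clamp protects consumers of `GetSequenceLengths`, while the guard also covers the fast-decode path that reads the raw input.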
…oft#28962)

## Summary

Adds an FP32 flash attention path for the CPU `com.microsoft.GroupQueryAttention` (GQA) contrib op, mirroring the existing quantized-KV flash attention path. The new tiled, online-softmax kernel avoids materializing the full `[S, T]` attention score matrix. It is restricted to prefill / chunked-prefill (`sequence_length > 1`); single-token decode falls back to the naive path. With causal early-termination it is faster than the naive path across all measured prefill lengths while using a fraction of the memory.

## Key changes

- **New MLAS kernel** `onnxruntime/core/mlas/lib/flashattn_gqa.cpp` (`MlasFlashAttentionGQA`):
  - Tiled QK / softmax / SV with online-softmax (running max/sum rescaling).
  - GQA head grouping (`num_heads % kv_num_heads == 0`), causal masking, local window, additive attention bias, and packed-QKV input.
  - **Causal early-termination**: during prefill, KV blocks that fall entirely in the causally masked upper triangle are skipped (`break` once `ir >= past_seqlen + q_idx + row_size_q`), avoiding the wasted QK/SV GEMMs over roughly half of the square prefill attention matrix.
  - Per-batch invocation for ragged / shared-buffer `seqlens_k`.
- **MLAS API** `onnxruntime/core/mlas/inc/mlas.h`: new `MlasFlashAttentionGQAArgs` struct and `MlasFlashAttentionGQA` declaration.
- **Dispatch** `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h`: new `ApplyAttentionFlash` that concatenates new K/V into the FP32 present cache and invokes the kernel. The per-thread scratch buffer size is computed with `SafeInt<size_t>` to guard against `size_t` overflow on large/malformed shapes before allocation.
- **Wiring** `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc`: float-only flash dispatch, active only for prefill (`sequence_length > 1`) and when `softcap == 0`, no smooth softmax, no head sink, no QK output; falls back to the naive path otherwise. The existing `ORT_GQA_DISABLE_FLASH_ATTENTION` env var disables it.
- **CMake** `cmake/onnxruntime_mlas.cmake`: register the new source file.
- **Docs** `docs/contrib_ops/cpu/gqa.md`: document the non-quantized flash attention path, activation conditions, causal early-termination, file list, and FP32 flash-vs-naive benchmark results.
- **Benchmark** `onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py`: add an FP32 (non-quantized) mode (`--fp32`) for operator-level flash-vs-naive comparison.

### Why prefill-only (`sequence_length > 1`)

Single-token decode (`sequence_length == 1`) produces only a `[1, total_sequence_length]` score row per head, so there is nothing to tile away and the extra online-softmax bookkeeping makes the flash kernel slower and noisier than naive in practice. Restricting the flash path to prefill keeps the consistent prefill win without regressing decode. Because decode is excluded, the two-phase flash-decoding kernels are unreachable and have been removed for a smaller, simpler implementation.

`float16` continues to use the naive path (the kernel is float-only, matching the quantized flash constraint).

## Performance

Operator-level, AMD EPYC 7763 (16 physical cores), threads=8, FP32 KV cache, `B=1, num_heads=16, kv_num_heads=8, head_size=128`.

Flash is faster than naive across all measured prefill lengths (and single-threaded as well, 1.4-1.8x), confirming the gain is algorithmic - the causal early-termination removes the wasted upper-triangle work that previously made flash slower than naive at short sequences.

| Prefill Seq Length | Naive (ms) | Flash (ms) | Speedup |
|---:|---:|---:|---:|
| 512 | 5.8-8.4 | 4.2-5.3 | 1.4-1.6x |
| 1024 | 25-29 | 13-18 | 1.6-2.0x |
| 2048 | 87-118 | 52-65 | 1.5-2.0x |
| 4096 | 365-380 | 213-234 | 1.6-1.7x |

The flash path's primary structural benefit is memory: it never allocates the full O(N x S x T) attention matrix (~1 GB at S=4096, N=16) and instead uses an O(S x Bc) per-thread tile.
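The online-softmax tiling and the causal early-termination described for the FP32 flash path can be illustrated with a pure-Python sketch for a single head. This is not the MLAS kernel: it processes one query row at a time (a query tile of size 1, so the break condition reduces to "block start is past the last visible key"), uses plain lists, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def naive_causal_attention(q: Matrix, k: Matrix, v: Matrix, past_len: int) -> Matrix:
    """Reference: build the whole visible score row, then softmax it."""
    out = []
    for i, qi in enumerate(q):
        limit = past_len + i  # index of the last key visible to query row i
        scores = [dot(qi, k[j]) for j in range(limit + 1)]
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        denom = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / denom
                    for d in range(len(v[0]))])
    return out


def tiled_causal_attention(q: Matrix, k: Matrix, v: Matrix, past_len: int,
                           block: int = 2) -> Matrix:
    """Online softmax over KV blocks with causal early termination."""
    head = len(v[0])
    out = []
    for i, qi in enumerate(q):
        limit = past_len + i
        running_max, running_sum = -math.inf, 0.0
        acc = [0.0] * head
        for start in range(0, len(k), block):
            if start > limit:
                break  # this block and all later ones are fully masked
            stop = min(start + block, limit + 1)
            scores = [dot(qi, k[j]) for j in range(start, stop)]
            new_max = max(running_max, max(scores))
            rescale = math.exp(running_max - new_max)  # 0.0 on the first block
            running_sum *= rescale
            acc = [a * rescale for a in acc]
            for s, j in zip(scores, range(start, stop)):
                w = math.exp(s - new_max)
                running_sum += w
                for d in range(head):
                    acc[d] += w * v[j][d]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


# Three new query rows appended after one cached key (total length 4).
q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
k = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.5], [0.1, -0.2]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]

ref = naive_causal_attention(q, k, v, past_len=1)
tiled = tiled_causal_attention(q, k, v, past_len=1)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
```

Only `block` scores are live at any time in the tiled version, which is the source of the memory saving; the `break` is what skips the masked upper triangle.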
## Testing

- **C++ op tests**: `onnxruntime_provider_test --gtest_filter="GroupQueryAttentionTest.*"` - 38 passed (12 GPU/WebGPU skipped) with flash on (default) and with `ORT_GQA_DISABLE_FLASH_ATTENTION=1`.
- **Flash vs. naive parity** (FP32): output of the flash path matches the naive path (max abs diff ~1e-7) across prefill (block-aligned and non-aligned `S`), MHA and GQA head ratios, and local window. Decode now uses the naive path on both sides (diff 0).
- **Python parity** (`test_gqa_cpu.py`, flash vs. naive reference): focused FP32 sweep of 600 prompt configurations covering all head sizes (32-256), GQA ratios `(6,6)/(6,3)/(9,9)/(9,3)`, batches `1/3/5`, causal/local window, attention bias, position ids, packed QKV, and with/without KV buffer - all passed. The official `test_gqa_cpu.py` suite passes.

Two correctness bugs were found and fixed via the parity sweep while developing this path:

1. Attention-bias batch stride ignored head broadcasting for `[batch, 1, S, T]` bias.
2. Query batch stride was hardcoded to `num_heads * S * H`, which is incorrect for packed-QKV input (correct stride is `(num_heads + 2 * kv_num_heads) * S * H`).
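The second bug above comes down to one piece of arithmetic. A minimal sketch, using a hypothetical helper name and the benchmark's head configuration with a short sequence:

```python
def query_batch_stride(num_heads: int, kv_num_heads: int, seq_len: int,
                       head_size: int, packed_qkv: bool) -> int:
    """Elements between consecutive batches in the query buffer.

    With packed QKV, each token carries Q, K and V heads back to back,
    so the per-token width is num_heads + 2 * kv_num_heads heads.
    """
    heads_per_token = num_heads + 2 * kv_num_heads if packed_qkv else num_heads
    return heads_per_token * seq_len * head_size


# num_heads=16, kv_num_heads=8, head_size=128 as in the benchmark; S=4 for brevity.
unpacked = query_batch_stride(16, 8, 4, 128, packed_qkv=False)  # 16 * 4 * 128
packed = query_batch_stride(16, 8, 4, 128, packed_qkv=True)     # 32 * 4 * 128
```

Using the unpacked stride on a packed buffer would make every batch after the first read from the wrong offset, which is why the bug only shows up with batch sizes above 1.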
…, GQA underflow, and ep_weight_sharing_ctx_gen build (microsoft#28245)

### Description

This PR contains three commits:

**Commit 1: Miscellaneous fixes**

- Downgrade QNN ETW profiling mismatch logs from ERROR to VERBOSE to reduce excessive telemetry noise (~1 billion events/week across Windows devices)
- Add bounds checking in GQA attention to prevent `size_t` underflow when `seqlens_k` contains invalid data (fixes microsoft#27170)
- Build `ep_weight_sharing_ctx_gen` for TensorRT, OpenVINO, and VitisAI in addition to QNN

**Commit 2: Bump cpuinfo and add `cpuinfo_deinitialize()` integration**

Applications that dynamically load and unload the onnxruntime DLL leave orphaned heap allocations from cpuinfo when the library is unloaded mid-process. These are flagged as memory leaks by App Verifier, Valgrind, AddressSanitizer, and LeakSanitizer.

This commit bumps `pytorch/cpuinfo` to a version that implements `cpuinfo_deinitialize()` ([pytorch/cpuinfo#387](pytorch/cpuinfo#387)) and adds ORT integration:

- `CPUIDInfo::ShutDown()` calls `cpuinfo_deinitialize()` to free heap-allocated globals
- `DllMain` calls `ShutdownCpuInfo()` on `DLL_PROCESS_DETACH`
- In memleak-check builds, shutdown also runs during process termination
- `InstanceCreated` atomic guard prevents singleton creation during DLL unload

**Commit 3: Update to official cpuinfo merged fix**

After [pytorch/cpuinfo#387](pytorch/cpuinfo#387) merged upstream, updated the dependency to point to `pytorch/cpuinfo` main (`4628dc06`).

Patch changes:

- **Removed** `win_arm_fp16_detection_fallback.patch`: upstreamed via [pytorch/cpuinfo#348](pytorch/cpuinfo#348)
- **Updated** `patch_vcpkg_arm64ec_support.patch`: regenerated for new cpuinfo; still needed ([pytorch/cpuinfo#324](pytorch/cpuinfo#324) not yet merged)
- **Updated** `patch_cpuinfo_h_for_arm64ec.patch`: retained, not yet upstream
- **Regenerated** `fix_missing_sysfs_fallback.patch`: updated context lines for new cpuinfo code

### Motivation and Context

- pytorch/cpuinfo#150
- microsoft#16117
- microsoft#23762
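The `size_t` underflow mentioned in Commit 1 can be reproduced in Python by emulating unsigned wraparound. This is an illustrative sketch only: it assumes a 64-bit `size_t`, the subtraction shown is an assumed example of where invalid `seqlens_k` data would bite, and the function names are hypothetical.

```python
SIZE_T_MASK = (1 << 64) - 1  # assumption: 64-bit size_t


def unchecked_past_len(total_len: int, sequence_length: int) -> int:
    """Emulates an unsigned subtraction with no bounds check."""
    return (total_len - sequence_length) & SIZE_T_MASK


def checked_past_len(seqlens_k: int, sequence_length: int, max_total: int) -> int:
    """Validate before subtracting so the result can never wrap."""
    total_len = seqlens_k + 1
    if total_len < sequence_length or total_len > max_total:
        raise ValueError(f"seqlens_k={seqlens_k} is out of range")
    return total_len - sequence_length


# Invalid data: total length 0 with 4 new tokens wraps to a huge value.
wrapped = unchecked_past_len(total_len=0, sequence_length=4)

# The checked version rejects the same input instead of wrapping.
try:
    checked_past_len(seqlens_k=-1, sequence_length=4, max_total=16)
    rejected = False
except ValueError:
    rejected = True
```

A wrapped value of this size would pass most "less than capacity" comparisons only by accident, so rejecting the input up front is the safer pattern.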
…icrosoft#29221)

## Description

The CUDA plugin EP previously rejected combining a user-provided compute stream (`user_compute_stream`) with CUDA graph capture (`enable_cuda_graph`), returning `ORT_INVALID_ARGUMENT`. This PR removes that restriction so the two options can be used together: when both are set, graph capture and replay run on the user-owned stream (the same stream the kernels are issued to), matching the bundled (non-plugin) CUDA EP behavior. Several supporting fixes make capture on a shared stream stable and Memcpy-free.

## Summary of Changes

### Allow user stream + CUDA graph

| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc) | Remove the validation that rejected `user_compute_stream` + `enable_cuda_graph` together. |
| [onnxruntime/core/providers/cuda/plugin/cuda_ep.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep.cc) | `PerThreadContext` accepts an optional external graph stream. When both options are set it captures/replays on the user stream and does **not** create or destroy it (the user owns its lifetime); otherwise it owns a dedicated graph stream as before. |

### Stable, Memcpy-free CUDA graph capture

| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h](onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h) | Route kernel scratch/workspace allocations through the EP allocator (BFC arena) instead of raw `cudaMallocAsync`/`cudaMalloc`. After warmup the arena reaches steady state, so the capture run serves scratch from already-reserved chunks and the device free-memory footprint stays stable, which is required for correct capture. Matches the built-in CUDA EP. |
| [onnxruntime/core/providers/cuda/tensor/shape_op.cc](onnxruntime/core/providers/cuda/tensor/shape_op.cc) | Add an adapter-based `Shape` kernel under `#ifdef BUILD_CUDA_EP_AS_PLUGIN` with identical semantics to the CPU `Shape`. Registering `Shape` on the EP keeps it off the CPU EP and avoids the Memcpy nodes that would otherwise break CUDA graph capture. |
| [cmake/onnxruntime_providers_cuda_plugin.cmake](cmake/onnxruntime_providers_cuda_plugin.cmake) | Stop excluding `shape_op.cc` from the plugin build so the adapter-based `Shape` kernel is compiled in. |

### Null-allocator fallback in PrePack (plugin boundary)

In the plugin build the `AllocatorPtr` passed to `PrePack` can arrive null across the library boundary. Each kernel now falls back to its own default-memory allocator (`Info().GetAllocator(OrtMemTypeDefault)`), which is always valid.

- [onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc](onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc)
- [onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc](onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc)
- [onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc](onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc)

### Misc

- [onnxruntime/core/framework/session_state.cc](onnxruntime/core/framework/session_state.cc): wrap a long line (no behavior change).

## Testing

- New test: [onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc](onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc) covering:
  1. Session creation succeeds with both `user_compute_stream` and `enable_cuda_graph` set (regression for the removed validation).
  2. Capture + replay on the user stream produce correct results.
  3. Replay after an in-place input update on the user stream is correct.
- Tests are gated on `ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP` and skip gracefully when no CUDA device or plugin library is available.

## Motivation and Context

Users that drive ORT from their own CUDA stream (e.g. to interleave ORT inference with their own kernels) previously could not also benefit from CUDA graph capture on the plugin EP. This change brings the plugin EP to parity with the bundled CUDA EP for that workflow.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (relaxes a previously rejected option combination)
- [ ] Documentation updated (if applicable)
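The stream-ownership rule described for `PerThreadContext` (use the user's stream if one is supplied and never destroy it, otherwise create and own a dedicated one) can be modeled without any CUDA dependency. The classes below are hypothetical stand-ins written for illustration; they are not the plugin EP's types and make no CUDA calls.

```python
from typing import Optional


class StreamSketch:
    """Hypothetical stand-in for a CUDA stream handle."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.destroyed = False

    def destroy(self) -> None:
        self.destroyed = True


class PerThreadContextSketch:
    """Models the ownership rule only: borrow an external stream or own one."""

    def __init__(self, external_stream: Optional[StreamSketch] = None) -> None:
        self._owns_stream = external_stream is None
        self.graph_stream = (
            external_stream if external_stream is not None
            else StreamSketch("ep-owned-graph-stream")
        )

    def close(self) -> None:
        # Only tear down a stream this context created itself.
        if self._owns_stream:
            self.graph_stream.destroy()


# User stream + graph capture: the context borrows the stream.
user_stream = StreamSketch("user")
borrowed = PerThreadContextSketch(external_stream=user_stream)
borrowed.close()

# No user stream: the context creates and later destroys its own.
owned = PerThreadContextSketch()
owned_stream = owned.graph_stream
owned.close()
```

Tracking ownership with a flag set at construction keeps teardown a single conditional and avoids destroying a stream the caller still uses.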
## Summary

- align CPU ONNX Attention causal masking with upper-left behavior for q_len=1, kv_len>1, no past
- preserve the existing `nonpad_kv_seqlen` / TensorScatter single-query causal behavior
- update Python attention reference causal mask to model ONNX upper-left alignment with an explicit past offset
- add a regression test for issue microsoft#29020

Fixes microsoft#29020

## Validation

- `python -m py_compile onnxruntime/test/python/transformers/test_onnx_attention/common.py onnxruntime/test/python/transformers/test_onnx_attention/test_mha.py onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py`
- `git diff --check`

Notes:

- `pytest onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py -k "cpu_fp32 and causal" -q` could not run locally because this Python environment does not have `onnx` / `onnxruntime` installed.
- After the latest follow-up commit, an incremental rebuild of `onnxruntime_provider_test` was attempted but failed in MSBuild before compiling this change due to a local environment issue: duplicate `Path` / `PATH` environment keys when launching `CL.exe`.
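Upper-left alignment with an explicit past offset can be written as a one-line mask rule. The sketch below is a plain-Python illustration of that rule as read from the summary above, not the code in `common.py`; the function name and signature are hypothetical.

```python
from typing import List


def causal_mask(q_len: int, kv_len: int, past_len: int = 0) -> List[List[bool]]:
    """True where query row i may attend to key column j.

    Upper-left alignment: row i sees keys 0..i, shifted right by past_len
    when earlier tokens are already in the cache.
    """
    return [[j <= i + past_len for j in range(kv_len)] for i in range(q_len)]


# q_len=1, kv_len=3, no past: the single query sees only the first key.
no_past = causal_mask(q_len=1, kv_len=3)

# Same shapes with two cached tokens: the query sees all three keys.
with_past = causal_mask(q_len=1, kv_len=3, past_len=2)

# Square prefill gives the usual lower-triangular mask.
square = causal_mask(q_len=3, kv_len=3)
```

The `q_len=1, kv_len>1, no past` case is where upper-left and bottom-right alignment disagree: bottom-right alignment would let the single query attend to every key.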
Backmerge master into ovep-develop to synchronize the latest upstream changes