Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores by tomjen12 · Pull Request #19 · PyFR/GiMMiK

tomjen12 · 2026-06-18T05:44:03Z

Summary

Adds tuned HIP GiMMiK preload and vector-width variants for PyFR-style matrix multiplication cases.

The new variants include C-preload kernels for cstream/bstream and aligned width-2 preload variants. Existing and new HIP templates now use shared non-temporal load/store helpers for C accesses.

Results

On MI300X PyFR GiMMiK matmul benchmarks across 53 double-precision p2-p5 cases, bandwidth efficiency versus a 4.4 TB/s target improved from:

Baseline: 76% min, 84% geomean
This branch: 86% min, 94% geomean

Companion PyFR PR: PyFR/PyFR#567

Test plan

Ran PyFR HIP GiMMiK matmul correctness checks against NumPy reference results.
Ran MI300X double-precision p2-p5 benchmark cases.

tomjen12 · 2026-06-18T06:20:18Z

Hi @FreddieWitherden , could you please review this PR when you have a chance?
I also noticed I cannot request reviewers from the sidebar. Is reviewer assignment limited to repository members?

FreddieWitherden · 2026-06-18T12:56:39Z

+          if A[j, kx] != 0:
+              nzixs.append((l_idx, kx))
+
+      has_dotp = any(A[j, kx] != 0 for kx in range(k))


As A is a numpy array this can likely be simplified as A[j].any() (I think!). Same for the above for where we can eliminate the explicit Python loop with something built upon A[j, kbx].

FreddieWitherden · 2026-06-18T12:57:47Z

+
+% if width == 2:
+static inline __device__ ${dtype}
+gimmik_vmul(${dtype[:-1]} a, ${dtype} b)


Might be worth moving these up into a shared file which other kernels can include via Mako. Keeps all of the vector stuff in one place.

FreddieWitherden · 2026-06-18T13:05:49Z

Overall looks good. Few quick things:

Confirm nothing breaks on the consumer SKU's. Performance is not critical here but some of our users use RDNA GPUs for local development and wave32 vs 64 has tripped us up in the past.
As we're now emitting more kernels this can have a negative impact on the start-up time of PyFR when they are JIT'ed for the first time. If you have reasonable heuristics it might be adopting the approach of PTX Backend #18 which has JSON files with tuning info (and am working on that PR to get the code moved into the base class so all generators can share it). PyFR passes in the wave_size and GCN arch name (queried from HIP) to GiMMiK which can be used for dispatch. This is especially useful if a certain trick always yields a benefit.

FreddieWitherden · 2026-06-18T14:59:02Z

Couple of other notes (mostly for me):

Confirm compatibility with PyFR v3.1 which is the latest released version. This only uses the static n kernel variants so shouldn't have any issues with wide width kernels (since GiMMiK provides the exact block sizes).
Confirm no regressions on MI250X which is widely used due to its deployment on supercomputers in the US and Europe. It is fine if performance doesn't improve, but we want to avoid regressions.

tomjen12 added 2 commits June 18, 2026 00:06

Add tuned HIP GiMMiK preload variants

8e23d63

Fix HIP GiMMiK block size metadata

8f4d03e

tomjen12 marked this pull request as draft June 18, 2026 05:47

tomjen12 marked this pull request as ready for review June 18, 2026 05:48

tomjen12 mentioned this pull request Jun 18, 2026

Support HIP GiMMiK vector-width metadata PyFR/PyFR#567

Open

FreddieWitherden reviewed Jun 18, 2026

View reviewed changes

FreddieWitherden self-assigned this Jun 18, 2026

tomjen12 added 2 commits June 21, 2026 23:18

Address HIP GiMMiK review comments

96671a6

Handle ROCm feature suffixes for gfx942 tuning

739a82e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19
tomjen12 wants to merge 4 commits into
PyFR:masterfrom
tomjen12:hip-gimmik-optimized

tomjen12 commented Jun 18, 2026 •

edited

Loading

Uh oh!

tomjen12 commented Jun 18, 2026

Uh oh!

FreddieWitherden Jun 18, 2026

Uh oh!

FreddieWitherden Jun 18, 2026

Uh oh!

FreddieWitherden commented Jun 18, 2026

Uh oh!

FreddieWitherden commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tomjen12 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Results

Test plan

Uh oh!

tomjen12 commented Jun 18, 2026

Uh oh!

FreddieWitherden Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

FreddieWitherden Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

FreddieWitherden commented Jun 18, 2026

Uh oh!

FreddieWitherden commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomjen12 commented Jun 18, 2026 •

edited

Loading