Skip to content

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19

Open
tomjen12 wants to merge 4 commits into
PyFR:masterfrom
tomjen12:hip-gimmik-optimized
Open

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19
tomjen12 wants to merge 4 commits into
PyFR:masterfrom
tomjen12:hip-gimmik-optimized

Conversation

@tomjen12

@tomjen12 tomjen12 commented Jun 18, 2026

Copy link
Copy Markdown

Summary

Adds tuned HIP GiMMiK preload and vector-width variants for PyFR-style matrix multiplication cases.

The new variants include C-preload kernels for cstream/bstream and aligned width-2 preload variants. Existing and new HIP templates now use shared non-temporal load/store helpers for C accesses.

Results

On MI300X PyFR GiMMiK matmul benchmarks across 53 double-precision p2-p5 cases, bandwidth efficiency versus a 4.4 TB/s target improved from:

  • Baseline: 76% min, 84% geomean
  • This branch: 86% min, 94% geomean

Companion PyFR PR: PyFR/PyFR#567

Test plan

  • Ran PyFR HIP GiMMiK matmul correctness checks against NumPy reference results.
  • Ran MI300X double-precision p2-p5 benchmark cases.

@tomjen12 tomjen12 marked this pull request as draft June 18, 2026 05:47
@tomjen12 tomjen12 marked this pull request as ready for review June 18, 2026 05:48
@tomjen12

Copy link
Copy Markdown
Author

Hi @FreddieWitherden , could you please review this PR when you have a chance?
I also noticed I cannot request reviewers from the sidebar. Is reviewer assignment limited to repository members?

if A[j, kx] != 0:
nzixs.append((l_idx, kx))

has_dotp = any(A[j, kx] != 0 for kx in range(k))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As A is a numpy array this can likely be simplified as A[j].any() (I think!). Same for the above for where we can eliminate the explicit Python loop with something built upon A[j, kbx].


% if width == 2:
static inline __device__ ${dtype}
gimmik_vmul(${dtype[:-1]} a, ${dtype} b)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth moving these up into a shared file which other kernels can include via Mako. Keeps all of the vector stuff in one place.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Overall looks good. Few quick things:

  • Confirm nothing breaks on the consumer SKU's. Performance is not critical here but some of our users use RDNA GPUs for local development and wave32 vs 64 has tripped us up in the past.
  • As we're now emitting more kernels this can have a negative impact on the start-up time of PyFR when they are JIT'ed for the first time. If you have reasonable heuristics it might be adopting the approach of PTX Backend #18 which has JSON files with tuning info (and am working on that PR to get the code moved into the base class so all generators can share it). PyFR passes in the wave_size and GCN arch name (queried from HIP) to GiMMiK which can be used for dispatch. This is especially useful if a certain trick always yields a benefit.

@FreddieWitherden FreddieWitherden self-assigned this Jun 18, 2026
@FreddieWitherden

Copy link
Copy Markdown
Contributor

Couple of other notes (mostly for me):

  • Confirm compatibility with PyFR v3.1 which is the latest released version. This only uses the static n kernel variants so shouldn't have any issues with wide width kernels (since GiMMiK provides the exact block sizes).
  • Confirm no regressions on MI250X which is widely used due to its deployment on supercomputers in the US and Europe. It is fine if performance doesn't improve, but we want to avoid regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants