Skip to content

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19

Open
tomjen12 wants to merge 8 commits into
PyFR:masterfrom
tomjen12:hip-gimmik-optimized
Open

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19
tomjen12 wants to merge 8 commits into
PyFR:masterfrom
tomjen12:hip-gimmik-optimized

Conversation

@tomjen12

@tomjen12 tomjen12 commented Jun 18, 2026

Copy link
Copy Markdown

Summary

Adds tuned HIP GiMMiK preload and vector-width variants for PyFR-style matrix multiplication cases.

The new variants include C-preload kernels for cstream/bstream and aligned width-2 preload variants. Existing and new HIP templates now use shared non-temporal load/store helpers for C accesses.

Results

On MI300X PyFR GiMMiK matmul benchmarks across 53 double-precision p2-p5 cases, bandwidth efficiency versus a 4.4 TB/s target improved from:

  • Baseline: 76% min, 84% geomean
  • This branch: 86% min, 94% geomean

Companion PyFR PR: PyFR/PyFR#567

Test plan

  • Ran PyFR HIP GiMMiK matmul correctness checks against NumPy reference results.
  • Ran MI300X double-precision p2-p5 benchmark cases.

@tomjen12 tomjen12 marked this pull request as draft June 18, 2026 05:47
@tomjen12 tomjen12 marked this pull request as ready for review June 18, 2026 05:48
@tomjen12

Copy link
Copy Markdown
Author

Hi @FreddieWitherden , could you please review this PR when you have a chance?
I also noticed I cannot request reviewers from the sidebar. Is reviewer assignment limited to repository members?

if A[j, kx] != 0:
nzixs.append((l_idx, kx))
has_dotp = any(A[j, kx] != 0 for kx in range(k))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As A is a numpy array this can likely be simplified as A[j].any() (I think!). Same for the above for where we can eliminate the explicit Python loop with something built upon A[j, kbx].


% if width == 2:
static inline __device__ ${dtype}
gimmik_vmul(${dtype[:-1]} a, ${dtype} b)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth moving these up into a shared file which other kernels can include via Mako. Keeps all of the vector stuff in one place.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Overall looks good. Few quick things:

  • Confirm nothing breaks on the consumer SKU's. Performance is not critical here but some of our users use RDNA GPUs for local development and wave32 vs 64 has tripped us up in the past.
  • As we're now emitting more kernels this can have a negative impact on the start-up time of PyFR when they are JIT'ed for the first time. If you have reasonable heuristics it might be adopting the approach of PTX Backend #18 which has JSON files with tuning info (and am working on that PR to get the code moved into the base class so all generators can share it). PyFR passes in the wave_size and GCN arch name (queried from HIP) to GiMMiK which can be used for dispatch. This is especially useful if a certain trick always yields a benefit.

@FreddieWitherden FreddieWitherden self-assigned this Jun 18, 2026
@FreddieWitherden

Copy link
Copy Markdown
Contributor

Couple of other notes (mostly for me):

  • Confirm compatibility with PyFR v3.1 which is the latest released version. This only uses the static n kernel variants so shouldn't have any issues with wide width kernels (since GiMMiK provides the exact block sizes).
  • Confirm no regressions on MI250X which is widely used due to its deployment on supercomputers in the US and Europe. It is fine if performance doesn't improve, but we want to avoid regressions.

@tomjen12

tomjen12 commented Jun 22, 2026

Copy link
Copy Markdown
Author

Thanks for the review and advice.
Updates made based on your comments:

  • Simplified the NumPy row checks using array operations, following the suggestion around A[j].any() and avoiding explicit Python loops where possible.
  • Moved the vector helper code into a shared Mako include so the width/vector logic is kept in one place and reused by the HIP kernels.
  • Confirmed compatibility with PyFR v3.1. The optimized HIP variants pass the correctness checks with PyFR v3.1.
  • Confirmed MI250X (gfx90a) correctness as well. The 22 variants pass accuracy checks on the tested gfx90a cases, and performance also looks improved overall.
  • For dispatch, I am currently gating the tuned HIP variants using the HIP-reported GCN arch and wave size:
    base_arch = gcn_arch.split(':', 1)[0] if gcn_arch else None
    if base_arch not in {'gfx90a', 'gfx942'} or warp_size != 64:
        return
        

This enables the tuned variants for validated gfx90a and gfx942 wave64 targets while avoiding other architectures for now.

  • RDNA / consumer GPU testing is still pending. I am currently looking for access to a suitable machine to verify wave32 behavior before enabling anything there.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Thanks!

Going through the PR there appear to be three main improvements.

  • Preloading Current GiMMiK kernels attempt to load parts of $C$ on demand, whereas the PR also considers doing the loads all upfront. If this is found to always be beneficial then we probably can eliminate the on-demand paths. (The original thoughts around incremental loading were to get some extra ILP, but I don't think I ever considered preloading.)
  • Non-temporal loads/stores These are used in some guise on NVIDIA already. Again, if we find they are always beneficial we should drop support for the non-temporal path. If they're neutral we can probably go either way. If there are cases where they help and others where they regress we should see if there is a simple heuristic to guide us. For example, on x86 CPUs we use non-temporal stores when the matrix is above a certain size (32 MiB if I recall). Should we keep both it is worth unifying the kernels. This can be done through either inline device functions, classic C preprocessor macros, or passing Python lamdba's to the Mako templates and evaluating them to yield the correct syntax. All of these approaches are fine.
  • Wider kernels This is something we used to have on HIP but it was removed in 07220ea Is there any reason this path can not be brought back as-is? The benefit is that it shares a kernel template with the current code.

All in, if I've understood everything correctly I do not think we need any new templates, just extra parameterization on the templates we have. This tends to make things more manageable and easier to understand so it would be preferable. Again, it is possible I've missed something about the new kernels which really does require a different overall structure, so let me know there.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Have just run the PR through my benchmark suite (p = 2 to 5, double precision, ~4 GiB of data so we always asymptote) and find the overall performance to be about the same (geomean ratio PR/base: 1.003x). Did your benchmarking also include the rocBLAS kernels? It could be that the wins are coming from kernels where rocBLAS does better than either the new or old GiMMiK kernels.

@tomjen12

tomjen12 commented Jun 23, 2026

Copy link
Copy Markdown
Author

Thanks for checking this.

Could you share a bit more detail about your benchmark setup?

  • What platform/GPU are you using for the test?
  • What ROCm version or Docker image are you using?
  • Could you share the benchmark script or command line you used? I can run the same setup on my side for comparison.
    Also, the new HIP variants are currently gated on validated wave64 targets:
base_arch = gcn_arch.split(':', 1)[0] if gcn_arch else None
if base_arch not in {'gfx90a', 'gfx942'} or warp_size != 64:
    return

Could you confirm whether your test platform reports gfx90a or gfx942 with warp_size == 64? If not, the new tuned variants may not be emitted. As a quick check, could you also try temporarily disabling this arch gate and rerun the benchmark?

@tomjen12

Copy link
Copy Markdown
Author

Regarding the template structure, I will also take another look and see how much can be merged back into the existing HIP templates.

@tomjen12

Copy link
Copy Markdown
Author

Thanks for the detailed feedback.

I have updated the HIP kernels so that width is now a template parameter rather than requiring separate width-specific templates. This lets the existing templates generate the wider variants directly.

For the non-temporal C load/store path, I tested the four basic HIP GiMMiK variants over the 53 PyFR matrix cases. Across 212 per-variant comparisons, the non-temporal path wins 190 cases and gives a +10.51% geomean bandwidth improvement. There are a few per-variant regressions, but for the affected cases other variants remain non-regressing, so the autotuner should still be able to avoid a loss in the final selected kernel. Based on this, I think using the non-temporal C path by default is justified.

image

For preload-C, the result is different. It is not uniformly beneficial: preload-C wins 83/212 comparisons, while the on-demand path wins 129/212, with an overall geomean bandwidth speedup of 0.94x. Some cases do improve substantially, so I think preload-C is still useful to keep as additional autotune candidates, but it should not replace the on-demand variants outright.

image

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Could you confirm whether your test platform reports gfx90a or gfx942 with warp_size == 64? If not, the new tuned variants may not be emitted. As a quick check, could you also try temporarily disabling this arch gate and rerun the benchmark?

It was a caching issue on my end. I can now reproduce your results on an MI300X w/ROCm 7.14.

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Thanks for the detailed feedback.

I have updated the HIP kernels so that width is now a template parameter rather than requiring separate width-specific templates. This lets the existing templates generate the wider variants directly.

Thank you for investigating. Assuming non-temporal loads/stores are widely available (so compiles for older CDNA and RDNA) I think the best course of action is to enable them everywhere all the time for HIP. This simplifies the code and puts us at parity with CUDA where we also ask for quasi-non temporal stores.

In terms of preloading I agree with making this an option in the auto-tuner. Extra kernel variations are not as big of a deal for HIP since the kernels are generic on n whereas on CUDA the sizes are baked into the code so we pay more of a hit.

Only other thing to check is that we yield kernels in a sensible order. To keep start-up times down PyFR can be instructed to only consider the first m kernels from GiMMiK. Hence, it is worth seeing if we can yield them in a sensible-ish order (or better still, if there are combos which never seem to win out, just prune them on our end). No need to strive for perfection here, it is more a "nice to have" rather than something essential.

(The long term goal is to make things adaptive; PyFR when it compiles a kernel passes information about the register use and spillage back to GiMMiK. At the moment this is discarded, but could be useful in the future for things like "don't bother with a width = 2 kernel if width = 1 spills".)

Finally, while it is not a priority at the moment, can you check there are no major regressions single precision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants