Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19
Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19tomjen12 wants to merge 8 commits into
Conversation
|
Hi @FreddieWitherden , could you please review this PR when you have a chance? |
| if A[j, kx] != 0: | ||
| nzixs.append((l_idx, kx)) | ||
| has_dotp = any(A[j, kx] != 0 for kx in range(k)) |
There was a problem hiding this comment.
As A is a numpy array this can likely be simplified as A[j].any() (I think!). Same for the above for where we can eliminate the explicit Python loop with something built upon A[j, kbx].
|
|
||
| % if width == 2: | ||
| static inline __device__ ${dtype} | ||
| gimmik_vmul(${dtype[:-1]} a, ${dtype} b) |
There was a problem hiding this comment.
Might be worth moving these up into a shared file which other kernels can include via Mako. Keeps all of the vector stuff in one place.
|
Overall looks good. Few quick things:
|
|
Couple of other notes (mostly for me):
|
|
Thanks for the review and advice.
This enables the tuned variants for validated gfx90a and gfx942 wave64 targets while avoiding other architectures for now.
|
|
Thanks! Going through the PR there appear to be three main improvements.
All in, if I've understood everything correctly I do not think we need any new templates, just extra parameterization on the templates we have. This tends to make things more manageable and easier to understand so it would be preferable. Again, it is possible I've missed something about the new kernels which really does require a different overall structure, so let me know there. |
|
Have just run the PR through my benchmark suite (p = 2 to 5, double precision, ~4 GiB of data so we always asymptote) and find the overall performance to be about the same (geomean ratio PR/base: 1.003x). Did your benchmarking also include the rocBLAS kernels? It could be that the wins are coming from kernels where rocBLAS does better than either the new or old GiMMiK kernels. |
|
Thanks for checking this. Could you share a bit more detail about your benchmark setup?
base_arch = gcn_arch.split(':', 1)[0] if gcn_arch else None
if base_arch not in {'gfx90a', 'gfx942'} or warp_size != 64:
returnCould you confirm whether your test platform reports gfx90a or gfx942 with warp_size == 64? If not, the new tuned variants may not be emitted. As a quick check, could you also try temporarily disabling this arch gate and rerun the benchmark? |
|
Regarding the template structure, I will also take another look and see how much can be merged back into the existing HIP templates. |
It was a caching issue on my end. I can now reproduce your results on an MI300X w/ROCm 7.14. |
Thank you for investigating. Assuming non-temporal loads/stores are widely available (so compiles for older CDNA and RDNA) I think the best course of action is to enable them everywhere all the time for HIP. This simplifies the code and puts us at parity with CUDA where we also ask for quasi-non temporal stores. In terms of preloading I agree with making this an option in the auto-tuner. Extra kernel variations are not as big of a deal for HIP since the kernels are generic on Only other thing to check is that we yield kernels in a sensible order. To keep start-up times down PyFR can be instructed to only consider the first (The long term goal is to make things adaptive; PyFR when it compiles a kernel passes information about the register use and spillage back to GiMMiK. At the moment this is discarded, but could be useful in the future for things like "don't bother with a width = 2 kernel if width = 1 spills".) Finally, while it is not a priority at the moment, can you check there are no major regressions single precision. |


Summary
Adds tuned HIP GiMMiK preload and vector-width variants for PyFR-style matrix multiplication cases.
The new variants include C-preload kernels for cstream/bstream and aligned width-2 preload variants. Existing and new HIP templates now use shared non-temporal load/store helpers for C accesses.
Results
On MI300X PyFR GiMMiK matmul benchmarks across 53 double-precision p2-p5 cases, bandwidth efficiency versus a 4.4 TB/s target improved from:
Companion PyFR PR: PyFR/PyFR#567
Test plan