⚡ KernelBench — GPU Kernel Performance Analysis

This project compares PyTorch eager execution with torch.compile across common deep learning operations, measuring execution time, throughput, TFLOPS, and memory bandwidth. It reports every result, including those where compilation made things worse, not only those where it helped.

Why this exists

Modern deep learning workloads are increasingly bottlenecked by GPU kernel efficiency rather than model architecture alone. Compiler optimizations such as torch.compile are often treated as a free speedup. This benchmark tests that assumption directly: does compilation actually help, for which operations, and at what sizes? The honest answer is that it depends, and sometimes the answer is no.

A methodological note before the results

Many GPU benchmarks report isolated timing from a single tensor size or operation, making it hard to tell whether a reported speedup generalizes. This benchmark evaluates 4 operations across 4–5 tensor sizes each (19 configurations total), reporting execution time, throughput, TFLOPS or memory bandwidth, and relative speedup for every one — including the cases where torch.compile underperformed eager execution.

Headline Results

Operation	Mean Speedup (compiled vs. eager)	Range	Result
layer_norm	2.12×	1.07× – 4.11×	Consistent, substantial win
attention	1.06×	0.70× – 1.62×	Near parity, modest win
matmul	0.91×	0.41× – 1.23×	Net regression
softmax	0.64×	0.27× – 0.95×	Net regression

torch.compile is not a uniform speedup. It delivered a substantial, consistent win for layer_norm (peak 4.11×) and a modest one for attention, but at the sizes tested it ran matmul at a mean 0.91× and softmax at a mean 0.64× of eager speed, i.e. 9% and 36% lower throughput than eager PyTorch. The most likely explanation is that these operations are already cheap and well-fused in eager mode (softmax is a single tuned CUDA kernel; matmul is a single cuBLAS call), so there is little overhead left for the compiler to remove and the overhead that compilation adds does not fully amortize. Full discussion in results/report.md.
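The percentages above depend on how a speedup ratio is read. The sketch below assumes speedup is defined as eager time divided by compiled time (the README does not state the formula), and it treats each mean speedup as if it were a single measurement, which is only an approximation for an average of ratios. The helper name `relative_change` is illustrative and not taken from the notebook.

```python
def relative_change(speedup: float) -> dict:
    """Convert a speedup ratio into percent changes.

    Assumption: speedup = eager_time / compiled_time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return {
        # Throughput scales directly with the speedup ratio.
        "throughput_change_pct": (speedup - 1.0) * 100.0,
        # Runtime scales with the inverse of the speedup ratio.
        "runtime_change_pct": (1.0 / speedup - 1.0) * 100.0,
    }


# Mean speedups copied from the Headline Results table.
headline = [
    ("layer_norm", 2.12),
    ("attention", 1.06),
    ("matmul", 0.91),
    ("softmax", 0.64),
]

for op, s in headline:
    change = relative_change(s)
    print(
        f"{op:<10} throughput {change['throughput_change_pct']:+.0f}%  "
        f"runtime {change['runtime_change_pct']:+.0f}%"
    )
```

Under this reading, a 0.64× speedup corresponds to 36% lower throughput but roughly 56% longer runtime, so it is worth stating which of the two a "percent slower" figure refers to.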

Experimental Setup

Framework: PyTorch
Compiler: torch.compile; Triton kernel implemented for softmax only
Hardware: NVIDIA T4 GPU
Precision: float16
Iterations: 10 warmup, then 100 measured, per configuration
Operations: Matrix Multiplication (512² – 4,096²), Softmax (1,024–65,536 elements/row), LayerNorm (256–4,096), Scaled Dot-Product Attention (seq_len 128–2,048)
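The warmup-then-measure protocol in the setup can be sketched as a small timing harness. This is a minimal illustration, not the notebook's actual code: the function `benchmark` and its `sync` parameter are hypothetical names, and the demonstration times a CPU-only workload so that it runs without a GPU.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(
    fn: Callable[[], object],
    warmup: int = 10,
    iters: int = 100,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time fn() over `iters` calls after `warmup` untimed calls.

    `sync` is an optional barrier called before and after each timed call,
    so that asynchronous work has finished when the clock is read.
    """

    def _sync() -> None:
        if sync is not None:
            sync()

    # Untimed warmup calls: caches, lazy initialization, one-time compilation.
    for _ in range(warmup):
        fn()
    _sync()

    times_ms = []
    for _ in range(iters):
        _sync()
        start = time.perf_counter()
        fn()
        _sync()
        times_ms.append((time.perf_counter() - start) * 1e3)

    return {
        "mean_ms": statistics.fmean(times_ms),
        "median_ms": statistics.median(times_ms),
        "min_ms": min(times_ms),
        "iters": iters,
    }


# CPU-only demonstration workload (an assumption, chosen so the sketch runs anywhere).
result = benchmark(lambda: sum(range(10_000)))
print(result)
```

On a CUDA device, one would presumably pass `sync=torch.cuda.synchronize` and wrap the eager or compiled operation in `fn`, since GPU kernels launch asynchronously. Timing with `torch.cuda.Event(enable_timing=True)` is a common alternative; which of the two the notebook uses is not stated here.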

Repository Structure

kernelbench/
│
├── notebook/
│   └── KernelBench.ipynb
│
├── figures/
│   ├── Peak Speedup.png
│   ├── Pytorch vs Compiled.png
│   ├── Compiled Kernel Speedup Ranked.png
│   ├── attention_latency.png
│   ├── matmul_tflops.png
│   └── speedup_summary.png
│
├── results/
│   ├── kernelbench.csv
│   ├── benchmark_summary.json
│   └── report.md
│
├── paper/
│   └── paper.md
│
├── README.md
└── requirements.txt

Reproducing

Run the notebook from top to bottom on a CUDA-enabled GPU. The notebook will:

Execute all 19 benchmark configurations across the 4 operations.
Measure execution time and compute throughput, TFLOPS, and memory bandwidth.
Generate the visualizations listed under figures/ above.
Save raw results to results/kernelbench.csv and figures to figures/.
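The derived metrics in the steps above can be computed from a measured time roughly as follows. This is a sketch under stated assumptions, not the notebook's formulas: it counts 2·n³ floating-point operations for an n×n by n×n matmul, and it assumes a memory-bound operation reads each element once and writes it once at 2 bytes per element (float16). The function names and the timings in the example are placeholders, not measured results.

```python
def matmul_tflops(n: int, time_ms: float) -> float:
    """TFLOPS for an n x n by n x n matmul, counting 2 * n^3 operations."""
    flops = 2 * n**3
    return flops / (time_ms * 1e-3) / 1e12


def elementwise_bandwidth_gbs(
    rows: int, cols: int, time_ms: float, bytes_per_elem: int = 2
) -> float:
    """Effective GB/s, assuming one read and one write per element."""
    bytes_moved = 2 * rows * cols * bytes_per_elem
    return bytes_moved / (time_ms * 1e-3) / 1e9


def speedup(eager_ms: float, compiled_ms: float) -> float:
    """Speedup of compiled over eager; values below 1.0 are regressions."""
    return eager_ms / compiled_ms


# Placeholder timings for illustration only (not results from this benchmark).
print(f"matmul 4096: {matmul_tflops(4096, 10.0):.2f} TFLOPS")
print(f"softmax 4096x1024: {elementwise_bandwidth_gbs(4096, 1024, 1.0):.2f} GB/s")
print(f"speedup: {speedup(2.0, 1.0):.2f}x")
```

The bandwidth figure is an effective rate: a kernel that makes several passes over its input moves more bytes than this model counts, so the number is best used for comparing implementations of the same operation.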

Limitations

Single GPU type (T4) — results may differ on newer architectures with different compiler backends.
Triton kernel implemented for softmax only; the other 3 operations compare PyTorch eager vs. compiled only, not against a hand-tuned Triton baseline.
Single run per configuration — no variance estimate across repeated runs.
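A variance estimate of the kind the last limitation refers to could be obtained by repeating each configuration several times and summarizing the per-run means. The sketch below is one possible approach, with a hypothetical helper name and placeholder inputs; it is not part of the current notebook.

```python
import statistics
from typing import Sequence


def summarize_runs(run_means_ms: Sequence[float]) -> dict:
    """Mean, sample standard deviation, and coefficient of variation across runs."""
    if len(run_means_ms) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(run_means_ms)
    stdev = statistics.stdev(run_means_ms)  # sample standard deviation (n - 1)
    return {
        "mean_ms": mean,
        "stdev_ms": stdev,
        "cv_pct": 100.0 * stdev / mean,
    }


# Placeholder per-run means in milliseconds (illustrative, not measured).
print(summarize_runs([1.00, 1.10, 0.90]))
```

A coefficient of variation that is comparable in size to the gap between eager and compiled would suggest that a reported speedup near 1.0× is within noise.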

Future Work

Implement Triton kernels for matmul, layer_norm, and attention, to test whether they outperform both eager and compiled PyTorch.
Benchmark on a more modern GPU architecture (A100 or H100) to see whether the matmul/softmax regression is T4-specific.
Profile compiled kernels directly (e.g. torch.profiler) to confirm the compilation-overhead hypothesis rather than infer it from timing alone.
Compare TorchInductor against FlashAttention implementations.
Evaluate mixed-precision performance and extend to larger transformer workloads.
