hydrangeas20/kernelbench

⚡ KernelBench — GPU Kernel Performance Analysis

This project investigates the performance of PyTorch eager execution vs. torch.compile across common deep learning operations, measuring execution time, throughput, TFLOPS, and memory bandwidth — and reports the result even where compilation made things worse, not only where it helped.

Why this exists

Modern deep learning workloads are increasingly bottlenecked by GPU kernel efficiency rather than model architecture alone. Compiler optimizations such as torch.compile are often treated as a free speedup. This benchmark tests that assumption directly: does compilation actually help, for which operations, and at what sizes? The honest answer is that it depends, and sometimes the answer is no.

A methodological note before the results

Many GPU benchmarks report isolated timing from a single tensor size or operation, making it hard to tell whether a reported speedup generalizes. This benchmark evaluates 4 operations across 4–5 tensor sizes each (19 configurations total), reporting execution time, throughput, TFLOPS or memory bandwidth, and relative speedup for every one — including the cases where torch.compile underperformed eager execution.

Headline Results

| Operation | Mean speedup (compiled vs. eager) | Range | Result |
|---|---|---|---|
| layer_norm | 2.12× | 1.07× – 4.11× | Consistent, substantial win |
| attention | 1.06× | 0.70× – 1.62× | Near parity, modest win |
| matmul | 0.91× | 0.41× – 1.23× | Net regression |
| softmax | 0.64× | 0.27× – 0.95× | Net regression |

torch.compile is not a uniform speedup. It delivered a substantial, consistent win for layer_norm (mean 2.12×, peak 4.11×) and a modest one for attention (mean 1.06×), but matmul and softmax regressed at the sizes tested, with mean speedups of 0.91× and 0.64× (compiled throughput roughly 9% and 36% below eager PyTorch). The most likely explanation is that compilation overhead doesn't fully amortize for operations that are already cheap and well-fused in eager mode (softmax is a single tuned CUDA kernel; matmul is a single cuBLAS call), so there is simply less overhead left to remove. Full discussion in results/report.md.
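The relationship between a speedup factor and "percent slower" is easy to misread, so here is a minimal sketch of the arithmetic. It assumes speedup is defined as eager time divided by compiled time (consistent with values above 1 being wins in the table); the function names are hypothetical and not taken from the notebook. The conversion is exact for a single configuration and only approximate when applied to a mean over several configurations.

```python
def speedup(eager_ms: float, compiled_ms: float) -> float:
    """Speedup of compiled over eager; values below 1.0 mean compiled is slower.

    Assumed definition: eager time / compiled time.
    """
    return eager_ms / compiled_ms


def relative_time_change(speedup_factor: float) -> float:
    """Fractional change in execution time implied by a speedup factor."""
    return 1.0 / speedup_factor - 1.0


def relative_throughput_change(speedup_factor: float) -> float:
    """Fractional change in throughput implied by a speedup factor."""
    return speedup_factor - 1.0


# Mean speedups reported in the headline table.
for name, s in [("layer_norm", 2.12), ("attention", 1.06),
                ("matmul", 0.91), ("softmax", 0.64)]:
    print(f"{name:>10}: time {relative_time_change(s):+.1%}, "
          f"throughput {relative_throughput_change(s):+.1%}")
```

Read this way, a 0.64× speedup corresponds to 36% lower throughput but about 56% more time per call, which is why it helps to state which of the two a percentage refers to.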

Experimental Setup

  • Framework: PyTorch
  • Compiler: torch.compile; Triton kernel implemented for softmax only
  • Hardware: NVIDIA T4 GPU
  • Precision: float16
  • Warmup: 10 iterations; measurement: 100 iterations per configuration
  • Operations: Matrix Multiplication (square matrices, 512×512 to 4,096×4,096), Softmax (1,024–65,536 elements/row), LayerNorm (256–4,096), Scaled Dot-Product Attention (seq_len 128–2,048)
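The warmup/measurement protocol above can be sketched as a small timing helper. This is an illustrative sketch, not the notebook's actual harness: the name `benchmark` and its parameters are hypothetical, and it uses wall-clock timing with an optional synchronization hook rather than whatever timer the notebook uses.

```python
import time
from typing import Callable, Optional


def benchmark(fn: Callable[[], object],
              warmup: int = 10,
              iters: int = 100,
              sync: Optional[Callable[[], None]] = None) -> float:
    """Return the mean wall-clock time per call of `fn`, in milliseconds.

    `sync` is an optional hook that blocks until queued work has finished.
    It is called before the timer starts and before it stops, so that
    asynchronously launched work is fully counted.
    """
    # Warmup: untimed calls that absorb one-time costs such as compilation.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    elapsed_s = time.perf_counter() - start

    return elapsed_s * 1e3 / iters
```

On a GPU, `fn` would be a closure over the eager operation or its `torch.compile`-wrapped counterpart, and `sync` would be `torch.cuda.synchronize`, because CUDA kernels are launched asynchronously and the host timer would otherwise stop before the work completes. The warmup calls also matter for the compiled variant, since its first call triggers compilation.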

Repository Structure

kernelbench/
│
├── notebook/
│   └── KernelBench.ipynb
│
├── figures/
│   ├── Peak Speedup.png
│   ├── Pytorch vs Compiled.png
│   ├── Compiled Kernel Speedup Ranked.png
│   ├── attention_latency.png
│   ├── matmul_tflops.png
│   └── speedup_summary.png
│
├── results/
│   ├── kernelbench.csv
│   ├── benchmark_summary.json
│   └── report.md
│
├── paper/
│   └── paper.md      
│
├── README.md
└── requirements.txt

Reproducing

Run the notebook from top to bottom on a CUDA-enabled GPU. The notebook will:

  1. Execute all 19 benchmark configurations across the 4 operations.
  2. Measure execution time and compute throughput, TFLOPS, and memory bandwidth.
  3. Generate the visualizations (the plots listed under figures/).
  4. Save raw results to results/kernelbench.csv and figures to figures/.
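Step 2 can be sketched as follows. These are the conventional formulas, stated here as assumptions rather than as the notebook's exact definitions: a matmul of shapes (M, K) × (K, N) is counted as 2·M·N·K floating-point operations, and memory bandwidth for an elementwise-style kernel is counted as one read plus one write of the tensor. All function names are hypothetical.

```python
def matmul_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Achieved TFLOPS for an (m, k) x (k, n) matmul.

    Assumes 2*m*n*k FLOPs (one multiply and one add per inner-product term).
    """
    flops = 2.0 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12


def bandwidth_gb_s(num_elements: int,
                   time_ms: float,
                   bytes_per_element: int = 2,  # float16, as in the setup
                   passes: int = 2) -> float:   # assumed: one read + one write
    """Effective memory bandwidth in GB/s for a kernel touching `num_elements`."""
    bytes_moved = num_elements * bytes_per_element * passes
    return bytes_moved / (time_ms * 1e-3) / 1e9


def calls_per_second(time_ms: float) -> float:
    """Throughput expressed as operation calls per second."""
    return 1e3 / time_ms
```

The `passes` parameter is exposed because the right byte count depends on the kernel: a softmax implementation that re-reads its input, for example, moves more data than the one-read-one-write model assumes.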

Limitations

  • Single GPU type (T4) — results may differ on newer architectures with different compiler backends.
  • Triton kernel implemented for softmax only; the other 3 operations compare PyTorch eager vs. compiled only, not against a hand-tuned Triton baseline.
  • Single run per configuration (100 timed iterations within that one run), so there is no variance estimate across repeated runs.
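A variance estimate for the last limitation could be obtained by repeating each configuration several times and summarizing the per-run means. The sketch below is one possible approach, not something the current notebook does; `summarize_runs` and its output keys are hypothetical names.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_runs(times_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize per-run mean latencies (ms) from independent repeated runs.

    Returns the mean, the sample standard deviation, and the coefficient
    of variation (std / mean) as a unitless measure of run-to-run noise.
    """
    if len(times_ms) < 2:
        raise ValueError("need at least two runs to estimate variance")
    m = mean(times_ms)
    s = stdev(times_ms)
    return {"mean_ms": m, "std_ms": s, "cv": s / m}
```

With a spread per configuration available, a speedup close to 1× (such as the attention result) could be checked against run-to-run noise before being called a win or a regression.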

Future Work

  • Implement Triton kernels for matmul, layer_norm, and attention, to test whether they outperform both eager and compiled PyTorch.
  • Benchmark on a more modern GPU architecture (A100 or H100) to see whether the matmul/softmax regression is T4-specific.
  • Profile compiled kernels directly (e.g. torch.profiler) to confirm the compilation-overhead hypothesis rather than infer it from timing alone.
  • Compare TorchInductor against FlashAttention implementations.
  • Evaluate mixed-precision performance and extend to larger transformer workloads.

About

GPU systems benchmark comparing PyTorch eager execution and torch.compile across common deep learning operations. Measures execution time, throughput, TFLOPS, memory bandwidth, and kernel-level speedups through a reproducible benchmarking pipeline and publication-ready visualizations.
