This project investigates the performance of PyTorch eager execution vs. torch.compile across common deep learning operations, measuring execution time, throughput, TFLOPS, and memory bandwidth. It reports the results even where compilation made things worse, not only where it helped.
Modern deep learning workloads are increasingly bottlenecked by GPU kernel efficiency rather than model architecture alone. Compiler optimizations such as torch.compile are often treated as a free speedup. This benchmark tests that assumption directly: does compilation actually help, for which operations, and at what sizes? The answer depends on the operation and the tensor size, and in some cases it is no.
Many GPU benchmarks report isolated timing from a single tensor size or operation, making it hard to tell whether a reported speedup generalizes. This benchmark evaluates 4 operations across 4–5 tensor sizes each (19 configurations total), reporting execution time, throughput, TFLOPS or memory bandwidth, and relative speedup for every one — including the cases where torch.compile underperformed eager execution.
| Operation | Mean Speedup (compiled vs. eager) | Range | Result |
|---|---|---|---|
| layer_norm | 2.12× | 1.07× – 4.11× | Consistent, substantial win |
| attention | 1.06× | 0.70× – 1.62× | Near parity, modest win |
| matmul | 0.91× | 0.41× – 1.23× | Net regression |
| softmax | 0.64× | 0.27× – 0.95× | Net regression |
torch.compile is not a uniform speedup. It delivered a substantial, consistent win for layer_norm (peak 4.11×) and a modest one for attention, but its mean speedup fell below 1× for matmul (0.91×) and softmax (0.64×), meaning compiled throughput was roughly 9% and 36% lower than eager PyTorch at the sizes tested. The most likely explanation is that these operations are already cheap and well fused in eager mode (softmax is a single tuned CUDA kernel; matmul is a single cuBLAS call), so there is little overhead left for compilation to remove and the overhead the compiled path adds does not fully amortize. Full discussion in results/report.md.
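A speedup below 1× can be read either as lower throughput or as longer run time, and the two percentages differ. The sketch below converts the mean speedups from the summary table into relative run time. It assumes speedup is defined as eager time divided by compiled time (the notebook's exact definition is not shown here), and `relative_time` is a hypothetical helper name. Because it is applied to a mean of per-size ratios, the result is only indicative.

```python
def relative_time(speedup: float) -> float:
    """Compiled run time as a multiple of eager run time.

    Assumption: speedup = t_eager / t_compiled.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Mean speedups copied from the summary table.
mean_speedups = {
    "layer_norm": 2.12,
    "attention": 1.06,
    "matmul": 0.91,
    "softmax": 0.64,
}

for op, speedup in mean_speedups.items():
    ratio = relative_time(speedup)
    print(f"{op}: {speedup:.2f}x speedup -> compiled takes {ratio:.2f}x the eager time")
```

Under this definition, a 0.64× speedup corresponds to about 1.56× the eager run time, which is a larger penalty than "36% lower throughput" may suggest at first glance.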
- Framework: PyTorch
- Compiler: torch.compile; Triton kernel implemented for softmax only
- Hardware: NVIDIA T4 GPU
- Precision: float16
- Warmup: 10 iterations, measured: 100 iterations per configuration
- Operations: Matrix Multiplication (512×512 to 4,096×4,096), Softmax (1,024–65,536 elements/row), LayerNorm (256–4,096), Scaled Dot-Product Attention (seq_len 128–2,048)
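The warmup-then-measure protocol listed above can be sketched as a small timing helper. This is a minimal illustration, not the notebook's actual harness: the function name `benchmark` and the `sync` parameter are hypothetical, and the notebook may time with CUDA events rather than a wall clock.

```python
import time
from typing import Callable, Optional


def benchmark(
    fn: Callable[[], object],
    warmup: int = 10,
    iters: int = 100,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return mean wall-clock seconds per call of `fn` over `iters` timed calls."""

    def _sync() -> None:
        # On a GPU, kernels launch asynchronously, so the caller should pass
        # a function that blocks until all queued work has finished.
        if sync is not None:
            sync()

    # Warmup: untimed calls that absorb one-time costs such as compilation.
    for _ in range(warmup):
        fn()
    _sync()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    _sync()  # wait for outstanding work before stopping the clock
    return (time.perf_counter() - start) / iters


# CPU-only demonstration with a trivial workload.
calls = []
mean_seconds = benchmark(lambda: calls.append(sum(range(1000))))
print(f"{len(calls)} calls, {mean_seconds * 1e6:.2f} us per timed call")
```

On a CUDA device, one would pass `sync=torch.cuda.synchronize` and time the same callable twice, once as-is and once wrapped in `torch.compile`, so that both paths go through identical warmup and measurement.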
kernelbench/
│
├── notebook/
│ └── KernelBench.ipynb
│
├── figures/
│ ├── Peak Speedup.png
│ ├── Pytorch vs Compiled.png
│ ├── Compiled Kernel Speedup Ranked.png
│ ├── attention_latency.png
│ ├── matmul_tflops.png
│ └── speedup_summary.png
│
├── results/
│ ├── kernelbench.csv
│ ├── benchmark_summary.json
│ └── report.md
│
├── paper/
│ └── paper.md
│
├── README.md
└── requirements.txt
Run the notebook from top to bottom on a CUDA-enabled GPU. The notebook will:
- Execute all 19 benchmark configurations across the 4 operations.
- Measure execution time and compute throughput, TFLOPS, and memory bandwidth.
- Generate the visualizations above.
- Save raw results to results/kernelbench.csv and figures to figures/.
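For readers who want to check the derived metrics, the following sketch shows one common way to turn a measured time into TFLOPS and memory bandwidth. The formulas are assumptions, not taken from the notebook: a square matmul is counted as 2·N³ floating-point operations, and a memory-bound operation such as softmax is counted as one read plus one write of every element at 2 bytes each (float16). The function names are hypothetical.

```python
def matmul_tflops(n: int, seconds: float) -> float:
    """TFLOPS for an (n x n) @ (n x n) matmul, counting 2 * n^3 FLOPs."""
    flops = 2 * n**3  # one multiply and one add per inner-product term
    return flops / seconds / 1e12


def elementwise_bandwidth_gbs(
    num_elements: int, seconds: float, bytes_per_element: int = 2
) -> float:
    """Effective GB/s, assuming each element is read once and written once."""
    bytes_moved = 2 * num_elements * bytes_per_element
    return bytes_moved / seconds / 1e9


# Placeholder timings for illustration only; these are not measured results.
placeholder_seconds = 2e-3
print(f"matmul n=1000: {matmul_tflops(1000, placeholder_seconds):.2f} TFLOPS")
print(
    "1,000,000 fp16 elements: "
    f"{elementwise_bandwidth_gbs(1_000_000, placeholder_seconds):.2f} GB/s"
)
```

Counting only one read and one write gives a lower bound on traffic; an implementation that makes several passes over the data moves more bytes than this estimate credits.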
- Single GPU type (T4) — results may differ on newer architectures with different compiler backends.
- Triton kernel implemented for softmax only; the other 3 operations compare PyTorch eager vs. compiled only, not against a hand-tuned Triton baseline.
- Single run per configuration (100 timed iterations), so there is no variance estimate across repeated runs.
- Implement Triton kernels for matmul, layer_norm, and attention, to test whether they outperform both eager and compiled PyTorch.
- Benchmark on a more modern GPU architecture (A100 or H100) to see whether the matmul/softmax regression is T4-specific.
- Profile compiled kernels directly (e.g. torch.profiler) to confirm the compilation-overhead hypothesis rather than infer it from timing alone.
- Compare TorchInductor against FlashAttention implementations.
- Evaluate mixed-precision performance and extend to larger transformer workloads.