A unified HPC toolchain — deterministic compiler, async runtime, CUTLASS kernels, and SciML inference.
PyC pairs a rock-stable, cross-platform CMake core with an experimental "compiler-next" optimization stack: an SSA-style IR, a deterministic pass pipeline, a memory planner, kernel autotuning, and a reason-coded CUDA dispatch path with graph replay.
- Why PyC
- Architecture
- Repository Layout
- Quick Start
- Build Targets
- Using PyC
- Benchmarking
- GPU Benchmarking
- Documentation
- Project Status
- Contributing
- License
PyC is built around two contracts that evolve independently:
| Contract | Promise | Scope |
|---|---|---|
| Stable core | Reproducible CMake targets, deterministic CI, clean downstream linking | pyc_core_obj, pyc_core, pyc_foundation, pyc |
| Compiler-next | Fast-moving optimization research with explicit, observable behavior | IR, passes, allocator, kernel registry, runtime rails, CUDA backend, AI bridge |
The guiding design rule is isolation: the experimental compiler stack can move quickly without ever destabilizing the stable artifacts that CI and downstream consumers depend on.
What sets the compiler-next stack apart:
- Deterministic by construction — module fingerprinting, a content-addressed compile cache, and strict-mode guards that fail loud with explicit reasons instead of drifting silently.
- Reason-coded fallback — every CUDA dispatch records why it took the native path, replayed a captured graph, or fell back to the CPU executor.
- Observable optimization — decision logs and
pyc_run_statsexpose objective mode, pressure score, kernel selection traces, graph-break taxonomy, and autotune state. - Co-designed memory + kernels — a first-fit lifetime-reuse allocator and a scoring kernel registry that share pressure signals.
PyC is a polyglot stack. A thin Python/Rust surface sits over a deterministic C engine, which dispatches to CUDA/CUTLASS kernels when a GPU is present and falls back to a CPU executor otherwise.
flowchart TD
subgraph Surface["Surface — Python / Rust"]
PY["pyc Python package<br/>(CLI, runtime control plane)"]
API["FastAPI inference portal<br/>apps/inference_api"]
RS["vortex_core (Rust)<br/>async runtime + PyO3 bindings"]
end
subgraph Engine["Compiler-next Engine — C11"]
CAPI["compiler_api.c<br/>compile/run lifecycle, cache, decision log"]
IR["ir.c<br/>op graph, verify, serialize"]
PASS["pass_manager.c<br/>canonicalize · shape · fuse · liveness · graph-break"]
ALLOC["runtime_allocator.c<br/>lifetime reuse + rematerialization"]
KREG["kernel_registry.c<br/>candidate scoring + autotune"]
CTRL["runtime_control.c<br/>objective mode + rollback rails"]
end
subgraph Backends["Backends"]
CUDA["cuda_backend.c<br/>native cuBLAS/cuBLASLt + graph replay"]
CUTLASS["CUTLASS kernels<br/>GEMM · attention · conv2d · Ada fast path"]
CPU["CPU executor<br/>deterministic reference path"]
COMM["collective comm<br/>MPI · NCCL · RCCL · stub"]
end
PY --> RS --> CAPI
API --> CAPI
CAPI --> IR --> PASS --> ALLOC --> KREG --> CTRL
CTRL --> CUDA
CUDA --> CUTLASS
CUDA -. reason-coded fallback .-> CPU
CTRL --> COMM
Compile → run lifecycle:
flowchart LR
A["pyc_compile_model"] --> B["verify IR"]
B --> C["fingerprint + cache key"]
C --> D{"cache hit?"}
D -- yes --> E["restore plan/kernel"]
D -- no --> F["pass pipeline"]
F --> G["allocation plan"]
G --> H["select kernel + autotune"]
H --> I["store cache entry"]
E --> J["pyc_run_model"]
I --> J
J --> K{"GPU available?"}
K -- yes --> L["CUDA dispatch<br/>(native / graph replay)"]
K -- no --> M["CPU executor"]
L --> N["runtime rails + stats"]
M --> N
For the full component matrix, IR rules, pass order, allocator algorithm, selection scoring, and CUDA control variables, see docs/architecture/system-architecture.md.
PyC/
├── src/
│ ├── core/ # Stable C core: file adapter, symbol table, stack, CI driver
│ ├── compiler/ # Compiler-next: IR, passes, runtime rails, CUDA backend, AI bridge
│ │ ├── ir/ # Op-graph IR + deterministic serialization
│ │ ├── passes/ # Canonicalize, shape inference, fusion, liveness, graph-break
│ │ ├── runtime/ # Allocator, kernel registry, controller, CUDA + comm backends
│ │ └── cutlass_kernels/ # GEMM, attention, conv2d, Ada async fast path
│ └── runtime/ # Rust async runtime (vortex_core) + PyO3 bindings
├── include/pyc/ # Public C headers (the compiler-next API surface)
├── python/pyc/ # Python package: CLI, runtime control plane, telemetry
├── apps/inference_api/ # FastAPI inference portal
├── kernels/ # Kernel lab tooling + Ada/Hopper prototype families
├── benchmark/ # Deterministic benchmark harness, workloads, regression gates
├── tests/ # C compiler-next tests + Python tests + repo validators
├── infra/ # GPU-host bootstrap (Docker image + setup scripts)
├── scripts/ # Developer + benchmark automation
├── docs/ # Architecture, references, plans, reports, roadmap
└── web/site/ # Static site, download page, inference portal, published results
Prerequisites: CMake ≥ 3.10, a C11 compiler, and Python 3 (for the benchmark harness).
# Configure and build the stable core targets
cmake -S . -B build
cmake --build build --parallel --target pyc pyc_core pyc_foundation
# Smoke test
./build/pyc # Linux/macOS
# .\build\Release\pyc.exe # Windows multi-config generatorsExpected output:
PyC CI driver: core targets configured successfully.
To build the experimental compiler-next stack and its smoke test:
cmake -S . -B build -D PYC_BUILD_COMPILER_NEXT=ON -D PYC_BUILD_COMPILER_NEXT_TESTS=ON
cmake --build build --parallel --target pyc_compiler_next pyc_compiler_next_smoke
./build/pyc_compiler_next_smoke
ctest --test-dir build --output-on-failureIncremental builds: reuse the same
build/directory and re-runcmake --build build --parallel. CI's Linux/macOS jobs useccacheto cut repeated compile time.
| Target | Type | Description |
|---|---|---|
pyc_core_obj |
object library | Stable core objects, compiled once |
pyc_core |
static library | Canonical static library for downstream linking |
pyc_foundation |
static library | Compatibility alias of the same objects |
pyc |
executable | Minimal deterministic CI/smoke driver |
pyc_compiler_next |
static library | Compiler-next engine (requires PYC_BUILD_COMPILER_NEXT=ON) |
pyc_compiler_next_smoke |
executable | Compiler-next smoke test |
pyc_compiler_next_test_* |
executables | Deterministic compiler-next test suite (requires PYC_BUILD_COMPILER_NEXT_TESTS=ON) |
A single canonical workflow — cmake-multi-platform.yml — runs on Ubuntu, macOS, and Windows:
- CMake configure
- Build + smoke-test the compiler-next targets
ctestacross the deterministic suite- Build + smoke-test the stable targets
- Benchmark regression guardrail (Ubuntu)
CI also enforces source coverage: every active .c file under src/core/C_Files, src/compiler, and tests/compiler_next must be referenced by CMakeLists.txt, or the build fails.
cmake -S . -B build
cmake --build build --parallel --target pyc_coreIn your C/C++ project, add the include path src/core/Header_Files/ and link build/libpyc_core.a (or the platform equivalent). Use pyc_foundation only where downstream compatibility requires it.
The compiler-next surface is the set of headers under include/pyc/:
| Header | Purpose |
|---|---|
compiler_api.h |
Compile/run lifecycle, options, output stats, decision log |
ir.h |
Op-graph IR construction and verification |
pass_manager.h |
Pass pipeline configuration and reports |
runtime_allocator.h |
Lifetime-based allocation planner |
kernel_registry.h |
Kernel candidate registration, scoring, autotune |
runtime_control.h |
Objective-mode switching and rollback rails |
cuda_backend.h |
CUDA dispatch, graph replay, reason-coded fallback |
ai_bridge.h |
AI-assisted optimization bridge layer |
PyC ships a deterministic benchmark harness for the stable core targets.
python3 benchmark/harness.py --repeats 7 --micro-rounds 4000Outputs:
benchmark/benchmarks/results/json/latest_core.jsonbenchmark/benchmarks/results/reports/latest_core.mddocs/reports/performance-results.md
Publish website-ready artifacts:
python3 scripts/publish_site_results.py
# -> web/site/results/manifest.json, latest-summary.json, artifacts/**See docs/reference/benchmarking.md for methodology.
For real GPU testing on a rented Linux host:
# 1. Provision Ubuntu + NVIDIA GPU, then reuse the bootstrap image
bash infra/build_bootstrap_image.sh
INSTALL_SYSTEM_DEPS=0 bash infra/run_bootstrap_image.sh
# 2. Validate the toolchain if needed
bash scripts/setup_cuda_remote_ubuntu.sh
source .venv/bin/activate
# 3. Run the standardized GPU suite
python3 benchmark/benchmarks/gpu/run_gpu_suite.py --device cuda --tag gpu_baselineHuman-readable single run:
bash scripts/run_pyc_bench_pretty.sh cuda 64 1024 5 2Kernel-level Ada work via the kernel lab:
python3 kernels/lab/kernel_lab.py task-create ada-sm89-gemm --task-kind gemm --candidate-tag ada
python3 benchmark/benchmarks/gpu/run_gemm_suite.py \
--matrix-file benchmark/benchmarks/gpu/configs/ada_fp32_gemm_shapes.json --dry-runThe adapter comparison covers torch_eager, torch_compile, pyc, tvm, xla, tensorrt, and glow. For non-PyTorch backends, point the corresponding env var (e.g. TVM_BENCH_CMD, TENSORRT_BENCH_CMD, PYC_GPU_BENCH_CMD) at a command that emits JSON. Full guide: docs/compiler-next/gpu-testing-playbook.md.
Start with the documentation index. High-value entry points:
| Topic | Document |
|---|---|
| System architecture & data flow | docs/architecture/system-architecture.md |
| Project overview & terminology | docs/reference/project-overview.md |
| Build & CI reference | docs/reference/build-and-ci.md |
| Benchmark methodology | docs/reference/benchmarking.md |
| Compiler-next overview | docs/compiler-next/overview.md |
| IR specification | docs/compiler-next/ir-spec.md |
| Pass pipeline | docs/compiler-next/pass-pipeline.md |
| Memory planner | docs/compiler-next/runtime-memory-planner.md |
| CUDA GEMM fast path | docs/compiler-next/cuda-gemm-fast-path.md |
| Project status | docs/reports/project-status.md |
- Stable core — cross-platform CMake/link targets are in place and gated by deterministic CI.
- Compiler-next — functional behind
PYC_BUILD_COMPILER_NEXT=ON; covered by a deterministic test suite but not yet part of the stable CI guarantees. The stack spans deterministic guards, compile cache, speculative plans, phantom-graph tracking, runtime controller rails, rematerialization policy, kernel + allocator co-selection, and the Ada FP32 CUDA fast path. - Release binaries — published per OS by
release-binaries.yml(pyc-linux-x86_64.tar.gz,pyc-macos-arm64.tar.gz,pyc-windows-x86_64.zip), with an OS-detecting download page atweb/site/index.html.
Contributions are welcome. Please read CONTRIBUTING.md for the workflow and validation checklist, docs/reference/repository-rules.md for layout guardrails, and CODE_OF_CONDUCT.md. Security issues: see SECURITY.md.
Licensed under the Apache License 2.0.