feat(cuda): bound the device block reuse cache with an env byte cap by NicolasRouquette · Pull Request #10 · lean-dojo/TorchLean

NicolasRouquette · 2026-06-23T21:10:31Z

Summary

Adds an opt-in byte cap on the CUDA buffer reuse cache (the exact-size device-block free list behind take_cached_block / return_cached_block).

The cache grows without bound: a dropped buffer is returned to it, and it is only emptied on a cudaMalloc-failure flush or at process exit. A loop over many distinct buffer sizes therefore accretes device memory that liveBytes does not even see — a returned block is accounted as freed before it is cached.

TORCHLEAN_CUDA_CACHE_CAP_BYTES bounds the cache (0 = unbounded, exactly the prior behaviour). A returned block that would grow the cache past the cap is freed immediately instead of being cached. The free waits on the block's completion event first, exactly as the flush path does, so an in-flight kernel never reads freed memory.
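The cap decision can be modelled on the host. The following is a minimal sketch, not the PR's code: `CappedBlockCache`, `give_back`, and `cached_after_returns` are hypothetical names, `malloc`/`free` stand in for the device allocator, and the completion-event wait is only marked by a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <unordered_map>
#include <vector>

// Host-side model of an exact-size block reuse cache with an optional byte cap.
// Hypothetical names throughout; malloc/free stand in for device allocation.
class CappedBlockCache {
public:
    // cap_bytes == 0 means unbounded (the pre-cap behaviour).
    explicit CappedBlockCache(std::size_t cap_bytes) : cap_bytes_(cap_bytes) {}
    ~CappedBlockCache() { flush(); }

    // Return a cached block of exactly `bytes`, or nullptr on a miss.
    void* take(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = free_lists_.find(bytes);
        if (it == free_lists_.end() || it->second.empty()) return nullptr;
        void* p = it->second.back();
        it->second.pop_back();
        cache_bytes_ -= bytes;  // decremented on reuse
        return p;
    }

    // Cache the block, or free it if caching would grow the cache past the cap.
    // Returns true when the block was cached.
    bool give_back(void* p, std::size_t bytes) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            if (cap_bytes_ == 0 || cache_bytes_ + bytes <= cap_bytes_) {
                free_lists_[bytes].push_back(p);
                cache_bytes_ += bytes;  // incremented on cache-push
                return true;
            }
        }
        // Over the cap. A real backend would wait on the block's completion
        // event here before releasing the memory. This sketch frees outside
        // the lock, which is a design choice of the sketch, not a known fact.
        std::free(p);
        return false;
    }

    // Free every cached block and reset the running total.
    void flush() {
        std::lock_guard<std::mutex> lock(mu_);
        for (auto& kv : free_lists_)
            for (void* p : kv.second) std::free(p);
        free_lists_.clear();
        cache_bytes_ = 0;
    }

    std::size_t cache_bytes() const {
        std::lock_guard<std::mutex> lock(mu_);
        return cache_bytes_;
    }

private:
    std::size_t cap_bytes_;
    mutable std::mutex mu_;
    std::size_t cache_bytes_ = 0;
    std::unordered_map<std::size_t, std::vector<void*>> free_lists_;
};

// Return `count` blocks of `block_bytes` each and report what stayed cached.
inline std::size_t cached_after_returns(std::size_t cap_bytes,
                                        std::size_t block_bytes, int count) {
    CappedBlockCache cache(cap_bytes);
    for (int i = 0; i < count; ++i)
        cache.give_back(std::malloc(block_bytes), block_bytes);
    return cache.cache_bytes();
}
```

With uniform 64 KiB blocks (an assumed block size), returning 128 of them under a 1 MiB cap leaves exactly 1 MiB cached, and the same workload with a cap of 0 leaves all 8 MiB cached.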

What's in it

A running cache_bytes total under the cache mutex (incremented on cache-push, decremented on reuse, reset on flush) backs the cap decision.
New telemetry field AllocatorStats.cacheBytes (extern torchlean_cuda_allocator_cache_bytes; the CPU stub, which keeps no cache, mirrors it and returns 0); AllocatorStats.format gains cache=<MiB>.
CUDA-only: the stub frees on release and has no cache to cap.
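A read-once environment cap could look like the sketch below. It is an illustration only: `parse_cap_bytes` and `cache_cap_bytes` are hypothetical helpers, and treating malformed or negative input as 0 (unbounded) is an assumption rather than documented behaviour.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>

// Parse a decimal byte count. Null, empty, signed, malformed, or out-of-range
// input maps to 0, which this sketch treats as "unbounded" (an assumption).
inline std::size_t parse_cap_bytes(const char* text) {
    if (text == nullptr) return 0;
    if (*text < '0' || *text > '9') return 0;  // rejects "", "-1", " 5", "abc"
    errno = 0;
    char* end = nullptr;
    unsigned long long v = std::strtoull(text, &end, 10);
    if (errno != 0 || *end != '\0') return 0;  // overflow or trailing junk
    return static_cast<std::size_t>(v);
}

// Read the variable once; later changes to the environment are ignored,
// which is why a test must start a fresh process per configuration.
inline std::size_t cache_cap_bytes() {
    static const std::size_t cap =
        parse_cap_bytes(std::getenv("TORCHLEAN_CUDA_CACHE_CAP_BYTES"));
    return cap;
}
```

The function-local static makes the first call pay for the `getenv` lookup and every later call return the cached value.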

Test

runCacheCapTest (in nn_tests_suite, runs on both the CUDA build and the CPU stub). The cap is read once natively, so — like the existing arena-detector death test — it forks the suite binary per configuration:

capped — a 1 MiB cap bounds an 8 MiB return workload to exactly 1 MiB;
control — with no cap the same workload caches the full 8 MiB.

On CUDA it also checks the cap is the binding constraint (the cache fills to within one block of it). On the stub cacheBytes stays 0, so the cap bound holds trivially.
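The fork-per-configuration pattern can be sketched with POSIX `fork` and `waitpid`. This is a simplified illustration, not the suite's harness: `run_in_child` and `ChildOutcome` are hypothetical, and the child runs a callable in-process instead of re-executing the suite binary.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class ChildOutcome { CleanExit, NonZeroExit, Signaled, ForkFailed };

// Run `scenario` in a forked child with one environment variable set
// (or unset when env_value is null) and classify how the child ended.
template <class Fn>
ChildOutcome run_in_child(const char* env_name, const char* env_value,
                          Fn scenario) {
    pid_t pid = fork();
    if (pid < 0) return ChildOutcome::ForkFailed;
    if (pid == 0) {
        // Child: configure the environment before anything reads it.
        if (env_value != nullptr) setenv(env_name, env_value, 1);
        else unsetenv(env_name);
        int code = scenario();
        _exit(code);  // skip atexit handlers and stdio flushing in the child
    }
    // Parent: wait for the child and inspect its status.
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return ChildOutcome::ForkFailed;
    if (WIFSIGNALED(status)) return ChildOutcome::Signaled;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return ChildOutcome::CleanExit;
    return ChildOutcome::NonZeroExit;
}
```

A child that aborts is reported as `Signaled`, which is how a panic that cannot be caught in-process becomes an observable outcome in the parent.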

scripts/checks/check.sh --ci-all --cuda: build + test + lint green.

Note: stacks on #9

This branches off #9 (cuda-arena-scope), reusing its buildArenaScratch test helper and fork harness. Until #9 merges, this PR shows #9's two commits as well; review only the last commit (feat(cuda): bound the device block reuse cache with an env byte cap). Once #9 merges to main, this PR reduces to that single commit. Merging #9 first is the intended order.

Long pure eager loops (a foldl of Buffer -> Buffer ops with no IO sequencing) keep every intermediate Buffer GC-reachable until the final readback, so the reference-counting finalizers that free device memory never run and the working set grows with the loop length. Explicit release cannot help from inside a pure carrier: there is no IO point at which to call it.

withCudaArena opens an allocation epoch; every device buffer allocated while it is open is tracked. On exit it frees the device data of all of them -- reachable or not -- except a small `keep` set (the step's results), which are promoted to the parent epoch (or left to ordinary finalization at the outermost level). This matches a training loop's natural phase boundary (one step / one fold) and keeps the pure carrier unchanged; only the driver wraps each step.

Mechanism (csrc/cuda/common/torchlean_cuda_arena.h, shared by the CUDA backend and the CPU stub): each tracked buffer points at a heap reg, and back; a buffer finalized mid-scope flips its reg's alive flag instead of leaving a dangling pointer, so the exit walk skips it. Registry mutation is mutex-guarded; the no-arena fast paths take no lock, so allocation and finalization outside an arena are unchanged.

Wiring: buffer_alloc registers, finalize/drop_unboxed unlink, plus two IO export wrappers; the buffer struct gains one pointer field.

Lean API: Buffer.arenaEnter / arenaExit / withCudaArena (Buffer.lean).

Test: runArenaStress (NN/Tests/Runtime/Cuda/Stress.lean, wired into Stress.run) drives k live, never-released buffers through a scope -- the case release cannot reach -- holding them alive across the exit so only the arena can free them, and asserts via the allocator telemetry that (1) they are live in-scope, (2) the arena reclaims exactly them at exit with live bytes restored, and (3) a kept buffer is promoted and stays usable. The assertions use only the shared allocator counters, so the test is backend-agnostic; verified on the real CUDA backend, and the CPU stub compiles cleanly for GPU-less CI.
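The registry mechanism described above can be sketched as follows. This is a single-threaded model under stated assumptions: `Reg`, `Buffer`, `ArenaStack`, and their members are hypothetical names, a boolean stands in for device data, and the mutex that guards the real registry is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <vector>

struct Buffer;

// Heap registration record. It outlives the buffer's finalization, so the
// exit walk never follows a dangling pointer.
struct Reg {
    Buffer* buf;  // back-pointer, valid only while alive is true
    bool alive;   // flipped to false when the buffer is finalized mid-scope
};

struct Buffer {
    std::size_t size = 0;   // 0 after arena reclamation
    bool has_data = false;  // stands in for a live device allocation
    Reg* reg = nullptr;     // non-null only while tracked by an arena
};

class ArenaStack {
public:
    void enter() { epochs_.emplace_back(); }

    // Register a freshly allocated buffer with the innermost open epoch.
    void track(Buffer& b) {
        if (epochs_.empty()) return;  // no-arena fast path: nothing recorded
        std::unique_ptr<Reg> reg(new Reg{&b, true});
        b.reg = reg.get();
        epochs_.back().push_back(std::move(reg));
    }

    // Finalizer path: release the data and unlink from the registry.
    void finalize(Buffer& b) {
        if (b.reg != nullptr) {
            b.reg->alive = false;
            b.reg->buf = nullptr;
            b.reg = nullptr;
        }
        b.has_data = false;
        b.size = 0;
    }

    // Close the innermost epoch. Buffers in `keep` are promoted to the parent
    // epoch (or untracked at the outermost level); all other live buffers are
    // reclaimed. Returns the number of buffers reclaimed.
    std::size_t leave(const std::unordered_set<const Buffer*>& keep) {
        if (epochs_.empty()) return 0;
        std::vector<std::unique_ptr<Reg>> regs = std::move(epochs_.back());
        epochs_.pop_back();
        std::size_t reclaimed = 0;
        for (auto& reg : regs) {
            if (!reg->alive) continue;  // finalized mid-scope: skip
            Buffer* b = reg->buf;
            if (keep.count(b) != 0) {
                if (!epochs_.empty()) epochs_.back().push_back(std::move(reg));
                else b->reg = nullptr;  // left to ordinary finalization
                continue;
            }
            b->has_data = false;  // free the device data, reachable or not
            b->size = 0;
            b->reg = nullptr;
            ++reclaimed;
        }
        return reclaimed;
    }

private:
    std::vector<std::vector<std::unique_ptr<Reg>>> epochs_;
};
```

A driver would call `enter()` before a step and `leave({&result})` after it, so only the step's result survives the scope.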

A buffer reclaimed by `arena_exit` keeps its Lean external object but has size 0 and freed/cached data; using it afterwards is a use-after-free. Before, that surfaced only as a cryptic "size mismatch (N vs 0)" deep in a kernel -- and slipped through silently when both operands were freed, since the bare size check passes (0 == 0).

Add an opt-in detector (TORCHLEAN_ARENA_DEBUG=1): `arena_exit` records the reclaiming epoch in a new `arena_freed_depth` buffer field, and the `require_same_size{2,3}` choke point (which every binary/ternary op already calls) asserts liveness before comparing sizes, so a stale operand panics naming the op and epoch:

    use-after-arena-free: torchlean_cuda_buffer_add lhs was reclaimed by arena_exit at depth 0

When off (the default) the detector is one predicted branch on a cached env read plus an 8-byte field; the suite runs at the same wall-clock on or off.

Regression coverage: `runArenaDetectorDeathTest` in nn_tests_suite (runs on both the CUDA build and the CPU stub). A detected UAF must panic, which cannot be caught in-process, so it forks the suite binary per configuration and inspects the outcome:

* positive -- detector on + planted UAF => child aborts with the message
* negative -- detector on + a valid promotion reused in a binary op (through the detector's own choke point) => clean exit
* control -- detector off + planted UAF => clean exit (silent slip)

check.sh --ci-all --cuda: build + test + lint all green.
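The choke-point check can be illustrated with a small model. It is a sketch under assumptions: `Buf`, `check_live`, and `run_check` are hypothetical, the -1 sentinel for "never reclaimed" is assumed, the detector switch is passed as a parameter instead of read from the environment, and the sketch throws an exception where the real detector panics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

struct Buf {
    std::size_t size = 0;
    std::int64_t arena_freed_depth = -1;  // -1: never reclaimed (assumed sentinel)
};

// Throw if the operand was reclaimed by an arena exit.
inline void check_live(const char* op, const char* role, const Buf& b) {
    if (b.arena_freed_depth >= 0)
        throw std::runtime_error(
            std::string("use-after-arena-free: ") + op + " " + role +
            " was reclaimed by arena_exit at depth " +
            std::to_string(b.arena_freed_depth));
}

// Choke point for binary ops: liveness first (when enabled), then sizes.
inline void require_same_size2(const char* op, const Buf& lhs, const Buf& rhs,
                               bool detector_on) {
    if (detector_on) {
        check_live(op, "lhs", lhs);
        check_live(op, "rhs", rhs);
    }
    if (lhs.size != rhs.size)
        throw std::runtime_error(std::string(op) + ": size mismatch (" +
                                 std::to_string(lhs.size) + " vs " +
                                 std::to_string(rhs.size) + ")");
}

// Run the check and report "ok" or the failure message.
inline std::string run_check(const char* op, const Buf& lhs, const Buf& rhs,
                             bool detector_on) {
    try {
        require_same_size2(op, lhs, rhs, detector_on);
        return "ok";
    } catch (const std::runtime_error& e) {
        return e.what();
    }
}
```

With the detector off, two reclaimed operands both have size 0 and pass the size comparison, which reproduces the silent slip; with it on, the first stale operand is named in the message.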

The reuse cache (return/take of exact-size device blocks) grew without bound: a dropped buffer is returned to the cache, and the cache is only emptied on a cudaMalloc-failure flush or at process exit. A loop over many distinct sizes therefore accretes device memory that `liveBytes` does not see -- a returned block is accounted as freed before it is cached.

Add an opt-in byte cap, `TORCHLEAN_CUDA_CACHE_CAP_BYTES` (0 = unbounded, the prior behaviour exactly): a returned block that would grow the cache past the cap is freed immediately -- after waiting on its completion event, exactly as the flush path does, so an in-flight kernel never reads freed memory -- instead of being cached. A running `cache_bytes` total under the cache mutex backs the decision and is surfaced as a new `AllocatorStats.cacheBytes` telemetry field (0 in the CPU stub, which keeps no cache and so has nothing to cap).

Regression test `runCacheCapTest` (nn_tests_suite, runs on both the CUDA build and the CPU stub): the cap is read once natively, so -- like the arena detector death test -- it forks the suite binary per configuration. With a 1 MiB cap an 8 MiB return workload is bounded to exactly 1 MiB; with no cap (control) the same workload caches the full 8 MiB. On CUDA the test also checks the cap is the binding constraint (the cache fills to within one block of it). On the stub cacheBytes stays 0, so the cap bound holds trivially.

scripts/checks/check.sh --ci-all --cuda: build + test + lint green.

NicolasRouquette added 3 commits June 23, 2026 07:59

NicolasRouquette wants to merge 3 commits into lean-dojo:main from NicolasRouquette:cuda-cache-byte-cap
