Add withCudaArena: scoped device-memory reclamation for long eager loops by NicolasRouquette · Pull Request #9 · lean-dojo/TorchLean

NicolasRouquette · 2026-06-23T15:03:25Z

Summary

Adds withCudaArena, a scoped device-memory arena for the eager CUDA backend.

Large CUDA buffers are wrapped as Lean external objects freed by reference-counting finalizers. In a
long pure eager loop — a foldl of Buffer → Buffer ops with no IO sequencing — every
intermediate Buffer stays GC-reachable until the final readback, so the finalizers never run and
the device working set grows with the loop length. Explicit release cannot help from inside a pure
carrier: there is no IO point at which to call it.

withCudaArena opens an allocation epoch; every device buffer allocated while it is open is tracked.
On exit it frees the device data of all of them — reachable or not — except a small keep set
(the step's results), which are promoted to the parent epoch (or left to ordinary finalization at the
outermost level). This matches a training loop's natural phase boundary (one step / one fold), and it
is compatible with a pure carrier: the carrier is unchanged; only the driver wraps each step.

Mechanism

csrc/cuda/common/torchlean_cuda_arena.h (a new shared header, used by both the CUDA backend and the
CPU stub) holds an epoch registry:

each tracked buffer points at a heap registration record, and back;
a buffer finalized mid-scope flips its record's alive flag instead of leaving a dangling
pointer, so the exit walk skips it (no use-after-free, no double-free);
registry mutation is mutex-guarded; the no-arena fast paths take no lock, so allocation and
finalization outside an arena are unchanged.

Wiring is minimal: torchlean_cuda_buffer_alloc registers, finalize/drop_unboxed unlink, plus two
IO export wrappers (torchlean_cuda_arena_enter/_exit). The buffer struct gains one pointer field.

Lean API

Buffer.arenaEnter : IO Unit
Buffer.arenaExit  (keep : @& Array Buffer) : IO Unit
Buffer.withCudaArena (keep : α → Array Buffer) (body : IO α) : IO α  -- exception-safe

Contract: every buffer that must outlive the scope must appear in keep; touching any other
in-scope buffer after arenaExit is a use-after-free, the same hazard as reading a buffer after
release.

Test

runArenaStress (NN/Tests/Runtime/Cuda/Stress.lean, wired into Stress.run) drives k live,
never-released buffers through a scope — the case release cannot reach — holding them alive
across the exit so only the arena can free them. Using the existing allocator telemetry it asserts:

the k buffers are live in-scope (allocated, not yet freed);
scope exit reclaims exactly them — freeCount += k, live bytes restored — even though they are
still referenced;
a kept buffer is promoted out of the scope and stays usable afterward.

The assertions read only the shared allocator counters (Buffer.allocatorStats), which have CPU-stub
parity, and the arena registry itself is one header shared by both backends — so the test is
backend-agnostic. Verified by execution on the real CUDA backend; the CPU-stub backend compiles
cleanly (so it builds in GPU-less CI).

Observed on the real CUDA backend (-K cuda=true):

== cuda arena scope stress ==
  baseline:        live=0.080238 MiB ... allocs=954 frees=476 ... cuda_total=20009.250000 MiB
  reclaimed 64 live in-scope buffers across 4 arenas
  promoted buffer survived its arena
...
== TorchLean: all curated tests passed ==

Notes

Additive: no production op changed; allocation/finalization outside an arena is byte-for-byte the
prior behavior.
Threading: arenaEnter/arenaExit and the allocations between them run on one driver thread;
finalizers (which only ever unlink) may run on any thread and are mutex-safe. Concurrent arenas on
multiple threads are out of scope for this first cut.

Follow-up: use-after-arena-free detector (commit `eed66d6`)

A buffer reclaimed by arenaExit keeps its Lean external object but has size 0 and freed/cached data;
using it afterwards is a use-after-free. It previously surfaced only as a cryptic size mismatch (N vs 0)
deep in a kernel — and slipped through silently when both operands were freed, since the bare size
check passes (0 == 0).

This adds an opt-in detector (TORCHLEAN_ARENA_DEBUG=1): arenaExit records the reclaiming epoch in a
new arena_freed_depth buffer field, and the require_same_size{2,3} choke point (which every
binary/ternary op already calls) asserts liveness before comparing sizes, so a stale operand panics
naming the op, operand, and epoch:

use-after-arena-free: torchlean_cuda_buffer_mul rhs was reclaimed by arena_exit at depth 0

When off (the default), the detector is one predicted branch on a cached env read plus an 8-byte field;
the suite runs at the same wall-clock on or off.

Regression test runArenaDetectorDeathTest (in nn_tests_suite, runs in the normal suite on both
the CUDA build and the CPU stub). A detected UAF must panic, which cannot be caught in-process, so it
forks the suite binary per configuration and inspects the outcome:

positive — detector on + a planted UAF ⇒ the child aborts with the message;
negative — detector on + a valid promotion reused in a binary op (through the detector's own
choke point) ⇒ clean exit, so the detector never fires on a kept buffer;
control — detector off + the planted UAF ⇒ clean exit (the silent slip the detector closes).

It already earned its keep: it pinpointed a real cross-call use-after-arena-free in a downstream
nested-arena consumer — a loop-invariant constant, CSE'd to one allocation shared across repeated fit
calls, allocated in the first call's outer epoch and reused after that epoch freed it — which the bare
size mismatch message could not diagnose.

scripts/checks/check.sh --ci-all --cuda: build + test + lint all green.

Long pure eager loops (a foldl of Buffer -> Buffer ops with no IO sequencing) keep every intermediate Buffer GC-reachable until the final readback, so the reference-counting finalizers that free device memory never run and the working set grows with the loop length. Explicit release cannot help from inside a pure carrier: there is no IO point at which to call it. withCudaArena opens an allocation epoch; every device buffer allocated while it is open is tracked. On exit it frees the device data of all of them -- reachable or not -- except a small `keep` set (the step's results), which are promoted to the parent epoch (or left to ordinary finalization at the outermost level). This matches a training loop's natural phase boundary (one step / one fold) and keeps the pure carrier unchanged; only the driver wraps each step. Mechanism (csrc/cuda/common/torchlean_cuda_arena.h, shared by the CUDA backend and the CPU stub): each tracked buffer points at a heap reg, and back; a buffer finalized mid-scope flips its reg's alive flag instead of leaving a dangling pointer, so the exit walk skips it. Registry mutation is mutex-guarded; the no-arena fast paths take no lock, so allocation and finalization outside an arena are unchanged. Wiring: buffer_alloc registers, finalize/drop_unboxed unlink, plus two IO export wrappers; the buffer struct gains one pointer field. Lean API: Buffer.arenaEnter / arenaExit / withCudaArena (Buffer.lean). Test: runArenaStress (NN/Tests/Runtime/Cuda/Stress.lean, wired into Stress.run) drives k live, never-released buffers through a scope -- the case release cannot reach -- holding them alive across the exit so only the arena can free them, and asserts via the allocator telemetry that (1) they are live in-scope, (2) the arena reclaims exactly them at exit with live bytes restored, and (3) a kept buffer is promoted and stays usable. The assertions use only the shared allocator counters, so the test is backend-agnostic; verified on the real CUDA backend, and the CPU stub compiles cleanly for GPU-less CI.

A buffer reclaimed by `arena_exit` keeps its Lean external object but has size 0 and freed/cached data; using it afterwards is a use-after-free. Before, that surfaced only as a cryptic "size mismatch (N vs 0)" deep in a kernel -- and slipped through silently when both operands were freed, since the bare size check passes (0 == 0). Add an opt-in detector (TORCHLEAN_ARENA_DEBUG=1): `arena_exit` records the reclaiming epoch in a new `arena_freed_depth` buffer field, and the `require_same_size{2,3}` choke point (which every binary/ternary op already calls) asserts liveness before comparing sizes, so a stale operand panics naming the op and epoch: use-after-arena-free: torchlean_cuda_buffer_add lhs was reclaimed by arena_exit at depth 0 When off (the default) the detector is one predicted branch on a cached env read plus an 8-byte field; the suite runs at the same wall-clock on or off. Regression coverage: `runArenaDetectorDeathTest` in nn_tests_suite (runs on both the CUDA build and the CPU stub). A detected UAF must panic, which cannot be caught in-process, so it forks the suite binary per configuration and inspects the outcome: * positive -- detector on + planted UAF => child aborts with the message * negative -- detector on + a valid promotion reused in a binary op (through the detector's own choke point) => clean exit * control -- detector off + planted UAF => clean exit (silent slip) check.sh --ci-all --cuda: build + test + lint all green.

NicolasRouquette added 2 commits June 23, 2026 07:59

NicolasRouquette mentioned this pull request Jun 23, 2026

feat(cuda): bound the device block reuse cache with an env byte cap #10

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add withCudaArena: scoped device-memory reclamation for long eager loops#9

Add withCudaArena: scoped device-memory reclamation for long eager loops#9
NicolasRouquette wants to merge 2 commits into
lean-dojo:mainfrom
NicolasRouquette:cuda-arena-scope

NicolasRouquette commented Jun 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

NicolasRouquette commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Mechanism

Lean API

Test

Notes

Follow-up: use-after-arena-free detector (commit eed66d6)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NicolasRouquette commented Jun 23, 2026 •

edited

Loading

Follow-up: use-after-arena-free detector (commit `eed66d6`)