Skip to content

Add withCudaArena: scoped device-memory reclamation for long eager loops#9

Open
NicolasRouquette wants to merge 2 commits into
lean-dojo:mainfrom
NicolasRouquette:cuda-arena-scope
Open

Add withCudaArena: scoped device-memory reclamation for long eager loops#9
NicolasRouquette wants to merge 2 commits into
lean-dojo:mainfrom
NicolasRouquette:cuda-arena-scope

Conversation

@NicolasRouquette

@NicolasRouquette NicolasRouquette commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds withCudaArena, a scoped device-memory arena for the eager CUDA backend.

Large CUDA buffers are wrapped as Lean external objects freed by reference-counting finalizers. In a
long pure eager loop — a foldl of Buffer → Buffer ops with no IO sequencing — every
intermediate Buffer stays GC-reachable until the final readback, so the finalizers never run and
the device working set grows with the loop length. Explicit release cannot help from inside a pure
carrier: there is no IO point at which to call it.

withCudaArena opens an allocation epoch; every device buffer allocated while it is open is tracked.
On exit it frees the device data of all of them — reachable or not — except a small keep set
(the step's results), which are promoted to the parent epoch (or left to ordinary finalization at the
outermost level). This matches a training loop's natural phase boundary (one step / one fold), and it
is compatible with a pure carrier: the carrier is unchanged; only the driver wraps each step.

Mechanism

csrc/cuda/common/torchlean_cuda_arena.h (a new shared header, used by both the CUDA backend and the
CPU stub) holds an epoch registry:

  • each tracked buffer points at a heap registration record, and back;
  • a buffer finalized mid-scope flips its record's alive flag instead of leaving a dangling
    pointer, so the exit walk skips it (no use-after-free, no double-free);
  • registry mutation is mutex-guarded; the no-arena fast paths take no lock, so allocation and
    finalization outside an arena are unchanged.

Wiring is minimal: torchlean_cuda_buffer_alloc registers, finalize/drop_unboxed unlink, plus two
IO export wrappers (torchlean_cuda_arena_enter/_exit). The buffer struct gains one pointer field.

Lean API

Buffer.arenaEnter : IO Unit
Buffer.arenaExit  (keep : @& Array Buffer) : IO Unit
Buffer.withCudaArena (keep : α → Array Buffer) (body : IO α) : IO α  -- exception-safe

Contract: every buffer that must outlive the scope must appear in keep; touching any other
in-scope buffer after arenaExit is a use-after-free, the same hazard as reading a buffer after
release.

Test

runArenaStress (NN/Tests/Runtime/Cuda/Stress.lean, wired into Stress.run) drives k live,
never-released buffers through a scope — the case release cannot reach — holding them alive
across the exit so only the arena can free them. Using the existing allocator telemetry it asserts:

  1. the k buffers are live in-scope (allocated, not yet freed);
  2. scope exit reclaims exactly them — freeCount += k, live bytes restored — even though they are
    still referenced;
  3. a kept buffer is promoted out of the scope and stays usable afterward.

The assertions read only the shared allocator counters (Buffer.allocatorStats), which have CPU-stub
parity, and the arena registry itself is one header shared by both backends — so the test is
backend-agnostic. Verified by execution on the real CUDA backend; the CPU-stub backend compiles
cleanly (so it builds in GPU-less CI).

Observed on the real CUDA backend (-K cuda=true):

== cuda arena scope stress ==
  baseline:        live=0.080238 MiB ... allocs=954 frees=476 ... cuda_total=20009.250000 MiB
  reclaimed 64 live in-scope buffers across 4 arenas
  promoted buffer survived its arena
...
== TorchLean: all curated tests passed ==

Notes

  • Additive: no production op changed; allocation/finalization outside an arena is byte-for-byte the
    prior behavior.
  • Threading: arenaEnter/arenaExit and the allocations between them run on one driver thread;
    finalizers (which only ever unlink) may run on any thread and are mutex-safe. Concurrent arenas on
    multiple threads are out of scope for this first cut.

Follow-up: use-after-arena-free detector (commit eed66d6)

A buffer reclaimed by arenaExit keeps its Lean external object but has size 0 and freed/cached data;
using it afterwards is a use-after-free. It previously surfaced only as a cryptic size mismatch (N vs 0)
deep in a kernel — and slipped through silently when both operands were freed, since the bare size
check passes (0 == 0).

This adds an opt-in detector (TORCHLEAN_ARENA_DEBUG=1): arenaExit records the reclaiming epoch in a
new arena_freed_depth buffer field, and the require_same_size{2,3} choke point (which every
binary/ternary op already calls) asserts liveness before comparing sizes, so a stale operand panics
naming the op, operand, and epoch:

use-after-arena-free: torchlean_cuda_buffer_mul rhs was reclaimed by arena_exit at depth 0

When off (the default), the detector is one predicted branch on a cached env read plus an 8-byte field;
the suite runs at the same wall-clock on or off.

Regression test runArenaDetectorDeathTest (in nn_tests_suite, runs in the normal suite on both
the CUDA build and the CPU stub). A detected UAF must panic, which cannot be caught in-process, so it
forks the suite binary per configuration and inspects the outcome:

  • positive — detector on + a planted UAF ⇒ the child aborts with the message;
  • negative — detector on + a valid promotion reused in a binary op (through the detector's own
    choke point) ⇒ clean exit, so the detector never fires on a kept buffer;
  • control — detector off + the planted UAF ⇒ clean exit (the silent slip the detector closes).

It already earned its keep: it pinpointed a real cross-call use-after-arena-free in a downstream
nested-arena consumer — a loop-invariant constant, CSE'd to one allocation shared across repeated fit
calls, allocated in the first call's outer epoch and reused after that epoch freed it — which the bare
size mismatch message could not diagnose.

scripts/checks/check.sh --ci-all --cuda: build + test + lint all green.

Long pure eager loops (a foldl of Buffer -> Buffer ops with no IO
sequencing) keep every intermediate Buffer GC-reachable until the final
readback, so the reference-counting finalizers that free device memory
never run and the working set grows with the loop length. Explicit
release cannot help from inside a pure carrier: there is no IO point at
which to call it.

withCudaArena opens an allocation epoch; every device buffer allocated
while it is open is tracked. On exit it frees the device data of all of
them -- reachable or not -- except a small `keep` set (the step's
results), which are promoted to the parent epoch (or left to ordinary
finalization at the outermost level). This matches a training loop's
natural phase boundary (one step / one fold) and keeps the pure carrier
unchanged; only the driver wraps each step.

Mechanism (csrc/cuda/common/torchlean_cuda_arena.h, shared by the CUDA
backend and the CPU stub): each tracked buffer points at a heap reg, and
back; a buffer finalized mid-scope flips its reg's alive flag instead of
leaving a dangling pointer, so the exit walk skips it. Registry mutation
is mutex-guarded; the no-arena fast paths take no lock, so allocation and
finalization outside an arena are unchanged. Wiring: buffer_alloc
registers, finalize/drop_unboxed unlink, plus two IO export wrappers; the
buffer struct gains one pointer field.

Lean API: Buffer.arenaEnter / arenaExit / withCudaArena (Buffer.lean).

Test: runArenaStress (NN/Tests/Runtime/Cuda/Stress.lean, wired into
Stress.run) drives k live, never-released buffers through a scope -- the
case release cannot reach -- holding them alive across the exit so only
the arena can free them, and asserts via the allocator telemetry that
(1) they are live in-scope, (2) the arena reclaims exactly them at exit
with live bytes restored, and (3) a kept buffer is promoted and stays
usable. The assertions use only the shared allocator counters, so the
test is backend-agnostic; verified on the real CUDA backend, and the CPU
stub compiles cleanly for GPU-less CI.
A buffer reclaimed by `arena_exit` keeps its Lean external object but has
size 0 and freed/cached data; using it afterwards is a use-after-free.
Before, that surfaced only as a cryptic "size mismatch (N vs 0)" deep in a
kernel -- and slipped through silently when both operands were freed, since
the bare size check passes (0 == 0).

Add an opt-in detector (TORCHLEAN_ARENA_DEBUG=1): `arena_exit` records the
reclaiming epoch in a new `arena_freed_depth` buffer field, and the
`require_same_size{2,3}` choke point (which every binary/ternary op already
calls) asserts liveness before comparing sizes, so a stale operand panics
naming the op and epoch:

  use-after-arena-free: torchlean_cuda_buffer_add lhs was reclaimed by
  arena_exit at depth 0

When off (the default) the detector is one predicted branch on a cached env
read plus an 8-byte field; the suite runs at the same wall-clock on or off.

Regression coverage: `runArenaDetectorDeathTest` in nn_tests_suite (runs on
both the CUDA build and the CPU stub). A detected UAF must panic, which
cannot be caught in-process, so it forks the suite binary per configuration
and inspects the outcome:
  * positive -- detector on + planted UAF => child aborts with the message
  * negative -- detector on + a valid promotion reused in a binary op
                (through the detector's own choke point) => clean exit
  * control  -- detector off + planted UAF => clean exit (silent slip)

check.sh --ci-all --cuda: build + test + lint all green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant