Add withCudaArena: scoped device-memory reclamation for long eager loops #9
Open
NicolasRouquette wants to merge 2 commits into
Long pure eager loops (a foldl of Buffer -> Buffer ops with no IO sequencing) keep every intermediate Buffer GC-reachable until the final readback, so the reference-counting finalizers that free device memory never run and the working set grows with the loop length. Explicit release cannot help from inside a pure carrier: there is no IO point at which to call it.

withCudaArena opens an allocation epoch; every device buffer allocated while it is open is tracked. On exit it frees the device data of all of them -- reachable or not -- except a small `keep` set (the step's results), which are promoted to the parent epoch (or left to ordinary finalization at the outermost level). This matches a training loop's natural phase boundary (one step / one fold) and keeps the pure carrier unchanged; only the driver wraps each step.

Mechanism (csrc/cuda/common/torchlean_cuda_arena.h, shared by the CUDA backend and the CPU stub): each tracked buffer points at a heap reg, and the reg points back; a buffer finalized mid-scope flips its reg's alive flag instead of leaving a dangling pointer, so the exit walk skips it. Registry mutation is mutex-guarded; the no-arena fast paths take no lock, so allocation and finalization outside an arena are unchanged.

Wiring: buffer_alloc registers, finalize/drop_unboxed unlink, plus two IO export wrappers; the buffer struct gains one pointer field.

Lean API: Buffer.arenaEnter / arenaExit / withCudaArena (Buffer.lean).

Test: runArenaStress (NN/Tests/Runtime/Cuda/Stress.lean, wired into Stress.run) drives k live, never-released buffers through a scope -- the case release cannot reach -- holding them alive across the exit so only the arena can free them, and asserts via the allocator telemetry that (1) they are live in-scope, (2) the arena reclaims exactly them at exit with live bytes restored, and (3) a kept buffer is promoted and stays usable.

The assertions use only the shared allocator counters, so the test is backend-agnostic; verified on the real CUDA backend, and the CPU stub compiles cleanly for GPU-less CI.

A buffer reclaimed by `arena_exit` keeps its Lean external object but has
size 0 and freed/cached data; using it afterwards is a use-after-free.
Before, that surfaced only as a cryptic "size mismatch (N vs 0)" deep in a
kernel -- and slipped through silently when both operands were freed, since
the bare size check passes (0 == 0).
Add an opt-in detector (TORCHLEAN_ARENA_DEBUG=1): `arena_exit` records the
reclaiming epoch in a new `arena_freed_depth` buffer field, and the
`require_same_size{2,3}` choke point (which every binary/ternary op already
calls) asserts liveness before comparing sizes, so a stale operand panics
naming the op and epoch:
    use-after-arena-free: torchlean_cuda_buffer_add lhs was reclaimed by
    arena_exit at depth 0
When off (the default) the detector is one predicted branch on a cached env
read plus an 8-byte field; the suite runs at the same wall-clock on or off.
Regression coverage: `runArenaDetectorDeathTest` in nn_tests_suite (runs on
both the CUDA build and the CPU stub). A detected UAF must panic, which
cannot be caught in-process, so it forks the suite binary per configuration
and inspects the outcome:
* positive -- detector on + planted UAF => child aborts with the message
* negative -- detector on + a valid promotion reused in a binary op
(through the detector's own choke point) => clean exit
* control -- detector off + planted UAF => clean exit (silent slip)
check.sh --ci-all --cuda: build + test + lint all green.
Summary
Adds `withCudaArena`, a scoped device-memory arena for the eager CUDA backend.

Large CUDA buffers are wrapped as Lean external objects freed by reference-counting finalizers. In a long pure eager loop (a `foldl` of `Buffer → Buffer` ops with no IO sequencing) every intermediate `Buffer` stays GC-reachable until the final readback, so the finalizers never run and the device working set grows with the loop length. Explicit `release` cannot help from inside a pure carrier: there is no IO point at which to call it.

`withCudaArena` opens an allocation epoch; every device buffer allocated while it is open is tracked. On exit it frees the device data of all of them, reachable or not, except a small `keep` set (the step's results), which are promoted to the parent epoch (or left to ordinary finalization at the outermost level). This matches a training loop's natural phase boundary (one step / one fold), and it is compatible with a pure carrier: the carrier is unchanged; only the driver wraps each step.
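The epoch and `keep` semantics described above can be illustrated with a small host-only C++ model. This is a sketch under stated assumptions, not the PR's implementation: `ToyBuffer`, `ToyArena`, and the byte counter are hypothetical names, and "freeing device data" is modelled by setting `bytes` to 0.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for a device buffer. `bytes == 0` models
// "device data reclaimed"; this is not the real torchlean buffer struct.
struct ToyBuffer {
    std::size_t bytes;
};
using BufferRef = std::shared_ptr<ToyBuffer>;

class ToyArena {
public:
    // Open a new allocation epoch nested inside the current one.
    void enter() { epochs_.emplace_back(); }

    // Allocate a buffer; it is tracked only while some epoch is open.
    BufferRef alloc(std::size_t bytes) {
        BufferRef buf = std::make_shared<ToyBuffer>(ToyBuffer{bytes});
        live_bytes_ += bytes;
        if (!epochs_.empty()) epochs_.back().push_back(buf);
        return buf;
    }

    // Close the innermost epoch. Everything it tracked is reclaimed except the
    // buffers in `keep`, which move to the parent epoch (or become untracked
    // when the closing epoch is the outermost one).
    void exit(const std::vector<BufferRef>& keep) {
        assert(!epochs_.empty());
        std::unordered_set<const ToyBuffer*> kept;
        for (const BufferRef& k : keep) kept.insert(k.get());

        std::vector<BufferRef> closing = std::move(epochs_.back());
        epochs_.pop_back();
        for (const BufferRef& buf : closing) {
            if (kept.count(buf.get()) != 0) {
                if (!epochs_.empty()) epochs_.back().push_back(buf);  // promote
            } else {
                live_bytes_ -= buf->bytes;  // reclaimed even if still referenced
                buf->bytes = 0;
            }
        }
    }

    std::size_t live_bytes() const { return live_bytes_; }

private:
    std::vector<std::vector<BufferRef>> epochs_;  // innermost epoch is last
    std::size_t live_bytes_ = 0;
};
```

In this model a driver would call `enter()` before a step, run the step, and call `exit({result})` afterwards. A buffer promoted to a parent epoch is reclaimed when that parent exits unless it is kept again.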
Mechanism
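`csrc/cuda/common/torchlean_cuda_arena.h` (a new shared header, used by both the CUDA backend and the CPU stub) holds an epoch registry:
* each tracked buffer points at a heap reg, and the reg points back at the buffer;
* a buffer finalized mid-scope flips its reg's `alive` flag instead of leaving a dangling pointer, so the exit walk skips it (no use-after-free, no double-free);
* registry mutation is mutex-guarded; the no-arena fast paths take no lock, so allocation and finalization outside an arena are unchanged.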
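The three registry properties listed above can be sketched in host-only C++ as follows. All names here (`Reg`, `Buf`, `Registry`) are hypothetical and the layout is an assumption; the real header is not reproduced in this page. The sketch omits the `keep` set and shows only the `alive` flag, the mutex, and the lock-free path when no arena is open.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

struct Buf;

// Heap-allocated registration node. The registry owns it, so it stays valid
// even if the buffer it describes is finalized first.
struct Reg {
    Buf* buf;    // back-pointer; cleared when the buffer is finalized
    bool alive;  // false once the buffer was finalized mid-scope
};

struct Buf {
    explicit Buf(std::size_t n) : bytes(n), reg(nullptr) {}
    std::size_t bytes;
    std::atomic<Reg*> reg;  // the one extra pointer field; null when untracked
};

class Registry {
public:
    void enter() {
        std::lock_guard<std::mutex> lock(mu_);
        epochs_.emplace_back();
        depth_.store(epochs_.size());
    }

    Buf* alloc(std::size_t bytes) {
        Buf* b = new Buf(bytes);
        live_bytes_.fetch_add(bytes);
        if (depth_.load() == 0) return b;  // no arena open: no lock taken
        std::lock_guard<std::mutex> lock(mu_);
        if (epochs_.empty()) return b;
        Reg* r = new Reg{b, true};
        epochs_.back().push_back(r);
        b->reg.store(r);
        return b;
    }

    // Finalizer path; may run on any thread.
    void finalize(Buf* b) {
        if (b->reg.load() != nullptr) {
            std::lock_guard<std::mutex> lock(mu_);
            if (Reg* r = b->reg.load()) {  // re-check under the lock
                r->alive = false;          // the exit walk will skip this reg
                r->buf = nullptr;
            }
        }
        live_bytes_.fetch_sub(b->bytes);
        delete b;
    }

    // Close the innermost epoch; returns how many buffers were reclaimed.
    std::size_t exit() {
        std::lock_guard<std::mutex> lock(mu_);
        assert(!epochs_.empty());
        std::vector<Reg*> regs = std::move(epochs_.back());
        epochs_.pop_back();
        depth_.store(epochs_.size());
        std::size_t reclaimed = 0;
        for (Reg* r : regs) {
            if (r->alive) {
                live_bytes_.fetch_sub(r->buf->bytes);
                r->buf->bytes = 0;            // data freed; the object survives
                r->buf->reg.store(nullptr);   // later finalize takes the fast path
                ++reclaimed;
            }
            delete r;
        }
        return reclaimed;
    }

    std::size_t live_bytes() const { return live_bytes_.load(); }

private:
    std::mutex mu_;
    std::vector<std::vector<Reg*>> epochs_;
    std::atomic<std::size_t> depth_{0};
    std::atomic<std::size_t> live_bytes_{0};
};
```

Because the reg outlives a finalized buffer, the exit walk never dereferences a freed buffer, and a buffer already reclaimed by `exit()` contributes 0 bytes when its finalizer eventually runs.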
Wiring is minimal: `torchlean_cuda_buffer_alloc` registers, `finalize`/`drop_unboxed` unlink, plus two IO export wrappers (`torchlean_cuda_arena_enter`/`_exit`). The buffer struct gains one pointer field.
Lean API
    Buffer.arenaEnter : IO Unit
    Buffer.arenaExit (keep : @& Array Buffer) : IO Unit
    Buffer.withCudaArena (keep : α → Array Buffer) (body : IO α) : IO α  -- exception-safe
Contract: every buffer that must outlive the scope must appear in `keep`; touching any other in-scope buffer after `arenaExit` is a use-after-free, the same hazard as reading a buffer after `release`.
Test
`runArenaStress` (`NN/Tests/Runtime/Cuda/Stress.lean`, wired into `Stress.run`) drives `k` live, never-released buffers through a scope (the case `release` cannot reach), holding them alive across the exit so only the arena can free them. Using the existing allocator telemetry it asserts:
* the `k` buffers are live in-scope (allocated, not yet freed);
* at exit the arena reclaims exactly them (`freeCount += k`, live bytes restored) even though they are still referenced;
* a kept buffer is promoted and stays usable.
The assertions read only the shared allocator counters (`Buffer.allocatorStats`), which have CPU-stub parity, and the arena registry itself is one header shared by both backends, so the test is backend-agnostic. Verified by execution on the real CUDA backend; the CPU-stub backend compiles cleanly (so it builds in GPU-less CI).
Observed on the real CUDA backend (`-K cuda=true`):
Notes
prior behavior.
`arenaEnter`/`arenaExit` and the allocations between them run on one driver thread; finalizers (which only ever unlink) may run on any thread and are mutex-safe. Concurrent arenas on multiple threads are out of scope for this first cut.
Follow-up: use-after-arena-free detector (commit eed66d6)
A buffer reclaimed by `arenaExit` keeps its Lean external object but has size 0 and freed/cached data; using it afterwards is a use-after-free. It previously surfaced only as a cryptic `size mismatch (N vs 0)` deep in a kernel, and slipped through silently when both operands were freed, since the bare size check passes (`0 == 0`).
This adds an opt-in detector (`TORCHLEAN_ARENA_DEBUG=1`): `arenaExit` records the reclaiming epoch in a new `arena_freed_depth` buffer field, and the `require_same_size{2,3}` choke point (which every binary/ternary op already calls) asserts liveness before comparing sizes, so a stale operand panics naming the op, operand, and epoch:
    use-after-arena-free: torchlean_cuda_buffer_add lhs was reclaimed by
    arena_exit at depth 0
When off (the default), the detector is one predicted branch on a cached env read plus an 8-byte field; the suite's wall-clock time is the same with it on or off.
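A minimal C++ sketch of a liveness check at a shared size-check choke point, as described above, might look like this. Several things are assumptions made for illustration: the `DbgBuffer` struct, the `-1` sentinel for "never reclaimed", the `mark_reclaimed` helper, and the `detect` parameter. The sketch throws exceptions where the PR panics, purely so the behaviour can be exercised in-process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical buffer header; only the two fields the check needs.
struct DbgBuffer {
    std::size_t size;
    std::int64_t arena_freed_depth;  // -1: never reclaimed (sentinel is an assumption)
};

// Read the environment once and cache the answer for later calls.
inline bool arena_debug_enabled() {
    static const bool enabled = [] {
        const char* v = std::getenv("TORCHLEAN_ARENA_DEBUG");
        return v != nullptr && std::string(v) == "1";
    }();
    return enabled;
}

// What an arena exit would do to a reclaimed buffer at epoch `depth`.
inline void mark_reclaimed(DbgBuffer& b, std::int64_t depth) {
    b.size = 0;
    b.arena_freed_depth = depth;
}

inline void check_live(const char* op, const char* operand, const DbgBuffer& b) {
    if (b.arena_freed_depth >= 0) {
        throw std::logic_error(std::string("use-after-arena-free: ") + op + " " +
                               operand + " was reclaimed by arena_exit at depth " +
                               std::to_string(b.arena_freed_depth));
    }
}

// Choke point for binary ops: liveness first (when enabled), then sizes.
inline void require_same_size2(const char* op, const DbgBuffer& lhs,
                               const DbgBuffer& rhs,
                               bool detect = arena_debug_enabled()) {
    if (detect) {
        check_live(op, "lhs", lhs);
        check_live(op, "rhs", rhs);
    }
    if (lhs.size != rhs.size) {
        throw std::runtime_error(std::string(op) + ": size mismatch (" +
                                 std::to_string(lhs.size) + " vs " +
                                 std::to_string(rhs.size) + ")");
    }
}
```

With `detect` false and both operands reclaimed, the size comparison sees `0 == 0` and passes, which is the silent slip the detector is meant to close.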
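Regression test: `runArenaDetectorDeathTest` (in `nn_tests_suite`, runs in the normal suite on both the CUDA build and the CPU stub). A detected UAF must panic, which cannot be caught in-process, so it forks the suite binary per configuration and inspects the outcome:
* positive: detector on + planted UAF ⇒ child aborts with the message;
* negative: detector on + a valid promotion reused in a binary op (through the detector's own choke point) ⇒ clean exit, so the detector never fires on a kept buffer;
* control: detector off + planted UAF ⇒ clean exit (silent slip).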
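The three configurations above follow the usual death-test pattern: run the risky body in a child process and classify how it ended. The POSIX C++ sketch below is a simplification under stated assumptions: it uses a plain `fork()` of the current process rather than re-running a suite binary, it checks only the termination signal and not the panic message, and `toy_op` and `run_in_child` are hypothetical names.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class Outcome { CleanExit, Aborted, Other };

// Toy stand-in for an op that goes through the detector's choke point: with
// the detector on, a stale operand aborts; with it off, it slips through.
inline void toy_op(bool detector_on, bool operand_is_stale) {
    if (detector_on && operand_is_stale) std::abort();
}

// Run `body` in a forked child and classify how the child terminated.
template <typename Body>
Outcome run_in_child(Body body) {
    const pid_t pid = fork();
    if (pid < 0) return Outcome::Other;  // fork failed
    if (pid == 0) {
        body();
        _exit(0);  // leave without running the parent's atexit handlers
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return Outcome::Other;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return Outcome::CleanExit;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) return Outcome::Aborted;
    return Outcome::Other;
}
```

The parent survives the child's abort, so one test run can cover the positive, negative, and control cases in sequence.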
It already earned its keep: it pinpointed a real cross-call use-after-arena-free in a downstream nested-arena consumer (a loop-invariant constant, CSE'd to one allocation shared across repeated fit calls, allocated in the first call's outer epoch and reused after that epoch freed it), which the bare `size mismatch` message could not diagnose.
`scripts/checks/check.sh --ci-all --cuda`: build + test + lint all green.