feat(cuda): bound the device block reuse cache with an env byte cap#10
Open
NicolasRouquette wants to merge 3 commits into
Long pure eager loops (a foldl of Buffer -> Buffer ops with no IO sequencing) keep every intermediate Buffer GC-reachable until the final readback, so the reference-counting finalizers that free device memory never run and the working set grows with the loop length. Explicit release cannot help from inside a pure carrier: there is no IO point at which to call it.

withCudaArena opens an allocation epoch; every device buffer allocated while it is open is tracked. On exit it frees the device data of all of them -- reachable or not -- except a small `keep` set (the step's results), which are promoted to the parent epoch (or left to ordinary finalization at the outermost level). This matches a training loop's natural phase boundary (one step / one fold) and keeps the pure carrier unchanged; only the driver wraps each step.

Mechanism (csrc/cuda/common/torchlean_cuda_arena.h, shared by the CUDA backend and the CPU stub): each tracked buffer points at a heap reg, and back; a buffer finalized mid-scope flips its reg's alive flag instead of leaving a dangling pointer, so the exit walk skips it. Registry mutation is mutex-guarded; the no-arena fast paths take no lock, so allocation and finalization outside an arena are unchanged.

Wiring: buffer_alloc registers, finalize/drop_unboxed unlink, plus two IO export wrappers; the buffer struct gains one pointer field.

Lean API: Buffer.arenaEnter / arenaExit / withCudaArena (Buffer.lean).

Test: runArenaStress (NN/Tests/Runtime/Cuda/Stress.lean, wired into Stress.run) drives k live, never-released buffers through a scope -- the case release cannot reach -- holding them alive across the exit so only the arena can free them, and asserts via the allocator telemetry that (1) they are live in-scope, (2) the arena reclaims exactly them at exit with live bytes restored, and (3) a kept buffer is promoted and stays usable. The assertions use only the shared allocator counters, so the test is backend-agnostic; verified on the real CUDA backend, and the CPU stub compiles cleanly for GPU-less CI.
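
The registry mechanism described above can be illustrated with a small host-memory sketch. This is not the code in torchlean_cuda_arena.h: the names `Reg`, `Buffer`, `Arena`, `enter_scope`, `exit_scope`, `alloc` and `finalize` are hypothetical, `malloc`/`free` stand in for device allocation, and for brevity every path takes the lock (the commit states that the real no-arena fast paths take none).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <unordered_set>
#include <utility>
#include <vector>

struct Buffer;

// Heap registry entry: points at its buffer, and the buffer points back.
struct Reg {
    Buffer* buf;
    bool alive;
};

struct Buffer {
    void* data;        // stands in for a device pointer
    std::size_t size;
    Reg* reg;          // back pointer; null when the buffer is untracked
};

class Arena {
public:
    // Open a new allocation epoch nested inside any current one.
    void enter_scope() {
        std::lock_guard<std::mutex> lock(mu_);
        epochs_.emplace_back();
    }

    // Allocate a buffer and register it if an epoch is open.
    Buffer* alloc(std::size_t n) {
        Buffer* b = new Buffer{std::malloc(n), n, nullptr};
        std::lock_guard<std::mutex> lock(mu_);
        if (!epochs_.empty()) {
            b->reg = new Reg{b, true};
            epochs_.back().push_back(b->reg);
        }
        return b;
    }

    // Ordinary finalization: flip the alive flag so the exit walk skips
    // this entry instead of following a dangling pointer.
    void finalize(Buffer* b) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            if (b->reg != nullptr) {
                b->reg->alive = false;
                b->reg->buf = nullptr;
            }
        }
        std::free(b->data);  // free(nullptr) is a no-op for reclaimed buffers
        delete b;
    }

    // Close the innermost epoch: free the data of every tracked buffer that
    // is still alive, except the keep set. Returns the bytes reclaimed.
    std::size_t exit_scope(const std::unordered_set<Buffer*>& keep) {
        std::lock_guard<std::mutex> lock(mu_);
        assert(!epochs_.empty());
        std::vector<Reg*> regs = std::move(epochs_.back());
        epochs_.pop_back();
        std::size_t freed = 0;
        for (Reg* r : regs) {
            if (!r->alive) { delete r; continue; }  // finalized mid-scope
            Buffer* b = r->buf;
            if (keep.count(b) != 0) {
                if (!epochs_.empty()) {
                    epochs_.back().push_back(r);    // promote to parent epoch
                } else {
                    b->reg = nullptr;               // outermost: ordinary finalization
                    delete r;
                }
                continue;
            }
            std::free(b->data);
            freed += b->size;
            b->data = nullptr;  // the object survives, but with size 0
            b->size = 0;
            b->reg = nullptr;
            delete r;
        }
        return freed;
    }

private:
    std::mutex mu_;
    std::vector<std::vector<Reg*>> epochs_;
};
```

A driver would call `enter_scope()` before a step, run the step, then call `exit_scope({result})` so that only the step's result outlives the epoch.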
A buffer reclaimed by `arena_exit` keeps its Lean external object but has
size 0 and freed/cached data; using it afterwards is a use-after-free.
Before, that surfaced only as a cryptic "size mismatch (N vs 0)" deep in a
kernel -- and slipped through silently when both operands were freed, since
the bare size check passes (0 == 0).
Add an opt-in detector (TORCHLEAN_ARENA_DEBUG=1): `arena_exit` records the
reclaiming epoch in a new `arena_freed_depth` buffer field, and the
`require_same_size{2,3}` choke point (which every binary/ternary op already
calls) asserts liveness before comparing sizes, so a stale operand panics
naming the op and epoch:
    use-after-arena-free: torchlean_cuda_buffer_add lhs was reclaimed by
    arena_exit at depth 0
When off (the default) the detector is one predicted branch on a cached env
read plus an 8-byte field; the suite runs at the same wall-clock on or off.
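
A minimal sketch of the detector, under stated assumptions: the `Buffer` layout, `detector_enabled`, `arena_reclaim` and `liveness_error` are hypothetical names, -1 is an assumed sentinel for "live", and the check returns a diagnostic string where the real choke point panics. It is meant to show why a bare size check lets a doubly-freed pair through (0 == 0) and how a liveness check ahead of it catches that case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>

struct Buffer {
    std::size_t size = 0;
    std::int64_t arena_freed_depth = -1;  // assumed sentinel: -1 means live
};

// Environment read cached on first use, so later calls cost one branch.
bool detector_enabled() {
    static const bool on = [] {
        const char* v = std::getenv("TORCHLEAN_ARENA_DEBUG");
        return v != nullptr && std::strcmp(v, "1") == 0;
    }();
    return on;
}

// What arena exit does to a reclaimed buffer in this sketch.
void arena_reclaim(Buffer& b, std::int64_t depth) {
    assert(depth >= 0);
    b.size = 0;
    b.arena_freed_depth = depth;
}

// Empty string when the operand is live, otherwise the diagnostic.
std::string liveness_error(const char* op, const char* role, const Buffer& b) {
    if (b.arena_freed_depth < 0) return "";
    return std::string("use-after-arena-free: ") + op + " " + role +
           " was reclaimed by arena_exit at depth " +
           std::to_string(b.arena_freed_depth);
}

// Returns an empty string on success. The real check panics instead.
std::string require_same_size2(const char* op, const Buffer& lhs,
                               const Buffer& rhs,
                               bool detector_on = detector_enabled()) {
    if (detector_on) {
        std::string err = liveness_error(op, "lhs", lhs);
        if (err.empty()) err = liveness_error(op, "rhs", rhs);
        if (!err.empty()) return err;
    }
    if (lhs.size != rhs.size) {
        return "size mismatch (" + std::to_string(lhs.size) + " vs " +
               std::to_string(rhs.size) + ")";
    }
    return "";
}
```

With the detector off, two reclaimed operands compare equal and the call succeeds; with it on, the same call reports the stale `lhs` and the reclaiming depth.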
Regression coverage: `runArenaDetectorDeathTest` in nn_tests_suite (runs on
both the CUDA build and the CPU stub). A detected UAF must panic, which
cannot be caught in-process, so it forks the suite binary per configuration
and inspects the outcome:
* positive -- detector on + planted UAF => child aborts with the message
* negative -- detector on + a valid promotion reused in a binary op
(through the detector's own choke point) => clean exit
* control -- detector off + planted UAF => clean exit (silent slip)
check.sh --ci-all --cuda: build + test + lint all green.
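
The three configurations above can be sketched with POSIX `fork`/`waitpid`. This is an illustration only: the commit says the test forks the suite binary, whereas this sketch forks the current process and runs a callback; `Outcome`, `run_in_child` and `check_operand` are hypothetical names, and checking the child's panic message (which would require capturing its stderr) is left out.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class Outcome { CleanExit, Aborted, Other };

// Hypothetical stand-in for the detector choke point: abort on a stale
// operand, but only when the detector is switched on.
void check_operand(bool stale) {
    const char* v = std::getenv("TORCHLEAN_ARENA_DEBUG");
    const bool on = v != nullptr && std::strcmp(v, "1") == 0;
    if (on && stale) std::abort();
}

// Run `body` in a child process with the detector variable set to
// `detector_env`, then classify how the child ended.
Outcome run_in_child(const char* detector_env,
                     const std::function<void()>& body) {
    assert(detector_env != nullptr);
    const pid_t pid = fork();
    if (pid < 0) return Outcome::Other;
    if (pid == 0) {
        // Child: configure, run, and leave without running parent cleanup.
        setenv("TORCHLEAN_ARENA_DEBUG", detector_env, 1);
        body();
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return Outcome::Other;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) {
        return Outcome::Aborted;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return Outcome::CleanExit;
    }
    return Outcome::Other;
}
```

The positive case expects `Outcome::Aborted` from `run_in_child("1", [] { check_operand(true); })`; the negative and control cases expect `Outcome::CleanExit`. Forking per configuration keeps an abort in one case from taking down the whole suite.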
The reuse cache (return/take of exact-size device blocks) grew without bound: a dropped buffer is returned to the cache, and the cache is only emptied on a cudaMalloc-failure flush or at process exit. A loop over many distinct sizes therefore accretes device memory that `liveBytes` does not see -- a returned block is accounted as freed before it is cached.

Add an opt-in byte cap, `TORCHLEAN_CUDA_CACHE_CAP_BYTES` (0 = unbounded, the prior behaviour exactly): a returned block that would grow the cache past the cap is freed immediately -- after waiting on its completion event, exactly as the flush path does, so an in-flight kernel never reads freed memory -- instead of being cached. A running `cache_bytes` total under the cache mutex backs the decision and is surfaced as a new `AllocatorStats.cacheBytes` telemetry field (0 in the CPU stub, which keeps no cache and so has nothing to cap).

Regression test `runCacheCapTest` (nn_tests_suite, runs on both the CUDA build and the CPU stub): the cap is read once natively, so -- like the arena detector death test -- it forks the suite binary per configuration. With a 1 MiB cap an 8 MiB return workload is bounded to exactly 1 MiB; with no cap (control) the same workload caches the full 8 MiB. On CUDA the test also checks the cap is the binding constraint (the cache fills to within one block of it). On the stub cacheBytes stays 0, so the cap bound holds trivially.

scripts/checks/check.sh --ci-all --cuda: build + test + lint green.
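
The cap decision can be sketched with a host-memory stand-in for the device cache. The class and method names (`BlockCache`, `take_block`, `return_block`, `flush`), the decimal parsing in `cap_from_env`, and the use of `malloc`/`free` in place of device allocation are all assumptions; the point where the real backend would wait on the block's completion event is marked with a comment only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <unordered_map>
#include <vector>

// Assumed parsing: a decimal byte count; unset or 0 means unbounded.
std::size_t cap_from_env() {
    const char* v = std::getenv("TORCHLEAN_CUDA_CACHE_CAP_BYTES");
    return v != nullptr
        ? static_cast<std::size_t>(std::strtoull(v, nullptr, 10))
        : 0;
}

class BlockCache {
public:
    explicit BlockCache(std::size_t cap_bytes) : cap_bytes_(cap_bytes) {}
    ~BlockCache() { flush(); }

    // Reuse a cached block of exactly n bytes, or allocate a fresh one.
    void* take_block(std::size_t n) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            auto it = free_lists_.find(n);
            if (it != free_lists_.end() && !it->second.empty()) {
                void* p = it->second.back();
                it->second.pop_back();
                assert(cache_bytes_ >= n);
                cache_bytes_ -= n;  // decremented on reuse
                return p;
            }
        }
        return std::malloc(n);
    }

    // Cache the block unless that would push the running total past the cap.
    void return_block(void* p, std::size_t n) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            if (cap_bytes_ == 0 || cache_bytes_ + n <= cap_bytes_) {
                free_lists_[n].push_back(p);
                cache_bytes_ += n;  // incremented on cache-push
                return;
            }
        }
        // Over the cap: the real backend would first wait on the block's
        // completion event here, as its flush path does.
        std::free(p);
    }

    // Free every cached block and reset the running total.
    void flush() {
        std::lock_guard<std::mutex> lock(mu_);
        for (auto& entry : free_lists_) {
            for (void* p : entry.second) std::free(p);
        }
        free_lists_.clear();
        cache_bytes_ = 0;
    }

    std::size_t cache_bytes() {
        std::lock_guard<std::mutex> lock(mu_);
        return cache_bytes_;
    }

private:
    std::mutex mu_;
    std::unordered_map<std::size_t, std::vector<void*>> free_lists_;
    std::size_t cache_bytes_ = 0;
    std::size_t cap_bytes_;  // 0 = unbounded
};
```

Taking 64 blocks of 128 KiB and returning them all leaves `cache_bytes()` at 1 MiB under `BlockCache(1 << 20)` and at 8 MiB under `BlockCache(0)`, mirroring the capped and control configurations of the regression test.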
Summary
Adds an opt-in byte cap on the CUDA buffer reuse cache (the exact-size device-block free list behind `take_cached_block`/`return_cached_block`).

The cache grows without bound: a dropped buffer is returned to it, and it is only emptied on a cudaMalloc-failure flush or at process exit. A loop over many distinct buffer sizes therefore accretes device memory that `liveBytes` does not even see, because a returned block is accounted as freed before it is cached.

`TORCHLEAN_CUDA_CACHE_CAP_BYTES` (0 = unbounded, the prior behaviour exactly) bounds it: a returned block that would grow the cache past the cap is freed immediately instead of being cached. The free happens only after waiting on the block's completion event, exactly as the flush path does, so an in-flight kernel never reads freed memory.

What's in it
* A running `cache_bytes` total under the cache mutex (incremented on cache-push, decremented on reuse, reset on flush) backs the cap decision.
* `AllocatorStats.cacheBytes` (extern `torchlean_cuda_allocator_cache_bytes`; mirrored in the CPU stub returning `0`, which keeps no cache); `AllocatorStats.format` gains `cache=<MiB>`.

Test
`runCacheCapTest` (in `nn_tests_suite`, runs on both the CUDA build and the CPU stub). The cap is read once natively, so, like the existing arena-detector death test, it forks the suite binary per configuration:
* with a 1 MiB cap, an 8 MiB return workload is bounded to exactly 1 MiB
* with no cap (control), the same workload caches the full 8 MiB

On CUDA it also checks the cap is the binding constraint (the cache fills to within one block of it). On the stub `cacheBytes` stays 0, so the cap bound holds trivially.

`scripts/checks/check.sh --ci-all --cuda`: build + test + lint green.

Note: stacks on #9
This branches off #9 (`cuda-arena-scope`), reusing its `buildArenaScratch` test helper and fork harness. Until #9 merges, this PR shows #9's two commits as well; review only the last commit (`feat(cuda): bound the device block reuse cache with an env byte cap`). Once #9 merges to `main`, this PR reduces to that single commit. Merging #9 first is the intended order.