feat(cuda): bound the device block reuse cache with an env byte cap #10

Open
NicolasRouquette wants to merge 3 commits into lean-dojo:main from NicolasRouquette:cuda-cache-byte-cap

Summary

Adds an opt-in byte cap on the CUDA buffer reuse cache (the exact-size device-block free list behind take_cached_block / return_cached_block).

The cache grows without bound: a dropped buffer is returned to it, and it is only emptied on a cudaMalloc-failure flush or at process exit. A loop over many distinct buffer sizes therefore accretes device memory that liveBytes does not even see — a returned block is accounted as freed before it is cached.

TORCHLEAN_CUDA_CACHE_CAP_BYTES bounds it (0 = unbounded, exactly the prior behaviour): a returned block that would grow the cache past the cap is freed immediately instead of being cached. The free first waits on the block's completion event, exactly as the flush path does, so an in-flight kernel never reads freed memory.
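
To make the cap decision concrete, here is a minimal host-only sketch of a byte-capped exact-size free list. It is not the PR's code: `BlockCache`, `give_back`, `take`, and `flush` are hypothetical names, `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree`, and the completion-event wait appears only as a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the device block reuse cache.
class BlockCache {
public:
    // cap_bytes == 0 means unbounded, mirroring the PR's default.
    explicit BlockCache(std::size_t cap_bytes) : cap_bytes_(cap_bytes) {}
    BlockCache(const BlockCache&) = delete;
    BlockCache& operator=(const BlockCache&) = delete;
    ~BlockCache() { flush(); }

    // Cache a returned block, or free it if caching would exceed the cap.
    // Returns true when the block was cached.
    bool give_back(void* ptr, std::size_t bytes) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            assert(cap_bytes_ == 0 || cache_bytes_ <= cap_bytes_);
            if (cap_bytes_ == 0 || bytes <= cap_bytes_ - cache_bytes_) {
                free_lists_[bytes].push_back(ptr);
                cache_bytes_ += bytes;  // incremented on cache-push
                return true;
            }
        }
        // Over the cap: a real backend would wait on the block's completion
        // event here, as the flush path does, before freeing.
        std::free(ptr);
        return false;
    }

    // Take an exact-size block from the cache, or nullptr on a miss.
    void* take(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = free_lists_.find(bytes);
        if (it == free_lists_.end() || it->second.empty()) return nullptr;
        void* ptr = it->second.back();
        it->second.pop_back();
        cache_bytes_ -= bytes;  // decremented on reuse
        return ptr;
    }

    // Free every cached block and reset the running total.
    void flush() {
        std::lock_guard<std::mutex> lock(mu_);
        for (auto& entry : free_lists_)
            for (void* ptr : entry.second) std::free(ptr);
        free_lists_.clear();
        cache_bytes_ = 0;  // reset on flush
    }

    std::size_t cache_bytes() const {
        std::lock_guard<std::mutex> lock(mu_);
        return cache_bytes_;
    }

private:
    const std::size_t cap_bytes_;
    mutable std::mutex mu_;
    std::unordered_map<std::size_t, std::vector<void*>> free_lists_;
    std::size_t cache_bytes_ = 0;
};
```

In this sketch, returning eight 1 MiB blocks to a cache capped at 1 MiB keeps one block and frees the other seven; with a cap of 0 all eight are kept.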

What's in it

  • A running cache_bytes total under the cache mutex (incremented on cache-push, decremented on reuse, reset on flush) backs the cap decision.
  • New telemetry field AllocatorStats.cacheBytes (extern torchlean_cuda_allocator_cache_bytes; mirrored in the CPU stub returning 0, which keeps no cache); AllocatorStats.format gains cache=<MiB>.
  • CUDA-only: the stub frees on release and has no cache to cap.

Test

runCacheCapTest (in nn_tests_suite, runs on both the CUDA build and the CPU stub). The cap is read once natively, so — like the existing arena-detector death test — it forks the suite binary per configuration:

  • capped — a 1 MiB cap bounds an 8 MiB return workload to exactly 1 MiB;
  • control — with no cap the same workload caches the full 8 MiB.

On CUDA it also checks the cap is the binding constraint (the cache fills to within one block of it). On the stub cacheBytes stays 0, so the cap bound holds trivially.
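
The reason the test needs one process per configuration is the read-once behaviour. A minimal sketch of that pattern follows; `parse_cap` and `cache_cap_bytes` are hypothetical helpers, and treating an unset, malformed, or out-of-range value as 0 (unbounded) is an assumption rather than something the PR states.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>

// Parse a decimal byte count. Anything that is not a plain non-negative
// integer yields 0, which this sketch treats as "unbounded" (an assumption).
std::size_t parse_cap(const char* text) {
    if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text))) {
        return 0;
    }
    errno = 0;
    char* end = nullptr;
    unsigned long long value = std::strtoull(text, &end, 10);
    if (errno != 0 || *end != '\0') return 0;
    return static_cast<std::size_t>(value);
}

// The environment is consulted on the first call only; the function-local
// static caches the result for the rest of the process lifetime.
std::size_t cache_cap_bytes() {
    static const std::size_t cap =
        parse_cap(std::getenv("TORCHLEAN_CUDA_CACHE_CAP_BYTES"));
    return cap;
}
```

Once `cache_cap_bytes()` has been called, changing the variable in the same process has no effect, so a capped run and a control run cannot share a process.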

scripts/checks/check.sh --ci-all --cuda: build + test + lint green.

Note: stacks on #9

This branches off #9 (cuda-arena-scope), reusing its buildArenaScratch test helper and fork harness. Until #9 merges, this PR shows #9's two commits as well; review only the last commit (feat(cuda): bound the device block reuse cache with an env byte cap). Once #9 merges to main, this PR reduces to that single commit. Merging #9 first is the intended order.

Long pure eager loops (a foldl of Buffer -> Buffer ops with no IO
sequencing) keep every intermediate Buffer GC-reachable until the final
readback, so the reference-counting finalizers that free device memory
never run and the working set grows with the loop length. Explicit
release cannot help from inside a pure carrier: there is no IO point at
which to call it.

withCudaArena opens an allocation epoch; every device buffer allocated
while it is open is tracked. On exit it frees the device data of all of
them -- reachable or not -- except a small `keep` set (the step's
results), which are promoted to the parent epoch (or left to ordinary
finalization at the outermost level). This matches a training loop's
natural phase boundary (one step / one fold) and keeps the pure carrier
unchanged; only the driver wraps each step.

Mechanism (csrc/cuda/common/torchlean_cuda_arena.h, shared by the CUDA
backend and the CPU stub): each tracked buffer points at a heap reg, and
back; a buffer finalized mid-scope flips its reg's alive flag instead of
leaving a dangling pointer, so the exit walk skips it. Registry mutation
is mutex-guarded; the no-arena fast paths take no lock, so allocation and
finalization outside an arena are unchanged. Wiring: buffer_alloc
registers, finalize/drop_unboxed unlink, plus two IO export wrappers; the
buffer struct gains one pointer field.
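
The registry mechanism can be sketched on the host as follows. This is an illustration only, not the contents of torchlean_cuda_arena.h: `Arena`, `Reg`, `enter`, `leave`, `alloc`, and `finalize` are hypothetical names, `std::malloc`/`std::free` stand in for device allocation, and for brevity every path takes the lock, whereas the commit keeps the no-arena fast paths lock-free.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <utility>
#include <vector>

struct Buffer;

// One registry record per tracked buffer. It outlives a buffer finalized
// mid-scope, so the exit walk never follows a dangling pointer.
struct Reg {
    Buffer* buf;
    bool alive;
};

struct Buffer {
    void* data;
    std::size_t size;
    Reg* reg;  // back-pointer; nullptr when the buffer is untracked
};

class Arena {
public:
    void enter() {
        std::lock_guard<std::mutex> lock(mu_);
        epochs_.emplace_back();
    }

    Buffer* alloc(std::size_t size) {
        Buffer* b = new Buffer{std::malloc(size), size, nullptr};
        std::lock_guard<std::mutex> lock(mu_);
        if (!epochs_.empty()) {  // register only while an epoch is open
            b->reg = new Reg{b, true};
            epochs_.back().push_back(b->reg);
        }
        return b;
    }

    // Ordinary finalization: flip the alive flag instead of unlinking.
    void finalize(Buffer* b) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            if (b->reg != nullptr) {
                b->reg->alive = false;
                b->reg->buf = nullptr;
            }
        }
        std::free(b->data);
        delete b;
    }

    // Close the innermost epoch and return the number of bytes reclaimed.
    std::size_t leave(const std::vector<Buffer*>& keep) {
        std::lock_guard<std::mutex> lock(mu_);
        assert(!epochs_.empty());
        std::vector<Reg*> regs = std::move(epochs_.back());
        epochs_.pop_back();
        std::size_t reclaimed = 0;
        for (Reg* r : regs) {
            if (!r->alive) {  // finalized mid-scope: skip it
                delete r;
                continue;
            }
            Buffer* b = r->buf;
            if (std::find(keep.begin(), keep.end(), b) != keep.end()) {
                if (!epochs_.empty()) {
                    epochs_.back().push_back(r);  // promote to parent epoch
                } else {
                    b->reg = nullptr;  // outermost: ordinary finalization
                    delete r;
                }
                continue;
            }
            reclaimed += b->size;  // reclaim, reachable or not
            std::free(b->data);
            b->data = nullptr;
            b->size = 0;
            b->reg = nullptr;
            delete r;
        }
        return reclaimed;
    }

private:
    std::mutex mu_;
    std::vector<std::vector<Reg*>> epochs_;
};
```

A reclaimed buffer keeps its object but ends up with size 0 and no data, which is the state the use-after-free detector in the next commit looks for.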

Lean API: Buffer.arenaEnter / arenaExit / withCudaArena (Buffer.lean).

Test: runArenaStress (NN/Tests/Runtime/Cuda/Stress.lean, wired into
Stress.run) drives k live, never-released buffers through a scope -- the
case release cannot reach -- holding them alive across the exit so only
the arena can free them, and asserts via the allocator telemetry that
(1) they are live in-scope, (2) the arena reclaims exactly them at exit
with live bytes restored, and (3) a kept buffer is promoted and stays
usable. The assertions use only the shared allocator counters, so the
test is backend-agnostic; verified on the real CUDA backend, and the CPU
stub compiles cleanly for GPU-less CI.

A buffer reclaimed by `arena_exit` keeps its Lean external object but has
size 0 and freed/cached data; using it afterwards is a use-after-free.
Before, that surfaced only as a cryptic "size mismatch (N vs 0)" deep in a
kernel -- and slipped through silently when both operands were freed, since
the bare size check passes (0 == 0).

Add an opt-in detector (TORCHLEAN_ARENA_DEBUG=1): `arena_exit` records the
reclaiming epoch in a new `arena_freed_depth` buffer field, and the
`require_same_size{2,3}` choke point (which every binary/ternary op already
calls) asserts liveness before comparing sizes, so a stale operand panics
naming the op and epoch:

  use-after-arena-free: torchlean_cuda_buffer_add lhs was reclaimed by
  arena_exit at depth 0

When off (the default) the detector is one predicted branch on a cached env
read plus an 8-byte field; the suite runs at the same wall-clock on or off.
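
A minimal sketch of such a choke point, under stated assumptions: the `Buffer` layout and the -1 "never reclaimed" sentinel are hypothetical, and the sketch throws `std::runtime_error` where the real detector panics, purely so it can be exercised in-process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <stdexcept>
#include <string>

struct Buffer {
    std::size_t size;
    std::int64_t arena_freed_depth;  // -1 (assumed sentinel) = never reclaimed
};

// Cached env read: the environment is consulted on the first call only.
bool arena_debug_enabled() {
    static const bool on = [] {
        const char* v = std::getenv("TORCHLEAN_ARENA_DEBUG");
        return v != nullptr && std::strcmp(v, "1") == 0;
    }();
    return on;
}

// Reject an operand that an arena exit has already reclaimed.
void require_live(const char* op, const char* side, const Buffer& b) {
    if (b.arena_freed_depth >= 0) {
        throw std::runtime_error(
            std::string("use-after-arena-free: ") + op + " " + side +
            " was reclaimed by arena_exit at depth " +
            std::to_string(b.arena_freed_depth));
    }
}

// Liveness is asserted before sizes are compared, so two stale operands
// (0 == 0) no longer slip through when the detector is on.
void require_same_size2(const char* op, const Buffer& lhs, const Buffer& rhs,
                        bool debug) {
    if (debug) {
        require_live(op, "lhs", lhs);
        require_live(op, "rhs", rhs);
    }
    if (lhs.size != rhs.size) {
        throw std::runtime_error(
            std::string(op) + ": size mismatch (" + std::to_string(lhs.size) +
            " vs " + std::to_string(rhs.size) + ")");
    }
}

// Convenience overload that uses the cached environment setting.
void require_same_size2(const char* op, const Buffer& lhs, const Buffer& rhs) {
    require_same_size2(op, lhs, rhs, arena_debug_enabled());
}
```

With `debug` false, two reclaimed operands pass the check silently, which is the slip the commit message describes; with `debug` true, the first stale operand is reported by name.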

Regression coverage: `runArenaDetectorDeathTest` in nn_tests_suite (runs on
both the CUDA build and the CPU stub). A detected UAF must panic, which
cannot be caught in-process, so it forks the suite binary per configuration
and inspects the outcome:
  * positive -- detector on + planted UAF => child aborts with the message
  * negative -- detector on + a valid promotion reused in a binary op
                (through the detector's own choke point) => clean exit
  * control  -- detector off + planted UAF => clean exit (silent slip)
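
The fork-and-inspect pattern behind these three cases can be sketched with POSIX calls. This is a simplification under stated assumptions: `run_in_child` and `Outcome` are hypothetical names, the child runs a callback rather than re-running the suite binary, and only the exit status is inspected, not the panic message.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <cstdlib>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class Outcome { CleanExit, NonZeroExit, Aborted, OtherSignal, ForkFailed };

// Run `body` in a forked child with one environment variable set (or unset
// when env_value is nullptr) and classify how the child terminated.
Outcome run_in_child(const char* env_name, const char* env_value,
                     const std::function<void()>& body) {
    pid_t pid = fork();
    if (pid < 0) return Outcome::ForkFailed;
    if (pid == 0) {
        // Child: configure the environment before anything reads it.
        if (env_value != nullptr) {
            setenv(env_name, env_value, 1);
        } else {
            unsetenv(env_name);
        }
        body();
        _exit(0);  // reached only if body() neither aborted nor exited
    }
    // Parent: wait for the child, retrying if interrupted by a signal.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return Outcome::ForkFailed;
    }
    if (WIFEXITED(status)) {
        return WEXITSTATUS(status) == 0 ? Outcome::CleanExit
                                        : Outcome::NonZeroExit;
    }
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) {
        return Outcome::Aborted;
    }
    return Outcome::OtherSignal;
}
```

In these terms the positive case expects `Outcome::Aborted`, while the negative and control cases expect `Outcome::CleanExit`.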

check.sh --ci-all --cuda: build + test + lint all green.

The reuse cache (return/take of exact-size device blocks) grew without bound: a
dropped buffer is returned to the cache, and the cache is only emptied on a
cudaMalloc-failure flush or at process exit. A loop over many distinct sizes
therefore accretes device memory that `liveBytes` does not see -- a returned
block is accounted as freed before it is cached.

Add an opt-in byte cap, `TORCHLEAN_CUDA_CACHE_CAP_BYTES` (0 = unbounded, the
prior behaviour exactly): a returned block that would grow the cache past the cap
is freed immediately -- after waiting on its completion event, exactly as the
flush path does, so an in-flight kernel never reads freed memory -- instead of
being cached. A running `cache_bytes` total under the cache mutex backs the
decision and is surfaced as a new `AllocatorStats.cacheBytes` telemetry field
(0 in the CPU stub, which keeps no cache and so has nothing to cap).

Regression test `runCacheCapTest` (nn_tests_suite, runs on both the CUDA build
and the CPU stub): the cap is read once natively, so -- like the arena detector
death test -- it forks the suite binary per configuration. With a 1 MiB cap an
8 MiB return workload is bounded to exactly 1 MiB; with no cap (control) the same
workload caches the full 8 MiB. On CUDA the test also checks the cap is the
binding constraint (the cache fills to within one block of it). On the stub
cacheBytes stays 0, so the cap bound holds trivially.

scripts/checks/check.sh --ci-all --cuda: build + test + lint green.