Skip to content

Add a leak-bound allocator stress regression for long eager CUDA loops#8

Open
NicolasRouquette wants to merge 1 commit into
lean-dojo:mainfrom
NicolasRouquette:cuda-allocator-leak-bound-stress
Open

Add a leak-bound allocator stress regression for long eager CUDA loops#8
NicolasRouquette wants to merge 1 commit into
lean-dojo:mainfrom
NicolasRouquette:cuda-allocator-leak-bound-stress

Conversation

@NicolasRouquette

Copy link
Copy Markdown
Contributor

Summary

Adds runLeakBoundStress to the CUDA runtime stress suite
(NN/Tests/Runtime/Cuda/Stress.lean) — a regression test that turns the existing
allocator telemetry into an asserted leak-bound invariant for long eager loops.

Large CUDA device buffers are wrapped as Lean external objects, so they are freed by
reference-counting finalizers. In long training loops those finalizers can lag, which is
exactly why the runtime carries an explicit release discipline (Buffer.release /
releaseThen / releaseManyThen / tape cleanup). The runtime already exposes
per-allocator counters (Buffer.allocatorStats: live/peak bytes, alloc/free counts), but
nothing asserted an invariant from them. This test closes that gap.

What it checks

A small eager workload (fulladdmulreduceSum) is run through the explicit
release discipline for two equal blocks of steps. It then asserts:

  1. The release calls fired — each step freed exactly its allocations (sum of the
    release return codes), which also keeps Lean from eliminating the frees as dead code.
  2. Bounded working set — net live allocations and live bytes after 2× the steps match
    the 1× snapshot. A per-step leak would grow linearly and fail here. This is the
    leak-bound invariant.
  3. Returns to baseline — with every buffer released, the loop's net live allocations
    return to the pre-loop baseline.

Why

It regression-gates the long-training-loop finalizer-lag class that the release discipline
exists to prevent: if a future op allocates without releasing, the working set grows with
step count and this test fails.

Notes

  • Runs on the CPU stub (no GPU required) via the shared allocator-counter parity, so it
    works in ordinary CI; under -K cuda=true it exercises the real CUDA allocator.
  • Additive only: one new def leakStep, one new def runLeakBoundStress, wired into
    Stress.run. No production code changed.

Observed locally (lake exe nn_tests_suite), working set flat across step count:

baseline:        live=0.080238 MiB  allocs=954   frees=476    net=478
after 128 steps: live=0.080238 MiB  allocs=1594  frees=1116   net=478
after 256 steps: live=0.080238 MiB  allocs=2234  frees=1756   net=478

🤖 Generated with Claude Code

Add `runLeakBoundStress` to the CUDA stress suite. It drives a small
eager workload (full/add/mul/reduceSum) through the explicit
`Buffer.release` discipline for two equal blocks of steps and asserts the
device working set is bounded independent of step count, using the
allocator telemetry that already exists (`Buffer.allocatorStats`):

  - each step freed exactly its allocations, so the release calls fired
    and were not eliminated as dead code;
  - net live allocations and live bytes after 2x the steps match the 1x
    snapshot (the leak-bound invariant);
  - with every buffer released the loop returns to the baseline working
    set.

This regression-gates the long-training-loop finalizer-lag class that the
release discipline exists to prevent. Runs on the CPU stub (no GPU
required) through the shared allocator-counter parity.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant