Add a leak-bound allocator stress regression for long eager CUDA loops #8
Open
NicolasRouquette wants to merge 1 commit into
Add `runLeakBoundStress` to the CUDA stress suite. It drives a small
eager workload (full/add/mul/reduceSum) through the explicit
`Buffer.release` discipline for two equal blocks of steps and asserts the
device working set is bounded independent of step count, using the
allocator telemetry that already exists (`Buffer.allocatorStats`):
- each step freed exactly its allocations, so the release calls fired
and were not eliminated as dead code;
- net live allocations and live bytes after 2x the steps match the 1x
snapshot (the leak-bound invariant);
- with every buffer released the loop returns to the baseline working
set.
This regression-gates the long-training-loop finalizer-lag class that the
release discipline exists to prevent. Runs on the CPU stub (no GPU
required) through the shared allocator-counter parity.
Summary
Adds `runLeakBoundStress` to the CUDA runtime stress suite (`NN/Tests/Runtime/Cuda/Stress.lean`), a regression test that turns the existing allocator telemetry into an asserted leak-bound invariant for long eager loops.
Large CUDA device buffers are wrapped as Lean external objects, so they are freed by
reference-counting finalizers. In long training loops those finalizers can lag, which is
exactly why the runtime carries an explicit release discipline
(`Buffer.release` / `releaseThen` / `releaseManyThen` / tape cleanup). The runtime already exposes
per-allocator counters (`Buffer.allocatorStats`: live/peak bytes, alloc/free counts), but
nothing asserted an invariant from them. This test closes that gap.
What it checks
A small eager workload (`full` → `add` → `mul` → `reduceSum`) is run through the explicit
release discipline for two equal blocks of steps. It then asserts:
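- Each step freed exactly its allocations (checked via the release return codes), which also keeps Lean from eliminating the frees as dead code.
- Net live allocations and live bytes after 2× the steps match the 1× snapshot. A per-step leak would grow linearly and fail here. This is the leak-bound invariant.
- With every buffer released, live allocations and live bytes return to the pre-loop baseline.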
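The three checks above can be illustrated on a toy model. The following is a minimal Lean 4 sketch, not the code in this PR: `ToyStats`, `AllocM`, `toyAlloc`, `toyRelease`, `toyStep`, `toyBlock`, and `leakBoundHolds` are hypothetical names, the counters live in a plain state monad rather than in the runtime, and the real `Buffer.allocatorStats` fields and `Buffer.release` signatures may well differ.

```lean
/-- Toy allocator counters. Hypothetical stand-in for the runtime telemetry;
    the real `Buffer.allocatorStats` may expose different fields. -/
structure ToyStats where
  liveAllocs : Nat := 0
  liveBytes  : Nat := 0
  allocCount : Nat := 0
  freeCount  : Nat := 0

/-- State monad carrying the toy counters (no device, no FFI). -/
abbrev AllocM := StateM ToyStats

/-- Record an allocation of `bytes`; the size doubles as a stand-in handle. -/
def toyAlloc (bytes : Nat) : AllocM Nat := do
  modify fun s =>
    { s with
      liveAllocs := s.liveAllocs + 1,
      liveBytes  := s.liveBytes + bytes,
      allocCount := s.allocCount + 1 }
  pure bytes

/-- Record an explicit release of a buffer of `bytes`. -/
def toyRelease (bytes : Nat) : AllocM Unit :=
  modify fun s =>
    { s with
      liveAllocs := s.liveAllocs - 1,
      liveBytes  := s.liveBytes - bytes,
      freeCount  := s.freeCount + 1 }

/-- One step of four allocations, mirroring full → add → mul → reduceSum.
    With `leak := true` the last buffer is never released. -/
def toyStep (leak : Bool) : AllocM Unit := do
  let a ← toyAlloc 4096  -- full
  let b ← toyAlloc 4096  -- add
  let c ← toyAlloc 4096  -- mul
  let d ← toyAlloc 8     -- reduceSum
  toyRelease a
  toyRelease b
  toyRelease c
  unless leak do
    toyRelease d

/-- Run `steps` consecutive steps. -/
def toyBlock (leak : Bool) (steps : Nat) : AllocM Unit := do
  for _ in [0:steps] do
    toyStep leak

/-- Snapshot after 1× and 2× the steps, then compare against each other
    and against the pre-loop baseline. -/
def leakBoundHolds (leak : Bool) (steps : Nat) : Bool :=
  let prog : AllocM Bool := do
    let base ← get
    toyBlock leak steps
    let s1 ← get
    toyBlock leak steps
    let s2 ← get
    pure (s1.liveAllocs == s2.liveAllocs
          && s1.liveBytes == s2.liveBytes
          && s2.liveAllocs == base.liveAllocs
          && s2.liveBytes == base.liveBytes
          && s2.allocCount == s2.freeCount)
  Id.run (prog.run' {})

#eval leakBoundHolds false 100  -- every buffer released each step
#eval leakBoundHolds true 100   -- one buffer leaked per step
```

In this model the clean step leaves the live counters unchanged between the two snapshots, while the leaking variant adds one live allocation per step, so the 2× snapshot no longer matches the 1× snapshot. Comparing two snapshots rather than checking an absolute threshold keeps the check independent of how large the steady-state working set happens to be.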
Why
It regression-gates the long-training-loop finalizer-lag class that the release discipline
exists to prevent: if a future op allocates without releasing, the working set grows with
step count and this test fails.
Notes
- Runs on the CPU stub (no GPU required), so it works in ordinary CI; under `-K cuda=true` it exercises the real CUDA allocator.
- One new `def leakStep`, one new `def runLeakBoundStress`, wired into `Stress.run`. No production code changed.
Observed locally (`lake exe nn_tests_suite`): working set flat across step count.
🤖 Generated with Claude Code