Add a leak-bound allocator stress regression for long eager CUDA loops by NicolasRouquette · Pull Request #8 · lean-dojo/TorchLean

NicolasRouquette · 2026-06-23T04:54:42Z

Summary

Adds runLeakBoundStress to the CUDA runtime stress suite
(NN/Tests/Runtime/Cuda/Stress.lean) — a regression test that turns the existing
allocator telemetry into an asserted leak-bound invariant for long eager loops.

Large CUDA device buffers are wrapped as Lean external objects, so they are freed by
reference-counting finalizers. In long training loops those finalizers can lag, which is
exactly why the runtime carries an explicit release discipline (Buffer.release /
releaseThen / releaseManyThen / tape cleanup). The runtime already exposes
per-allocator counters (Buffer.allocatorStats: live/peak bytes, alloc/free counts), but
nothing asserted an invariant from them. This test closes that gap.

What it checks

A small eager workload (full → add → mul → reduceSum) is run through the explicit
release discipline for two equal blocks of steps. It then asserts:

The release calls fired — each step freed exactly its allocations (sum of the
release return codes), which also keeps Lean from eliminating the frees as dead code.
Bounded working set — net live allocations and live bytes after 2× the steps match
the 1× snapshot. A per-step leak would grow linearly and fail here. This is the
leak-bound invariant.
Returns to baseline — with every buffer released, the loop's net live allocations
return to the pre-loop baseline.

Why

It regression-gates the long-training-loop finalizer-lag class that the release discipline
exists to prevent: if a future op allocates without releasing, the working set grows with
step count and this test fails.

Notes

Runs on the CPU stub (no GPU required) via the shared allocator-counter parity, so it
works in ordinary CI; under -K cuda=true it exercises the real CUDA allocator.
Additive only: one new def leakStep, one new def runLeakBoundStress, wired into
Stress.run. No production code changed.

Observed locally (lake exe nn_tests_suite), working set flat across step count:

baseline:        live=0.080238 MiB  allocs=954   frees=476    net=478
after 128 steps: live=0.080238 MiB  allocs=1594  frees=1116   net=478
after 256 steps: live=0.080238 MiB  allocs=2234  frees=1756   net=478

🤖 Generated with Claude Code

Add `runLeakBoundStress` to the CUDA stress suite. It drives a small eager workload (full/add/mul/reduceSum) through the explicit `Buffer.release` discipline for two equal blocks of steps and asserts the device working set is bounded independent of step count, using the allocator telemetry that already exists (`Buffer.allocatorStats`): - each step freed exactly its allocations, so the release calls fired and were not eliminated as dead code; - net live allocations and live bytes after 2x the steps match the 1x snapshot (the leak-bound invariant); - with every buffer released the loop returns to the baseline working set. This regression-gates the long-training-loop finalizer-lag class that the release discipline exists to prevent. Runs on the CPU stub (no GPU required) through the shared allocator-counter parity.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add a leak-bound allocator stress regression for long eager CUDA loops#8

Add a leak-bound allocator stress regression for long eager CUDA loops#8
NicolasRouquette wants to merge 1 commit into
lean-dojo:mainfrom
NicolasRouquette:cuda-allocator-leak-bound-stress

NicolasRouquette commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

NicolasRouquette commented Jun 23, 2026

Summary

What it checks

Why

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant