Skip to content

feat(extraction): index CUDA <<<grid,block>>> kernel-launch call edges#534

Open
QuantFunc wants to merge 1 commit into
colbymchenry:mainfrom
QuantFunc:feat/cuda-kernel-launch
Open

feat(extraction): index CUDA <<<grid,block>>> kernel-launch call edges#534
QuantFunc wants to merge 1 commit into
colbymchenry:mainfrom
QuantFunc:feat/cuda-kernel-launch

Conversation

@QuantFunc
Copy link
Copy Markdown

Summary

tree-sitter-cpp parses most of CUDA (function defs, classes, templates, __global__/__device__/__host__ qualifiers absorbed as attributes), but it does NOT model the CUDA-specific kernel-launch operator:

kernel_name<TemplateArgs><<<grid, block, smem, stream>>>(args);

Without that edge, codegraph callers <kernel> returns only the in-device callers (other __device__ helpers in the same compilation unit) and misses every host launch site — a fundamental edge in any CUDA codebase.

This PR adds:

  1. EXTENSION_MAP now maps .cu / .cuhcpp (previously unsupported and silently skipped).
  2. A new post-processor cuda-kernel-launch-postprocess.ts runs after the tree-sitter pass on .cu/.cuh sources:
    • masks comments + string/char literals so <<< inside docs/printf strings can't false-match;
    • scans for every <<<...>>>, then back-scans to resolve the kernel name across a single level of optional template args (kernel<half><<<...>>>);
    • forward-verifies that a call (...) tail follows the launch config (rejects stray <<< that aren't real launches);
    • looks up the smallest enclosing function/method node, attaches one UnresolvedReference { referenceKind: 'calls' } per launch site so the existing cross-file resolver turns it into a proper edge.

Before / after on a real CUDA codebase

Validated on a CUDA-heavy ML inference engine (5,600 files, 268k nodes, 460k edges, 49 .cu + 19 .cuh):

Query Before After
callers(rope_interleaved_kernel) apply_rope_interleaved_inplace (host wrapper, .cu:68) ✓
callers(modulate_ada_ln_kernel) modulate_ada_ln (.cu:114) ✓
callers(groupnorm_kernel) groupnorm (.cu:133) ✓
callers(fp8_amax_reduce_kernel) fp8_conv_cudnn_nhwc ✓ + fp8_conv_im2col_cublaslt_nhwc

The dual-callsite case (a single kernel launched from two host paths) is captured correctly.

What is intentionally NOT in scope

In-kernel device-to-device calls (__device__ __forceinline__ to_float<T>(...) invoked from a kernel body) were already captured by tree-sitter-cpp's normal call-expression handling. callers(to_float) returns the 20 expected sites without changes. This PR is narrowly the host→__global__ boundary that the grammar genuinely doesn't see.

Test plan

  • 17 unit tests in __tests__/cuda-kernel-launch.test.ts covering:
    • common shapes: untyped launch, templated launch (kernel<T>), 4-arg (with smem, stream), qualified-name (ns::Cls::method)
    • multi-launch per file, no-whitespace edge case
    • negative cases historical grep patterns trip on: std::cout << x, <<< inside ////* */ comments, <<< inside string literals, malformed launches without trailing (
    • end-to-end extractFromSource on a synthetic .cu confirming dispatch + host-attribution
    • non-regression: a synthetic .cpp confirms ordinary C++ call extraction is unaffected
  • Reindexed a 5,600-file real-world CUDA project; the 4 cross-host/device-edge queries above all return the expected sites
  • No additional edges emitted when source contains no <<< (fast guard returns the same ExtractionResult object reference — verified by test)

🤖 Generated with Claude Code

…s call edges

tree-sitter-cpp parses most of CUDA (function defs, classes, templates,
__global__/__device__/__host__ qualifiers absorbed as attributes), so .cu/.cuh
files already get rich extraction once they're mapped to the cpp grammar.

What it does NOT parse is the CUDA-specific kernel-launch operator:

    kernel_name<TemplateArgs><<<grid, block, smem, stream>>>(args);

Without it, `codegraph callers <kernel>` returned only the in-device callers
(other __device__ helpers in the same compilation unit) and missed every host
launch site — a fundamental edge in any CUDA codebase.

This adds two pieces:

  1. `EXTENSION_MAP` now maps `.cu` / `.cuh` → `cpp` (was previously left as
     unsupported and silently skipped).

  2. A new post-processor `cuda-kernel-launch-postprocess.ts` runs after the
     tree-sitter pass on .cu/.cuh sources. It:
       - masks comments and string/char literals so they can't trigger false
         matches on `<<<` inside docs or printf templates;
       - scans for every `<<<...>>>` launch operator, then back-scans to
         resolve the kernel name across a single level of optional template
         args (`kernel<half><<<...>>>`);
       - forward-verifies that a call `(...)` tail follows the launch config
         (rejects stray `<<<` that aren't real launches);
       - looks up the smallest enclosing function/method node and attaches
         one `UnresolvedReference { referenceKind: 'calls' }` per launch site
         so the existing cross-file resolver turns it into a proper edge.

End-to-end on a real CUDA codebase (QuantFunc, 5,600 files, 268k nodes, 460k
edges — 49 .cu + 19 .cuh):

  before:  callers(rope_interleaved_kernel)  →  ∅
  after:   callers(rope_interleaved_kernel)  →  apply_rope_interleaved_inplace
                                                  (host wrapper, line 68) ✓

  before:  callers(fp8_amax_reduce_kernel)   →  ∅
  after:   callers(fp8_amax_reduce_kernel)   →  fp8_conv_cudnn_nhwc           ✓
                                                fp8_conv_im2col_cublaslt_nhwc ✓

Coverage is intentionally narrow: this resolves the host→__global__ edge.
In-kernel device-to-device calls (`__device__ __forceinline__ to_float<T>(...)`
invoked from a kernel body) were already captured by tree-sitter-cpp's normal
call-expression handling, so they keep working without changes.

Test coverage: 17 unit tests covering the common shapes (untyped, templated,
4-arg with stream, qualified-name, multi-launch-per-file) plus the negative
cases historical grep patterns trip on (`std::cout << x`, `<<<` inside line/
block comments, `<<<` inside string literals, malformed launches without a
trailing call). Plus end-to-end extractFromSource on synthetic .cu/.cpp blobs
to confirm dispatch wiring + .cpp non-regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant