feat(extraction): index CUDA <<<grid,block>>> kernel-launch call edges#534
Open
QuantFunc wants to merge 1 commit into
Open
feat(extraction): index CUDA <<<grid,block>>> kernel-launch call edges#534QuantFunc wants to merge 1 commit into
<<<grid,block>>> kernel-launch call edges#534QuantFunc wants to merge 1 commit into
Conversation
…s call edges
tree-sitter-cpp parses most of CUDA (function defs, classes, templates,
__global__/__device__/__host__ qualifiers absorbed as attributes), so .cu/.cuh
files already get rich extraction once they're mapped to the cpp grammar.
What it does NOT parse is the CUDA-specific kernel-launch operator:
kernel_name<TemplateArgs><<<grid, block, smem, stream>>>(args);
Without it, `codegraph callers <kernel>` returned only the in-device callers
(other __device__ helpers in the same compilation unit) and missed every host
launch site — a fundamental edge in any CUDA codebase.
This adds two pieces:
1. `EXTENSION_MAP` now maps `.cu` / `.cuh` → `cpp` (was previously left as
unsupported and silently skipped).
2. A new post-processor `cuda-kernel-launch-postprocess.ts` runs after the
tree-sitter pass on .cu/.cuh sources. It:
- masks comments and string/char literals so they can't trigger false
matches on `<<<` inside docs or printf templates;
- scans for every `<<<...>>>` launch operator, then back-scans to
resolve the kernel name across a single level of optional template
args (`kernel<half><<<...>>>`);
- forward-verifies that a call `(...)` tail follows the launch config
(rejects stray `<<<` that aren't real launches);
- looks up the smallest enclosing function/method node and attaches
one `UnresolvedReference { referenceKind: 'calls' }` per launch site
so the existing cross-file resolver turns it into a proper edge.
End-to-end on a real CUDA codebase (QuantFunc, 5,600 files, 268k nodes, 460k
edges — 49 .cu + 19 .cuh):
before: callers(rope_interleaved_kernel) → ∅
after: callers(rope_interleaved_kernel) → apply_rope_interleaved_inplace
(host wrapper, line 68) ✓
before: callers(fp8_amax_reduce_kernel) → ∅
after: callers(fp8_amax_reduce_kernel) → fp8_conv_cudnn_nhwc ✓
fp8_conv_im2col_cublaslt_nhwc ✓
Coverage is intentionally narrow: this resolves the host→__global__ edge.
In-kernel device-to-device calls (`__device__ __forceinline__ to_float<T>(...)`
invoked from a kernel body) were already captured by tree-sitter-cpp's normal
call-expression handling, so they keep working without changes.
Test coverage: 17 unit tests covering the common shapes (untyped, templated,
4-arg with stream, qualified-name, multi-launch-per-file) plus the negative
cases historical grep patterns trip on (`std::cout << x`, `<<<` inside line/
block comments, `<<<` inside string literals, malformed launches without a
trailing call). Plus end-to-end extractFromSource on synthetic .cu/.cpp blobs
to confirm dispatch wiring + .cpp non-regression.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tree-sitter-cpp parses most of CUDA (function defs, classes, templates,
__global__/__device__/__host__qualifiers absorbed as attributes), but it does NOT model the CUDA-specific kernel-launch operator:Without that edge,
codegraph callers <kernel>returns only the in-device callers (other__device__helpers in the same compilation unit) and misses every host launch site — a fundamental edge in any CUDA codebase.This PR adds:
EXTENSION_MAPnow maps.cu/.cuh→cpp(previously unsupported and silently skipped).cuda-kernel-launch-postprocess.tsruns after the tree-sitter pass on.cu/.cuhsources:<<<inside docs/printf strings can't false-match;<<<...>>>, then back-scans to resolve the kernel name across a single level of optional template args (kernel<half><<<...>>>);(...)tail follows the launch config (rejects stray<<<that aren't real launches);UnresolvedReference { referenceKind: 'calls' }per launch site so the existing cross-file resolver turns it into a proper edge.Before / after on a real CUDA codebase
Validated on a CUDA-heavy ML inference engine (5,600 files, 268k nodes, 460k edges, 49
.cu+ 19.cuh):callers(rope_interleaved_kernel)apply_rope_interleaved_inplace(host wrapper,.cu:68) ✓callers(modulate_ada_ln_kernel)modulate_ada_ln(.cu:114) ✓callers(groupnorm_kernel)groupnorm(.cu:133) ✓callers(fp8_amax_reduce_kernel)fp8_conv_cudnn_nhwc✓ +fp8_conv_im2col_cublaslt_nhwc✓The dual-callsite case (a single kernel launched from two host paths) is captured correctly.
What is intentionally NOT in scope
In-kernel device-to-device calls (
__device__ __forceinline__ to_float<T>(...)invoked from a kernel body) were already captured by tree-sitter-cpp's normal call-expression handling.callers(to_float)returns the 20 expected sites without changes. This PR is narrowly the host→__global__boundary that the grammar genuinely doesn't see.Test plan
__tests__/cuda-kernel-launch.test.tscovering:kernel<T>), 4-arg (withsmem, stream), qualified-name (ns::Cls::method)std::cout << x,<<<inside////* */comments,<<<inside string literals, malformed launches without trailing(extractFromSourceon a synthetic.cuconfirming dispatch + host-attribution.cppconfirms ordinary C++ call extraction is unaffected<<<(fast guard returns the sameExtractionResultobject reference — verified by test)🤖 Generated with Claude Code