feat(extraction): index CUDA `<<<grid,block>>>` kernel-launch call edges by QuantFunc · Pull Request #534 · colbymchenry/codegraph

QuantFunc · 2026-05-28T15:36:41Z

Summary

tree-sitter-cpp parses most of CUDA (function defs, classes, templates, __global__/__device__/__host__ qualifiers absorbed as attributes), but it does NOT model the CUDA-specific kernel-launch operator:

kernel_name<TemplateArgs><<<grid, block, smem, stream>>>(args);

Without that edge, codegraph callers <kernel> returns only the in-device callers (other __device__ helpers in the same compilation unit) and misses every host launch site — a fundamental edge in any CUDA codebase.

This PR adds:

EXTENSION_MAP now maps .cu / .cuh → cpp (previously unsupported and silently skipped).
A new post-processor cuda-kernel-launch-postprocess.ts runs after the tree-sitter pass on .cu/.cuh sources:
- masks comments + string/char literals so <<< inside docs/printf strings can't false-match;
- scans for every <<<...>>>, then back-scans to resolve the kernel name across a single level of optional template args (kernel<half><<<...>>>);
- forward-verifies that a call (...) tail follows the launch config (rejects stray <<< that aren't real launches);
- looks up the smallest enclosing function/method node, attaches one UnresolvedReference { referenceKind: 'calls' } per launch site so the existing cross-file resolver turns it into a proper edge.

Before / after on a real CUDA codebase

Validated on a CUDA-heavy ML inference engine (5,600 files, 268k nodes, 460k edges, 49 .cu + 19 .cuh):

Query	Before	After
`callers(rope_interleaved_kernel)`	∅	`apply_rope_interleaved_inplace` (host wrapper, `.cu:68`) ✓
`callers(modulate_ada_ln_kernel)`	∅	`modulate_ada_ln` (`.cu:114`) ✓
`callers(groupnorm_kernel)`	∅	`groupnorm` (`.cu:133`) ✓
`callers(fp8_amax_reduce_kernel)`	∅	`fp8_conv_cudnn_nhwc` ✓ + `fp8_conv_im2col_cublaslt_nhwc` ✓

The dual-callsite case (a single kernel launched from two host paths) is captured correctly.

What is intentionally NOT in scope

In-kernel device-to-device calls (__device__ __forceinline__ to_float<T>(...) invoked from a kernel body) were already captured by tree-sitter-cpp's normal call-expression handling. callers(to_float) returns the 20 expected sites without changes. This PR is narrowly the host→__global__ boundary that the grammar genuinely doesn't see.

Test plan

17 unit tests in __tests__/cuda-kernel-launch.test.ts covering:
- common shapes: untyped launch, templated launch (kernel<T>), 4-arg (with smem, stream), qualified-name (ns::Cls::method)
- multi-launch per file, no-whitespace edge case
- negative cases historical grep patterns trip on: std::cout << x, <<< inside ////* */ comments, <<< inside string literals, malformed launches without trailing (
- end-to-end extractFromSource on a synthetic .cu confirming dispatch + host-attribution
- non-regression: a synthetic .cpp confirms ordinary C++ call extraction is unaffected
Reindexed a 5,600-file real-world CUDA project; the 4 cross-host/device-edge queries above all return the expected sites
No additional edges emitted when source contains no <<< (fast guard returns the same ExtractionResult object reference — verified by test)

🤖 Generated with Claude Code

…s call edges tree-sitter-cpp parses most of CUDA (function defs, classes, templates, __global__/__device__/__host__ qualifiers absorbed as attributes), so .cu/.cuh files already get rich extraction once they're mapped to the cpp grammar. What it does NOT parse is the CUDA-specific kernel-launch operator: kernel_name<TemplateArgs><<<grid, block, smem, stream>>>(args); Without it, `codegraph callers <kernel>` returned only the in-device callers (other __device__ helpers in the same compilation unit) and missed every host launch site — a fundamental edge in any CUDA codebase. This adds two pieces: 1. `EXTENSION_MAP` now maps `.cu` / `.cuh` → `cpp` (was previously left as unsupported and silently skipped). 2. A new post-processor `cuda-kernel-launch-postprocess.ts` runs after the tree-sitter pass on .cu/.cuh sources. It: - masks comments and string/char literals so they can't trigger false matches on `<<<` inside docs or printf templates; - scans for every `<<<...>>>` launch operator, then back-scans to resolve the kernel name across a single level of optional template args (`kernel<half><<<...>>>`); - forward-verifies that a call `(...)` tail follows the launch config (rejects stray `<<<` that aren't real launches); - looks up the smallest enclosing function/method node and attaches one `UnresolvedReference { referenceKind: 'calls' }` per launch site so the existing cross-file resolver turns it into a proper edge. End-to-end on a real CUDA codebase (QuantFunc, 5,600 files, 268k nodes, 460k edges — 49 .cu + 19 .cuh): before: callers(rope_interleaved_kernel) → ∅ after: callers(rope_interleaved_kernel) → apply_rope_interleaved_inplace (host wrapper, line 68) ✓ before: callers(fp8_amax_reduce_kernel) → ∅ after: callers(fp8_amax_reduce_kernel) → fp8_conv_cudnn_nhwc ✓ fp8_conv_im2col_cublaslt_nhwc ✓ Coverage is intentionally narrow: this resolves the host→__global__ edge. In-kernel device-to-device calls (`__device__ __forceinline__ to_float<T>(...)` invoked from a kernel body) were already captured by tree-sitter-cpp's normal call-expression handling, so they keep working without changes. Test coverage: 17 unit tests covering the common shapes (untyped, templated, 4-arg with stream, qualified-name, multi-launch-per-file) plus the negative cases historical grep patterns trip on (`std::cout << x`, `<<<` inside line/ block comments, `<<<` inside string literals, malformed launches without a trailing call). Plus end-to-end extractFromSource on synthetic .cu/.cpp blobs to confirm dispatch wiring + .cpp non-regression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(extraction): index CUDA `<<<grid,block>>>` kernel-launch call edges#534

feat(extraction): index CUDA `<<<grid,block>>>` kernel-launch call edges#534
QuantFunc wants to merge 1 commit into
colbymchenry:mainfrom
QuantFunc:feat/cuda-kernel-launch

QuantFunc commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

QuantFunc commented May 28, 2026

Summary

Before / after on a real CUDA codebase

What is intentionally NOT in scope

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant