selection_kernel_fusion in issue357 #413
Greptile Summary
This PR adds an optional
Confidence Score: 5/5
Safe to merge: the fused mask path is correctly implemented end-to-end, and previously raised issues (merge conflict, mask/key length mismatch, double-erase) are all resolved in the current revision.
Both call sites now pass the full, unfiltered key arrays with a correctly-sized admit_mask, so the kernel's per-element mask check aligns with the right key at every slot. The CUDA kernel correctly null-guards indices before writing -1 for skipped slots, and the single erase call per admission path replaces the previous pre-filtered approach without introducing a second erase on the same keys.
No files require special attention. The benchmark accesses internal model attributes that could break on API changes, but that is expected for a benchmark script.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant PY as Python (batched_dynamicemb_function)
participant AC as admission_counter.erase()
participant HT as scored_hashtable.erase()
participant CU as table_erase() C++
participant KN as table_erase_kernel CUDA
note over PY: _prefetch_cache_path / _prefetch_hbm_direct_path
PY->>AC: "erase(keys[N], table_ids[N], mask=admit_mask[N])"
AC->>HT: "table_.erase(keys, table_ids, mask=mask)"
HT->>CU: "table_erase(..., indices=None, mask=mask)"
CU->>KN: "launch kernel batch=N keys table_ids indices=nullptr mask=mask_ptr"
loop for i in 0 to N
KN->>KN: if mask and not mask[i] write -1 continue
KN->>KN: else probe bucket erase key if found
end
KN-->>CU: done
CU-->>HT: return
HT-->>AC: return
AC-->>PY: return
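
The per-element branch in the loop above can be modelled on the host as a rough sketch. This is not the project's kernel: `table_erase_sim` is a hypothetical name, the table is stood in for by a plain dict keyed on `(table_id, key)`, and the value written to `indices` for a key that is found is an assumption (the review only confirms that skipped slots get -1 and that `indices` is null-guarded).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: (table_id, key) -> slot. The real table is a GPU hash table.
Table = Dict[Tuple[int, int], int]


def table_erase_sim(
    table: Table,
    keys: List[int],
    table_ids: List[int],
    indices: Optional[List[int]] = None,
    mask: Optional[List[bool]] = None,
) -> None:
    """Sequential stand-in for the kernel: one loop iteration per element."""
    for i in range(len(keys)):
        if mask is not None and not mask[i]:
            # Skipped slot: report -1, but only if the caller passed indices.
            if indices is not None:
                indices[i] = -1
            continue
        # Probe and erase; pop() returns None when the key is absent.
        slot = table.pop((table_ids[i], keys[i]), None)
        if indices is not None:
            # Assumption: a found key reports its freed slot, a miss reports -1.
            indices[i] = -1 if slot is None else slot


table = {(0, 10): 0, (0, 11): 1, (1, 10): 2}
out = [0, 0, 0]
table_erase_sim(table, [10, 11, 10], [0, 0, 1], indices=out, mask=[True, False, True])
```

With `indices=None`, as in the call chain shown in the diagram, the skipped branch writes nothing and simply moves on to the next element.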
Reviews (8): Last reviewed commit: "fix comment description"

Description
Checklist
Closes #357
Add an optional bool mask parameter to the table_erase CUDA kernel. When mask is provided, only positions where mask is True are erased; positions where it is False are skipped via an early continue in the kernel. This fuses the pre-selection (keys[mask]) into the erase pass, eliminating a separate tensor allocation and memory copy.
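
To illustrate why the fused path should be a drop-in replacement, here is a minimal pure-Python sketch comparing the two approaches. The function names `erase_prefiltered` and `erase_fused` are hypothetical, a set of `(table_id, key)` pairs stands in for the hash table, and the length check is an illustrative guard rather than the project's actual validation.

```python
from typing import List, Set, Tuple

Table = Set[Tuple[int, int]]  # hypothetical stand-in for the hash table


def erase_prefiltered(
    table: Table, keys: List[int], table_ids: List[int], mask: List[bool]
) -> None:
    """Old shape: materialize keys[mask] and table_ids[mask], then erase."""
    sel_keys = [k for k, m in zip(keys, mask) if m]       # extra allocation + copy
    sel_ids = [t for t, m in zip(table_ids, mask) if m]   # extra allocation + copy
    for t, k in zip(sel_ids, sel_keys):
        table.discard((t, k))


def erase_fused(
    table: Table, keys: List[int], table_ids: List[int], mask: List[bool]
) -> None:
    """New shape: walk the full arrays once and skip masked-out positions."""
    if not (len(keys) == len(table_ids) == len(mask)):
        # The mask must line up with the unfiltered keys, slot for slot.
        raise ValueError("keys, table_ids and mask must have the same length")
    for k, t, m in zip(keys, table_ids, mask):
        if not m:
            continue  # mirrors the early continue in the kernel
        table.discard((t, k))


keys, table_ids = [1, 2, 1], [0, 0, 1]
admit_mask = [True, False, True]
a: Table = {(0, 1), (0, 2), (1, 1)}
b: Table = set(a)
erase_prefiltered(a, keys, table_ids, admit_mask)
erase_fused(b, keys, table_ids, admit_mask)
```

Both tables end up holding the same entries; the fused version just never builds the intermediate filtered lists, which is the allocation and copy the PR removes on the GPU side.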