
fix: fix CUDA elementwise dynamic block shared memory #172

Merged
kilinchange merged 1 commit into master from fix_batch_size on Jun 12, 2026
@chen2021673 chen2021673 commented Jun 9, 2026

Contributor

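Summary

Fixes a bug in the CUDA elementwise backward path: after the block size became a runtime choice, the shared memory size was still computed from the old template parameter BLOCK_SIZE.

Background

LaunchKernel now selects the block size dynamically from the tensor size via ChooseBlockSize(). However, part of the elementwise backward launch logic still used the template parameter BLOCK_SIZE to compute the dynamic shared memory size.

As a result, the actual launch configuration and the shared memory allocation can diverge: for a large tensor the runtime block size may be 512, while shared memory is still sized for BLOCK_SIZE == 256. The block-reduction kernel used by bf16/half broadcast backward indexes shared memory by blockDim.x, so the allocation is too small and the kernel writes out of bounds.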
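
The size mismatch described above can be made concrete with a little arithmetic. The following is a minimal host-only sketch, not the project's code: it assumes the reduction keeps one float partial sum per thread, and `kTemplateBlockSize`, `SharedBytes`, and `ShortfallBytes` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-in for the stale compile-time template parameter.
constexpr std::size_t kTemplateBlockSize = 256;

// Dynamic shared memory for a block reduction that keeps one float partial
// sum per thread (assumed layout, for illustration only).
constexpr std::size_t SharedBytes(std::size_t threads_per_block) {
    return threads_per_block * sizeof(float);
}

// Bytes by which a kernel indexing shared memory by the runtime block size
// (blockDim.x) would overrun an allocation sized from the template parameter.
constexpr std::size_t ShortfallBytes(std::size_t runtime_block_size) {
    return SharedBytes(runtime_block_size) > SharedBytes(kTemplateBlockSize)
               ? SharedBytes(runtime_block_size) - SharedBytes(kTemplateBlockSize)
               : 0;
}
```

Under these assumptions a 512-thread block needs twice the bytes that were allocated for 256 threads, so the upper half of the threads would write past the end of the buffer.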

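Fix

  • Remove the BLOCK_SIZE template parameter from the elementwise launch helper, since it is no longer reliable, and add ChooseBlockDims(), which builds the actual launch block from ChooseBlockSize().
  • Use the runtime-selected block dims in both the forward and backward launch paths.
  • Compute the dynamic shared memory size from block.x, the block actually passed to the launch lambda.
  • Add a large-block regression test for bf16 broadcast backward.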
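
The idea behind the fix list above is that the block, the grid, and the shared memory size all derive from one runtime value. A rough host-only sketch follows, under several assumptions: `Dim3` stands in for CUDA's `dim3`, the threshold inside `ChooseBlockSize` is invented (the PR only says the choice depends on tensor size), and `LaunchConfig` and `MakeLaunchConfig` are hypothetical helpers rather than names from the repository.

```cpp
#include <cstddef>

// Minimal stand-in for CUDA's dim3 so the sketch compiles without nvcc.
struct Dim3 {
    unsigned x = 1, y = 1, z = 1;
};

// Hypothetical bundle of everything a kernel launch needs.
struct LaunchConfig {
    Dim3 grid;
    Dim3 block;
    std::size_t shared_bytes = 0;
};

// Assumed heuristic: larger tensors get a larger block. The real rule may differ.
inline unsigned ChooseBlockSize(std::size_t num_elements) {
    return num_elements >= (std::size_t{1} << 20) ? 512u : 256u;
}

// Build the actual launch block from the runtime choice.
inline Dim3 ChooseBlockDims(std::size_t num_elements) {
    Dim3 block;
    block.x = ChooseBlockSize(num_elements);
    return block;
}

// Derive the grid and the shared memory size from the same block, so they
// cannot drift apart (one float per thread is an assumed layout).
inline LaunchConfig MakeLaunchConfig(std::size_t num_elements) {
    LaunchConfig cfg;
    cfg.block = ChooseBlockDims(num_elements);
    cfg.grid.x = static_cast<unsigned>((num_elements + cfg.block.x - 1) / cfg.block.x);
    cfg.shared_bytes = cfg.block.x * sizeof(float);
    return cfg;
}
```

Because `shared_bytes` is computed from `cfg.block.x` instead of a compile-time constant, changing the heuristic in `ChooseBlockSize` changes the allocation with it.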

Regression test

New test:

AutogradElementwiseBackwardTest.BFloat16MulBroadcastBackwardLargeBlock

Scenario covered:

  • a: [512, 8192], bf16
  • b: [8192], bf16 broadcast
  • grad: [512, 8192], bf16
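
For reference, the gradient that a test with these shapes would check can be written as a plain CPU loop: for c = a * b with b broadcast across rows, the gradient of b at column j is the sum over rows of grad[i][j] * a[i][j]. This is only an illustrative sketch; it uses float instead of bf16, and `MulBroadcastBackwardB` is a hypothetical name, not the function under test.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for the gradient of b in c = a * b, where a and grad are
// [rows, cols] in row-major order and b is [cols], broadcast across rows.
// Uses float rather than bf16 to keep the sketch simple.
inline std::vector<float> MulBroadcastBackwardB(const std::vector<float>& a,
                                                const std::vector<float>& grad,
                                                std::size_t rows, std::size_t cols) {
    std::vector<float> grad_b(cols, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // Each column of b receives a contribution from every row.
            grad_b[j] += grad[i * cols + j] * a[i * cols + j];
        }
    }
    return grad_b;
}
```

With rows = 512 and cols = 8192, each output element is a reduction over 512 values, which is the reduction the GPU kernel performs in shared memory.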

Test results on the .62 machine:
[image]
[image]

Comment thread infini_train/src/kernels/cuda/elementwise.cu Outdated
@kilinchange

Collaborator

Also, please correct the commit message:
fix: fix CUDA elementwise dynamic block shared memory

@kilinchange

Collaborator
  1. Add test screenshots from node24;
  2. Fix the formatting.

@chen2021673

chen2021673 commented Jun 11, 2026

Contributor Author
> 1. Add test screenshots from node24;
> 2. Fix the formatting.

  1. node24 test screenshots:
[image] [image]
  2. Fixed.

@kilinchange kilinchange merged commit 456e2fd into master Jun 12, 2026
2 checks passed
@kilinchange kilinchange deleted the fix_batch_size branch June 12, 2026 01:35