Slowness of the Rebaseline.cxx in the SPNG branch when running with the GPU #41

Description

@physnerds

ISSUE Slowness of the Rebaseline.cxx in the SPNG branch when running with the GPU

This issue concerns this implementation of Rebaseline.cxx. In the SPNG signal-processing workflow (for example, adc-to-spng.jsonnet), the Rebaseliner node uses this method to apply ROIs in the different views of the APA, as shown in adc-to-spng.pdf.

Possible Cause of the Slowdown on the GPU

Tests were done on the wcgpu0 machine with 1 CPU core and 1 RTX 4090.

Running adc-to-spng.jsonnet on the CPU with 10 events took 92 seconds (wall), with a significant share of that time taken by the Rebaseliner node discussed above:

[00:13:47.637] I [ timer  ] Timer: 20.888 wall-sec, 23.102 core-sec, start: 2026-05-29 00:12:15, duration-sec: 2.083,  (WireCell::SPNG::Rebaseliner) "tpc0v2_applyroi_APPLY"
[00:13:47.637] I [ timer  ] Timer: 17.575 wall-sec, 19.705 core-sec, start: 2026-05-29 00:12:22, duration-sec: 1.757,  (WireCell::SPNG::Rebaseliner) "tpc0v0_applyroi_APPLY"
[00:13:47.637] I [ timer  ] Timer: 17.562 wall-sec, 19.585 core-sec, start: 2026-05-29 00:12:19, duration-sec: 1.765,  (WireCell::SPNG::Rebaseliner) "tpc0v1_applyroi_APPLY"
[00:13:47.637] I [ timer  ] Timer: 15.319 wall-sec, 283.092 core-sec, start: 2026-05-29 00:12:17, duration-sec: 1.529,  (WireCell::SPNG::CellViews) "tpc0_PRE"

Running adc-to-spng.jsonnet on the GPU with 10 events took 814 seconds (wall), with most of that time taken by the Rebaseliner node:

[00:05:16.238] I [ timer  ] Timer: 301.357 wall-sec, 301.490 core-sec, start: 2026-05-28 23:51:42, duration-sec: 30.073,  (WireCell::SPNG::Rebaseliner) "tpc0v2_applyroi_APPLY"
[00:05:16.238] I [ timer  ] Timer: 251.780 wall-sec, 251.894 core-sec, start: 2026-05-28 23:52:38, duration-sec: 25.194,  (WireCell::SPNG::Rebaseliner) "tpc0v0_applyroi_APPLY"
[00:05:16.238] I [ timer  ] Timer: 251.414 wall-sec, 251.532 core-sec, start: 2026-05-28 23:52:12, duration-sec: 25.151,  (WireCell::SPNG::Rebaseliner) "tpc0v1_applyroi_APPLY"

In the rebaseline_zero function of the original Rebaseline.cxx, these lines here and here seem to be inefficient on the GPU. The suspected reason is that each tensor element is compared on the CPU instead of on the GPU, which requires a memcpy from GPU to CPU and a synchronization of the operations.

However, since this issue was not noticed before, a proper investigation of the CUDA driver (13.x before vs 12.x now) is needed.

Possible Solution

We can do the zero comparison of the torch tensor elements on the device and copy the result over to the CPU in bulk, as done here, which replaces the original. This method is used only when the signal processing is done on the GPU. If the signal processing runs on the CPU only, the current implementation follows the original method.
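The difference between the two access patterns can be illustrated with a toy model. This is only a sketch under stated assumptions: `MockDeviceBuffer`, `read_one`, `nonzero_mask_bulk` and `mask_per_element` are hypothetical names invented for this illustration, they are not part of torch or of Rebaseline.cxx, and the model only counts transfers rather than measuring real GPU time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device-resident buffer. NOT a torch API:
// it only counts how many device-to-host transfers an access pattern triggers.
struct MockDeviceBuffer {
    std::vector<float> data;
    mutable std::size_t transfers = 0;

    // Reading one element from the host side is modeled as one transfer.
    float read_one(std::size_t i) const {
        assert(i < data.size());
        ++transfers;
        return data[i];
    }

    // The comparison is modeled as running "on the device"; the whole
    // mask is then copied back to the host in a single transfer.
    std::vector<char> nonzero_mask_bulk() const {
        ++transfers;
        std::vector<char> mask(data.size());
        for (std::size_t i = 0; i < data.size(); ++i) {
            mask[i] = (data[i] != 0.0f) ? 1 : 0;
        }
        return mask;
    }
};

// Per-element pattern: the host asks for every value and compares it itself,
// so the transfer count grows with the number of elements.
std::vector<char> mask_per_element(const MockDeviceBuffer& buf) {
    std::vector<char> mask(buf.data.size());
    for (std::size_t i = 0; i < buf.data.size(); ++i) {
        mask[i] = (buf.read_one(i) != 0.0f) ? 1 : 0;
    }
    return mask;
}
```

In this model a row of 6000 samples costs 6000 transfers with the per-element pattern and 1 transfer with the bulk pattern, while both produce the same mask. Whether the real slowdown scales this way is the hypothesis of this issue, not something the toy model proves.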

With this change, the total run time on the GPU was 10.9 seconds. Here is the timer printout for the same nodes after the modification:

[17:05:20.065] I [ timer  ] Timer: 3.175 wall-sec, 3.174 core-sec, start: 2026-05-28 17:05:08, duration-sec: 0.578,  (WireCell::Sio::FrameFileSource) "drifted-cosmics_10-frame.npz"
[17:05:20.065] I [ timer  ] Timer: 2.969 wall-sec, 2.988 core-sec, start: 2026-05-28 17:05:10, duration-sec: 0.292,  (WireCell::Sio::FrameFileSink) "adc-to-spng.npz"
[17:05:20.065] I [ timer  ] Timer: 1.768 wall-sec, 3.206 core-sec, start: 2026-05-28 17:05:09, duration-sec: 0.13,  (WireCell::SPNG::Transform) "tpc0v1_dnnroi_pre_FWD"
[17:05:20.065] I [ timer  ] Timer: 0.482 wall-sec, 0.487 core-sec, start: 2026-05-28 17:05:10, duration-sec: 0.074,  (WireCell::SPNG::Rebaseliner) "tpc0v0_applyroi_APPLY"
[17:05:20.065] I [ timer  ] Timer: 0.471 wall-sec, 0.479 core-sec, start: 2026-05-28 17:05:10, duration-sec: 0.058,  (WireCell::SPNG::Rebaseliner) "tpc0v1_applyroi_APPLY"
[17:05:20.065] I [ timer  ] Timer: 0.433 wall-sec, 0.437 core-sec, start: 2026-05-28 17:05:09, duration-sec: 0.19,  (WireCell::SPNG::TensorForward) "tpc0v1_dnnroi_fwd_FWD"
[17:05:20.065] I [ timer  ] Timer: 0.326 wall-sec, 8.796 core-sec, start: 2026-05-28 17:05:09, duration-sec: 0.031,  (WireCell::SPNG::FrameToTdm) "tpc0_TOTDM"
[17:05:20.065] I [ timer  ] Timer: 0.283 wall-sec, 0.286 core-sec, start: 2026-05-28 17:05:10, duration-sec: 0.04,  (WireCell::SPNG::TensorForward) "tpc0v0_dnnroi_fwd_FWD"
[17:05:20.065] I [ timer  ] Timer: 0.246 wall-sec, 0.250 core-sec, start: 2026-05-28 17:05:09, duration-sec: 0.037,  (WireCell::SPNG::Rebaseliner) "tpc0v2_applyroi_APPLY"
[17:05:20.065] I [ timer  ] Timer: 0.148 wall-sec, 0.152 core-s

There was no change in the CPU run time, since the implementation for the CPU is unchanged.
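For reference, the ratios implied by the three wall times quoted in this issue (92 s on the CPU, 814 s on the GPU before the change, 10.9 s on the GPU after it) can be worked out with a small helper. The function names `wall_ratio` and `print_ratios` are made up for this sketch; only the three numbers come from the runs above.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of two wall-clock times; a value above 1 means `slow_sec` took longer.
double wall_ratio(double slow_sec, double fast_sec) {
    assert(fast_sec > 0.0);
    return slow_sec / fast_sec;
}

// Print the ratios for the total wall times quoted in this issue (10 events).
void print_ratios() {
    const double cpu_sec = 92.0;        // CPU run
    const double gpu_orig_sec = 814.0;  // GPU run, original Rebaseline.cxx
    const double gpu_new_sec = 10.9;    // GPU run, modified Rebaseline.cxx

    std::printf("GPU original / CPU:     %.1f\n", wall_ratio(gpu_orig_sec, cpu_sec));
    std::printf("GPU original / GPU new: %.1f\n", wall_ratio(gpu_orig_sec, gpu_new_sec));
    std::printf("CPU / GPU new:          %.1f\n", wall_ratio(cpu_sec, gpu_new_sec));
}
```

That is, the original GPU run was roughly 8.8 times slower than the CPU run, and the modified GPU run is roughly 75 times faster than the original GPU run and about 8.4 times faster than the CPU run.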

Testing the Implementation

I have added test code that compares the CPU and GPU run times of the original (before modification) and modified implementations of the rebaseline function, using a tensor of shape (2560,6000). The test exercises the parts of Rebaseline.cxx that cause the significant delay when running on the GPU and compares them with the modification, both in compute time and in the resulting output.
The test compares the outputs of:

  1. Original implementation of Rebaseline.cxx when running with CPU
  2. Original implementation of Rebaseline.cxx when running with GPU
  3. New implementation of Rebaseline.cxx when running with CPU
  4. New implementation of Rebaseline.cxx when running with GPU

Here is the resulting output, which supports the observation and the possible solution discussed in this issue:

=== workflow-like full 2560x6000 dim=1 ===
time original CPU: 5.833142 s
time new CPU:      5.849403 s
time new device:   0.343339 s
time original device: 80.425722 s
original CPU: shape=[2560, 6000] dtype=float min=-796.681763 max=793.453308 mean=-0.012442 nonzero=1391236
new CPU: shape=[2560, 6000] dtype=float min=-796.681763 max=793.453308 mean=-0.012442 nonzero=1391236
new device: shape=[2560, 6000] dtype=float min=-796.681763 max=793.453308 mean=-0.012442 nonzero=1391236
original device: shape=[2560, 6000] dtype=float min=-796.681763 max=793.453308 mean=-0.012442 nonzero=1391236
max_abs(original CPU - new CPU):    0.000000
max_abs(original CPU - new device): 0.000000
max_abs(original CPU - original device): 0.000000

Note: Here "original" means the implementation of Rebaseline.cxx from:
https://github.com/WireCell/wire-cell-toolkit/commit/95da99676f9ad579da54adbea6b40ccf6f3b35d1#diff-f336204ad6d7172eda9af20f611637a11b4fa2a0cdf5a67e65f09f913a8aa587

Note: Here "new" means the implementation of Rebaseline.cxx from:
https://github.com/WireCell/wire-cell-toolkit/commit/9939e37ccfd27ba8f81fe331a27350da36347584

Note: The original implementation is timed on CUDA when CUDA is available.
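The max_abs lines in the output above report the largest element-wise absolute difference between two results. A minimal sketch of that kind of check, assuming both results have already been copied to the host as flat float arrays, could look like the following. The function name `max_abs_diff` is hypothetical and this is not the code used in the actual test.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute element-wise difference between two equally sized arrays.
// A result of exactly 0 means the two outputs are identical element by element.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

A value of 0.000000 for every pair, as reported above, indicates that the modification changes only the run time and not the result for this input.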
