Optimize Cube output with OpenMP formatting and async writing #7425

Open: Ulikin wants to merge 2 commits into deepmodeling:LTS from Ulikin:optimize-cube-output

Ulikin commented Jun 3, 2026

Summary

This PR optimizes Cube file output in write_cube() by reducing the overhead of formatting and writing large real-space grid data.

Main changes:

  • Replace repeated iostream-based Cube data formatting with preallocated buffers.
  • Parallelize Cube data formatting on the output rank with OpenMP.
  • Write Cube data in chunks while preserving the original data order.
  • Add an asynchronous writer with a bounded queue so that formatting the next chunk overlaps with writing the previous chunk to the file.
  • Keep the Cube header format, data precision, and output ordering unchanged.
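
As a rough illustration of the formatting change, the sketch below fills a preallocated buffer in parallel. It is not the code in this PR: `format_chunk`, the `%13.5e` field layout, and six values per line are assumptions chosen for the example. The point it shows is that fixed-width fields let every thread compute its own write offset, so the output order does not depend on thread scheduling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical layout: every value occupies exactly kFieldWidth characters.
// "%13.5e" never exceeds 13 characters for a double (sign, digits, 3-digit exponent).
constexpr int kFieldWidth = 13;
constexpr int kPrecision = 5;

// Format one chunk of grid values into a single preallocated string.
// A newline follows every `per_line` values and the last value of the chunk.
std::string format_chunk(const std::vector<double>& values, std::size_t per_line = 6)
{
    assert(per_line > 0);
    const std::size_t n = values.size();
    const std::size_t newlines = (n + per_line - 1) / per_line;
    std::string out(n * kFieldWidth + newlines, ' ');

    // Each iteration writes to a disjoint byte range, so no locking is needed.
    // Without OpenMP enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < static_cast<long long>(n); ++i)
    {
        const std::size_t idx = static_cast<std::size_t>(i);
        char field[32];
        std::snprintf(field, sizeof(field), "%*.*e", kFieldWidth, kPrecision, values[idx]);

        // Offset = bytes of all earlier fields + newlines emitted before this value.
        const std::size_t pos = idx * kFieldWidth + idx / per_line;
        std::memcpy(&out[pos], field, kFieldWidth);
        if ((idx + 1) % per_line == 0 || idx + 1 == n)
        {
            out[pos + kFieldWidth] = '\n';
        }
    }
    return out;
}
```

In a chunked writer built on this sketch, chunk sizes would need to be a multiple of `per_line` (except for the last chunk) so that line breaks fall in the same places as in a single-pass write.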

Motivation

When out_chg is enabled for large systems, Cube output can become a noticeable cost. Initial profiling showed that most of the time was spent in write_cube_data_records.

Further timing showed that the main bottleneck was not only disk I/O, but also converting a large number of floating-point grid values into text. Therefore, this PR first optimizes the formatting path, then adds asynchronous writing to hide part of the remaining write time.

Performance

Test case:

  • System: Si256 LCAO
  • Grid: 375 x 192 x 192
  • Output: SPIN1_CHG.cube
  • Formatting threads: 8 OpenMP threads on the output rank

Observed timing:

| Version | write_cube_total |
| --- | --- |
| Original | ~5.8-6.0 s |
| OpenMP formatting | ~0.77-1.36 s |
| OpenMP + async writing | ~0.71-1.33 s |

For the complete write_vdata_palgrid output path:

| Version | write_vdata_palgrid_total |
| --- | --- |
| Original | ~6.3-6.5 s |
| OpenMP formatting | ~1.30-1.73 s |
| OpenMP + async writing | ~1.24-1.71 s |

Correctness

The optimized path preserves:

  • Cube header format;
  • data ordering;
  • output precision;
  • number of values per data line;
  • rank-0 ordered file writing.

The change is limited to Cube output and does not modify SCF logic or physical quantities.

Notes

The asynchronous writer uses a bounded queue with capacity 2. If the background I/O thread is slower than formatting, push() blocks when the queue is full, preventing unbounded memory growth.
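
The blocking behaviour described above can be sketched with a mutex and two condition variables. This is a minimal illustration, not the implementation in this PR: the class names `BoundedQueue` and `AsyncWriter`, the `submit()`/`finish()` methods, and the use of `std::ostream` as the sink are all assumptions made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <ostream>
#include <sstream>  // only needed for the std::ostringstream in the usage note
#include <string>
#include <thread>
#include <utility>

// Hypothetical bounded FIFO queue: push() blocks while the queue is full.
class BoundedQueue
{
  public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }

    void push(std::string chunk)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push_back(std::move(chunk));
        not_empty_.notify_one();
    }

    // Returns false once the queue has been closed and fully drained.
    bool pop(std::string& chunk)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) { return false; }
        chunk = std::move(queue_.front());
        queue_.pop_front();
        not_full_.notify_one();
        return true;
    }

    void close()
    {
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        not_empty_.notify_all();
    }

  private:
    std::size_t capacity_;
    std::deque<std::string> queue_;
    bool closed_ = false;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
};

// Hypothetical writer: one background thread drains the queue in FIFO order,
// so chunks reach the stream in the order they were submitted.
class AsyncWriter
{
  public:
    explicit AsyncWriter(std::ostream& out, std::size_t capacity = 2)
        : out_(out), queue_(capacity), worker_([this] { run(); }) {}
    ~AsyncWriter() { finish(); }

    void submit(std::string chunk) { queue_.push(std::move(chunk)); }

    // Signals end of input and waits until every queued chunk is written.
    void finish()
    {
        queue_.close();
        if (worker_.joinable()) { worker_.join(); }
    }

  private:
    void run()
    {
        std::string chunk;
        while (queue_.pop(chunk))
        {
            out_.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        }
    }

    std::ostream& out_;
    BoundedQueue queue_;  // declared before worker_ so it exists when the thread starts
    std::thread worker_;
};
```

With this sketch, a caller would construct `AsyncWriter w(stream);`, call `w.submit(...)` once per formatted chunk, and call `w.finish()` before closing the stream. With a capacity of 2, at most two formatted chunks wait in the queue while a third is being written.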

This work is based on the LTS branch used in the course environment. If the target branch has diverged or contains concurrent course submissions, manual conflict resolution may be required.
