Skip to content

feat: GPU-accelerated WT KEDF multi_kernel convolution#7448

Open
SunsetStand wants to merge 7 commits into
deepmodeling:developfrom
SunsetStand:feature/ofdft-kedf-gpu
Open

feat: GPU-accelerated WT KEDF multi_kernel convolution#7448
SunsetStand wants to merge 7 commits into
deepmodeling:developfrom
SunsetStand:feature/ofdft-kedf-gpu

Conversation

@SunsetStand

@SunsetStand SunsetStand commented Jun 7, 2026

Copy link
Copy Markdown

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

This is a new feature — no existing issue. A CPU-vs-GPU correctness and performance benchmark is provided in the PR description below. An issue can be opened for discussion if preferred.

Unit Tests and/or Case Tests for my changes

A standalone benchmark (ofdft_cuda/) was used to verify correctness (GPU vs FFTW3 CPU reference, error < 1e-7 for WT KEDF) and measure performance (14.2× speedup at 96³ grid on RTX 4060). Integration into ABACUS's existing GPU CI pipeline (e.g., tests/integrate/ GPU OFDFT cases) is planned as a follow-up once CI GPU runners are confirmed available for this module.

What's changed?

This PR adds GPU acceleration for the WT KEDF multi_kernel() function, which is the most expensive single operation in OFDFT Wang-Teter calculations (up to 40% of total SCF time). The implementation:

  1. Uses ABACUS's existing GPU infrastructurepw_rho->real2recip_gpu() / recip2real_gpu() for FFT and memory_op for device memory management. No new external dependencies.
  2. Adds a single CUDA kernel (kedf_wt_recip_multiply) for element-wise G-space kernel multiplication, following the same pattern as existing GPU kernels in source_base/kernels/cuda/.
  3. Persistent GPU buffers are lazily allocated on first call and reused across SCF iterations. The WT kernel array (kernel_) is copied to device once since it is constant throughout the SCF cycle.
  4. Zero overhead when CUDA is disabled — all GPU code is guarded by #ifdef __CUDA and the CPU path is completely untouched.
    The GPU dispatch is a simple 5-line addition at the top of multi_kernel():
#ifdef __CUDA
    if (pw_rho->device == "gpu") {
        this->multi_kernel_gpu(prho, rkernel_rho, exponent, pw_rho);
        return;
    }
#endif

Performance: on an RTX 4060 Laptop GPU, the GPU path achieves 14.2× speedup for WT KEDF at typical OFDFT grid sizes (96³) compared to FFTW3 CPU, with correctness verified to < 1e-7 relative error. A full benchmark report is available in the standalone prototype (examples/ or as supplementary material upon request).

Any changes of core modules? (ignore if not applicable)

N/A — only modifies the OFDFT KEDF module (source_pw/module_ofdft/), which is not a core ESolver/Hamilt/Operator module.

@SunsetStand SunsetStand force-pushed the feature/ofdft-kedf-gpu branch 2 times, most recently from 25c2618 to d62a3f6 Compare June 7, 2026 03:19
Add GPU backend for KEDF_WT::multi_kernel() using cuFFT via
PW_Basis _gpu interface. Key changes:

- kedf_wt_gpu.cu: single CUDA kernel (kedf_wt_recip_multiply) for
  G-space element-wise kernel multiplication, plus multi_kernel_gpu()
  method that pipelines real2recip → kernel multiply → recip2real
  entirely on GPU. Persistent buffers allocated via memory_op.

- kedf_wt.h: GPU method declarations and buffer members under
  #ifdef __CUDA guard (zero overhead when CUDA disabled).

- kedf_wt.cpp: GPU dispatch at top of multi_kernel() — when
  pw_rho->device == "gpu", delegates to multi_kernel_gpu().

- source/CMakeLists.txt: add kedf_wt_gpu.cu to USE_CUDA block.

Design follows existing ABACUS GPU patterns (memory_op for device
memory, thrust::complex in kernels, CHECK_CUDA_SYNC for safety).
@SunsetStand SunsetStand force-pushed the feature/ofdft-kedf-gpu branch from d62a3f6 to b93c9cd Compare June 7, 2026 03:23
- kedf_wt.h: #include <cufft.h> was erroneously inside the class body
  (both in destructor and private section). This caused the cuFFT header
  extern "C" block to appear inside a C++ class definition, triggering
  "linkage specification is not allowed" and all cuFFT types undeclared.
  Moved the include to file scope, guarded by #ifdef __CUDA.

- kedf_wt_gpu.cu: d_result_ is double* but resmem_zd_op/delmem_zd_op are
  typed std::complex<double>*. Changed to resmem_dd_op/delmem_dd_op
  (nrxx*2 doubles = nrxx complex doubles).
@mohanchen

Copy link
Copy Markdown
Collaborator

Nice try, could you provide some tests/examples in the PR? you can check out /tests/07_OFDFT

@mohanchen mohanchen added Refactor Refactor ABACUS codes Features Needed The features are indeed needed, and developers should have sophisticated knowledge GPU & DCU & HPC GPU and DCU and HPC related any issues and removed Refactor Refactor ABACUS codes labels Jun 9, 2026
- Add test directory with INPUT (device=gpu), STRU, KPT, result.ref
- Test identical to 09_OF_KE_WT but exercises GPU code path
- Add CASES_GPU.txt for GPU test discovery
- GPU results should match CPU reference within tolerance
@SunsetStand

Copy link
Copy Markdown
Author

Thanks for the review! I've added a GPU WT KEDF test case:

Test location: tests/07_OFDFT/31_OF_KE_WT_GPU/

It mirrors 09_OF_KE_WT (Al FCC, WT KEDF, symmetry=on) with device gpu added to INPUT. The GPU path performs identical math—cuFFT replaces CPU FFT + GPU kernels for element-wise ops—so results should match the CPU reference within tolerance.

Comment thread source/CMakeLists.txt Outdated
source_base/kernels/cuda/math_kernel_op.cu
source_base/kernels/cuda/math_kernel_op_vec.cu
source_hamilt/module_xc/kernels/cuda/xc_functional_op.cu
source_pw/module_ofdft/kedf_wt_gpu.cu

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! Please move source_pw/module_ofdft/kedf_wt_gpu.cu to source_pw/module_ofdft/kernels/cuda/kedf_wt_gpu.cu to keep consistency with other modules.

sunliang98 and others added 3 commits June 9, 2026 21:26
Per reviewer request (sunliang98): keep GPU kernel files organized
under kernels/cuda/ subdirectory, consistent with other ABACUS modules.
After moving kedf_wt_gpu.cu to kernels/cuda/, the bare include
#include "kedf_wt.h" no longer resolves since the header is now in the
parent directory. Use full module path consistent with other CUDA
kernel files (e.g., module_pwdft/kernels/cuda/*.cu).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Features Needed The features are indeed needed, and developers should have sophisticated knowledge GPU & DCU & HPC GPU and DCU and HPC related any issues project_learning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants