What happened?
Of the nfld0 tests on GPU, the mpi2 cases currently fail on ECMWF's Grace-Hopper platform:
Start 183: ectrans-benchmark-gpu-dp_T47_O48_mpi0_omp1_callmode1_nfld0
1/12 Test #183: ectrans-benchmark-gpu-dp_T47_O48_mpi0_omp1_callmode1_nfld0 ... Passed 3.61 sec
Start 188: ectrans-benchmark-gpu-dp_T47_O48_mpi0_omp1_callmode2_nfld0
2/12 Test #188: ectrans-benchmark-gpu-dp_T47_O48_mpi0_omp1_callmode2_nfld0 ... Passed 3.87 sec
Start 193: ectrans-benchmark-gpu-dp_T47_O48_mpi1_omp1_callmode1_nfld0
3/12 Test #193: ectrans-benchmark-gpu-dp_T47_O48_mpi1_omp1_callmode1_nfld0 ... Passed 4.33 sec
Start 198: ectrans-benchmark-gpu-dp_T47_O48_mpi1_omp1_callmode2_nfld0
4/12 Test #198: ectrans-benchmark-gpu-dp_T47_O48_mpi1_omp1_callmode2_nfld0 ... Passed 5.15 sec
Start 203: ectrans-benchmark-gpu-dp_T47_O48_mpi2_omp1_callmode1_nfld0
5/12 Test #203: ectrans-benchmark-gpu-dp_T47_O48_mpi2_omp1_callmode1_nfld0 ...***Failed 7.42 sec
Start 208: ectrans-benchmark-gpu-dp_T47_O48_mpi2_omp1_callmode2_nfld0
6/12 Test #208: ectrans-benchmark-gpu-dp_T47_O48_mpi2_omp1_callmode2_nfld0 ...***Failed 6.88 sec
Start 213: ectrans-benchmark-gpu-sp_T47_O48_mpi0_omp1_callmode1_nfld0
7/12 Test #213: ectrans-benchmark-gpu-sp_T47_O48_mpi0_omp1_callmode1_nfld0 ... Passed 10.15 sec
Start 218: ectrans-benchmark-gpu-sp_T47_O48_mpi0_omp1_callmode2_nfld0
8/12 Test #218: ectrans-benchmark-gpu-sp_T47_O48_mpi0_omp1_callmode2_nfld0 ... Passed 7.47 sec
Start 223: ectrans-benchmark-gpu-sp_T47_O48_mpi1_omp1_callmode1_nfld0
9/12 Test #223: ectrans-benchmark-gpu-sp_T47_O48_mpi1_omp1_callmode1_nfld0 ... Passed 11.22 sec
Start 228: ectrans-benchmark-gpu-sp_T47_O48_mpi1_omp1_callmode2_nfld0
10/12 Test #228: ectrans-benchmark-gpu-sp_T47_O48_mpi1_omp1_callmode2_nfld0 ... Passed 5.43 sec
Start 233: ectrans-benchmark-gpu-sp_T47_O48_mpi2_omp1_callmode1_nfld0
11/12 Test #233: ectrans-benchmark-gpu-sp_T47_O48_mpi2_omp1_callmode1_nfld0 ...***Failed 13.04 sec
Start 238: ectrans-benchmark-gpu-sp_T47_O48_mpi2_omp1_callmode2_nfld0
12/12 Test #238: ectrans-benchmark-gpu-sp_T47_O48_mpi2_omp1_callmode2_nfld0 ...***Failed 7.40 sec
These tests pass when GPU_AWARE_MPI is disabled.
The crash is a segfault occurring here (line 802). It can be reproduced by
mpiexec -n 2 ./bin/ectrans-benchmark-gpu-dp -t 47 --nfld 0
Interestingly, higher resolutions (e.g. -t 95) don't show the crash. The lowest resolution I can run without experiencing the segfault is T70.
If I add --nlev 2 or --nprtrv 2, the crash goes away, so it is probably related to the W-set splitting of grid point arrays.
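The resolution dependence described above could be mapped out with a small scan like the one below. This is only a sketch: the `scan_truncations` helper and the `BENCH_CMD` variable are hypothetical names introduced here, and the default command assumes the same binary path and `mpiexec` invocation as the reproducer above. Only the `-t` and `--nfld 0` flags are taken from this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of ecTrans): run the reproducer at several
# truncations and report which ones exit with a non-zero status.
# BENCH_CMD is an assumption; override it to match your build tree.
BENCH_CMD="${BENCH_CMD:-mpiexec -n 2 ./bin/ectrans-benchmark-gpu-dp}"

scan_truncations() {
    # One benchmark run per truncation passed as an argument.
    for t in "$@"; do
        # BENCH_CMD is left unquoted on purpose so it splits into words.
        if $BENCH_CMD -t "$t" --nfld 0 >/dev/null 2>&1; then
            echo "T$t pass"
        else
            echo "T$t FAIL"
        fi
    done
}

# Example call; the truncation values are arbitrary choices:
#   scan_truncations 47 63 70 95
```

A segfault shows up as a non-zero exit status from `mpiexec`, so the scan needs no log parsing. The same loop could be repeated with `--nlev 2` or `--nprtrv 2` appended to `BENCH_CMD` to check that the crash disappears.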
Our CI suite on ECMWF's AC cluster (A100-based) doesn't show this crash.
Possibilities:
- There is a bug in the NVHPC version we're using (25.9).
- There is a bug in the way ZCOMBUFS is allocated, which only manifests for edge cases such as low resolution and/or minimal fields.
Suspicious:
- TRLTOG references D%NLENGTF, which is a Fourier-space related dimension. This doesn't make sense to me, because by this point in the code we're no longer in Fourier space.
What are the steps to reproduce the bug?
ECMWF / AG (Grace Hopper cluster)
Currently Loaded Modules:
1) prgenv/expert 2) nvidia/25.9 3) hpcx-openmpi/2.21.3-cuda:nvidia:25.9 4) fftw/3.3.10:nvidia:25.11 5) cmake/3.31.6
FIAT at version develop:230b015.
Configure with -DENABLE_GPU=ON -DENABLE_ACC=ON.
Version
develop:40d6bc2
Platform (OS and architecture)
ECMWF / AG
Relevant log output
Accompanying data
No response
Organisation
No response