
nfld0 segfaults on GPU with GPU-aware comms for NPROC > 2 #362

@samhatfield

Description

What happened?

Of the nfld0 tests on GPU, the mpi2 cases currently fail on ECMWF's Grace-Hopper platform:

      Start 183: ectrans-benchmark-gpu-dp_T47_O48_mpi0_omp1_callmode1_nfld0
 1/12 Test #183: ectrans-benchmark-gpu-dp_T47_O48_mpi0_omp1_callmode1_nfld0 ...   Passed    3.61 sec
      Start 188: ectrans-benchmark-gpu-dp_T47_O48_mpi0_omp1_callmode2_nfld0
 2/12 Test #188: ectrans-benchmark-gpu-dp_T47_O48_mpi0_omp1_callmode2_nfld0 ...   Passed    3.87 sec
      Start 193: ectrans-benchmark-gpu-dp_T47_O48_mpi1_omp1_callmode1_nfld0
 3/12 Test #193: ectrans-benchmark-gpu-dp_T47_O48_mpi1_omp1_callmode1_nfld0 ...   Passed    4.33 sec
      Start 198: ectrans-benchmark-gpu-dp_T47_O48_mpi1_omp1_callmode2_nfld0
 4/12 Test #198: ectrans-benchmark-gpu-dp_T47_O48_mpi1_omp1_callmode2_nfld0 ...   Passed    5.15 sec
      Start 203: ectrans-benchmark-gpu-dp_T47_O48_mpi2_omp1_callmode1_nfld0
 5/12 Test #203: ectrans-benchmark-gpu-dp_T47_O48_mpi2_omp1_callmode1_nfld0 ...***Failed    7.42 sec
      Start 208: ectrans-benchmark-gpu-dp_T47_O48_mpi2_omp1_callmode2_nfld0
 6/12 Test #208: ectrans-benchmark-gpu-dp_T47_O48_mpi2_omp1_callmode2_nfld0 ...***Failed    6.88 sec
      Start 213: ectrans-benchmark-gpu-sp_T47_O48_mpi0_omp1_callmode1_nfld0
 7/12 Test #213: ectrans-benchmark-gpu-sp_T47_O48_mpi0_omp1_callmode1_nfld0 ...   Passed   10.15 sec
      Start 218: ectrans-benchmark-gpu-sp_T47_O48_mpi0_omp1_callmode2_nfld0
 8/12 Test #218: ectrans-benchmark-gpu-sp_T47_O48_mpi0_omp1_callmode2_nfld0 ...   Passed    7.47 sec
      Start 223: ectrans-benchmark-gpu-sp_T47_O48_mpi1_omp1_callmode1_nfld0
 9/12 Test #223: ectrans-benchmark-gpu-sp_T47_O48_mpi1_omp1_callmode1_nfld0 ...   Passed   11.22 sec
      Start 228: ectrans-benchmark-gpu-sp_T47_O48_mpi1_omp1_callmode2_nfld0
10/12 Test #228: ectrans-benchmark-gpu-sp_T47_O48_mpi1_omp1_callmode2_nfld0 ...   Passed    5.43 sec
      Start 233: ectrans-benchmark-gpu-sp_T47_O48_mpi2_omp1_callmode1_nfld0
11/12 Test #233: ectrans-benchmark-gpu-sp_T47_O48_mpi2_omp1_callmode1_nfld0 ...***Failed   13.04 sec
      Start 238: ectrans-benchmark-gpu-sp_T47_O48_mpi2_omp1_callmode2_nfld0
12/12 Test #238: ectrans-benchmark-gpu-sp_T47_O48_mpi2_omp1_callmode2_nfld0 ...***Failed    7.40 sec

These tests pass when GPU_AWARE_MPI is disabled.

The crash is a segfault occurring here (line 802). It can be reproduced by

mpiexec -n 2 ./bin/ectrans-benchmark-gpu-dp -t 47 --nfld 0

Interestingly, higher resolutions (e.g. -t 95) don't show the crash. The lowest resolution I can run without experiencing the segfault is T70.

If I add --nlev 2 or --nprtrv 2, the crash goes away. So it is probably related to the W-set splitting of grid-point arrays.
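For anyone triaging, the observations above (fails at T47, passes from T70, passes with --nlev 2 or --nprtrv 2) can be swept in one go. A sketch, assuming the benchmark binary path from the reproducer and an allocation where mpiexec -n 2 works; adjust for your build tree:

```shell
#!/usr/bin/env bash
# Sweep truncations and the workaround flags mentioned above to map out
# which configurations segfault. All flags are taken from this report;
# BIN is a placeholder for your build's benchmark binary.
BIN=./bin/ectrans-benchmark-gpu-dp

for t in 47 63 70 95; do
  for extra in "" "--nlev 2" "--nprtrv 2"; do
    echo "== T${t} ${extra} =="
    if mpiexec -n 2 "$BIN" -t "$t" --nfld 0 $extra; then
      echo "PASS"
    else
      echo "FAIL (exit $?)"
    fi
  done
done
```

This is environment-dependent (MPI launcher, GPUs), so it is only a triage sketch, not something runnable outside the cluster.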

Our CI suite on ECMWF's AC cluster (A100-based) doesn't show this crash.

Possibilities:

  • There is a bug in the NVHPC version we're using (25.9).
  • There is a bug in the way ZCOMBUFS is allocated which only manifests in edge cases such as low resolutions and/or minimal field counts.

Suspicious:

What are the steps to reproduce the bug?

ECMWF / AG (Grace Hopper cluster)

Currently Loaded Modules:

  1) prgenv/expert                         4) fftw/3.3.10:nvidia:25.11
  2) nvidia/25.9                           5) cmake/3.31.6
  3) hpcx-openmpi/2.21.3-cuda:nvidia:25.9

FIAT at version develop:230b015.

Configure with -DENABLE_GPU=ON -DENABLE_ACC=ON.
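Put together, the setup amounts to something like the following. The module names and CMake options are as listed above; the source and build directory names, and the exact `module load` spellings for the version-variant modules, are assumptions that may need adjusting on the cluster:

```shell
# Environment and configure sketch for ECMWF's AG (Grace Hopper) cluster.
# Module versions match the "Currently Loaded Modules" list above; the
# variant suffixes (":nvidia:25.9" etc.) may be resolved automatically
# once the nvidia module is loaded.
module load prgenv/expert nvidia/25.9 hpcx-openmpi/2.21.3-cuda \
            fftw/3.3.10 cmake/3.31.6

# "ectrans" and "build" are placeholder directory names.
cmake -S ectrans -B build -DENABLE_GPU=ON -DENABLE_ACC=ON
cmake --build build

# Run only the failing test family from this report.
ctest --test-dir build -R nfld0
```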

Version

develop:40d6bc2

Platform (OS and architecture)

ECMWF / AG

Relevant log output

Accompanying data

No response

Organisation

No response

Labels: bug