Runtime NaN and segfault with Kokkos::Profiling::ScopedRegion region() #283

@mihelog

Hello everyone,

I have a CUDA and Kokkos code that I am trying to profile using region names. I compiled kokkos-tools and generated "libkp_nvtx_connector.so". Then, in the source code, I add the following where applicable:

#include <Kokkos_Profiling_ScopedRegion.hpp>

and at the beginning of various code scopes I add a named region like this (the names vary; here is just one example):

Kokkos::Profiling::ScopedRegion region("HorizInterpRemapperBase::HorizInterpRemapperBase");
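For context, my understanding is that ScopedRegion is an RAII guard: it pushes the named region when the object is constructed and pops it when the object goes out of scope. The sketch below only illustrates that push/pop ordering with a hypothetical stand-in class. DemoScopedRegion, event_log, inner_work, and run_demo are made-up names for illustration, not part of Kokkos, so the snippet compiles without Kokkos or a GPU:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the profiling hooks; NOT the Kokkos implementation.
// It only records events so the push/pop ordering can be inspected.
inline std::vector<std::string>& event_log() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical RAII guard modeled on the push-on-construct /
// pop-on-destruct pattern of a scoped profiling region.
class DemoScopedRegion {
 public:
  explicit DemoScopedRegion(const std::string& name) : name_(name) {
    event_log().push_back("push " + name_);
  }
  ~DemoScopedRegion() { event_log().push_back("pop " + name_); }

  // Non-copyable, so a region is popped exactly once.
  DemoScopedRegion(const DemoScopedRegion&) = delete;
  DemoScopedRegion& operator=(const DemoScopedRegion&) = delete;

 private:
  std::string name_;
};

// A function with its own named region, nested inside the caller's region.
void inner_work() {
  DemoScopedRegion region("inner");
  event_log().push_back("work inner");
}

// Runs one outer region containing one inner region and returns the events.
std::vector<std::string> run_demo() {
  event_log().clear();
  {
    DemoScopedRegion region("outer");
    inner_work();
  }  // "outer" is popped here, after "inner" has already been popped
  return event_log();
}
```

Calling run_demo() records the events in this order: push outer, push inner, work inner, pop inner, pop outer. Regions nest and close in reverse order of creation, which is the behavior I expect from the real calls.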

Then, in the sbatch (Slurm) script, I set the tool library and launch the application with:

export KOKKOS_TOOLS_LIBS=/global/homes/m/mihelog/ThrustE/mihelog/kokkos-tools/profiling/nvtx-connector/kp_nvtx_connector.so
export KOKKOS_PROFILE_LIBRARY=/global/homes/m/mihelog/ThrustE/mihelog/kokkos-tools/profiling/nvtx-connector/kp_nvtx_connector.so
srun  --label  -n 4 -N 1 -c 32  --cpu_bind=cores   -m plane=4 <executable> <application command-line arguments>

The problem is that I get runtime errors shown at the end of this post. I've isolated the issue to these calls because when I comment them out, the application completes successfully.

Any thoughts would be great. Thanks in advance!

 number of MPI processes per node: min,max=           4           4
0:
2:  var,nvars:           1           1
2:            3  **ABORTING WITH ERROR: NaNs detected in repro sum input**
0: Note: nsplit=-1, while nsplit must be >=1. We know SCREAM does not know nsplit until runtime, so this is fine.
0:       Make sure nsplit is set to a valid value before calling prim_advance_subcycle!
0: gfr> nelemd 1350 qsize 10
0: compose> nelemd 1350 qsize 10 hv_q 1 hv_subcycle_q 6 lim 9 independent_time_steps 1
0:  var,nvars:           1           1
1:  var,nvars:           1           1
1:            2  **ABORTING WITH ERROR: NaNs detected in repro sum input**
2: MPICH ERROR [Rank 2] [job id 37885683.0] [Sat Apr 19 09:02:20 2025] [nid001413] - Abort(128) (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 2
2:
2: aborting job:
2: application called MPI_Abort(MPI_COMM_WORLD, 128) - process 2
3:  var,nvars:           1           1
3:            4  **ABORTING WITH ERROR: NaNs detected in repro sum input**
0:            1  **ABORTING WITH ERROR: NaNs detected in repro sum input**
1: MPICH ERROR [Rank 1] [job id 37885683.0] [Sat Apr 19 09:02:20 2025] [nid001413] - Abort(128) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 1
1:
0: MPICH ERROR [Rank 0] [job id 37885683.0] [Sat Apr 19 09:02:20 2025] [nid001413] - Abort(128) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 0
1: aborting job:
1: application called MPI_Abort(MPI_COMM_WORLD, 128) - process 1
0:
0: aborting job:
0: application called MPI_Abort(MPI_COMM_WORLD, 128) - process 0
3: MPICH ERROR [Rank 3] [job id 37885683.0] [Sat Apr 19 09:02:20 2025] [nid001413] - Abort(128) (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 3
3:
3: aborting job:
3: application called MPI_Abort(MPI_COMM_WORLD, 128) - process 3
1: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
1:
1: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
1:
1: Backtrace for this error:
2: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
2:
2: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
