Hello everyone,
I have a CUDA and Kokkos code that I'm trying to profile using region names. I have compiled kokkos-tools and generated "libkp_nvtx_connector.so". Then, in the source code, I add the following where applicable:
#include <Kokkos_Profiling_ScopedRegion.hpp>
and at the beginning of various code scopes I name the scope by adding a line like this (the names vary; here is just one example):
Kokkos::Profiling::ScopedRegion region("HorizInterpRemapperBase::HorizInterpRemapperBase");
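For context, a scoped region is an RAII guard: the region is pushed when the object is constructed and popped when it goes out of scope, so nested scopes are expected to produce balanced push/pop pairs. The standalone sketch below only imitates that pattern and does not use Kokkos. `MockScopedRegion`, `region_stack`, `event_log`, `run_demo`, and the region name `inner_step` are hypothetical names made up for illustration; they are not part of the Kokkos or kokkos-tools API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a profiling tool's region stack (not Kokkos).
static std::vector<std::string> region_stack;
// Log of push/pop events, recorded so the nesting order can be inspected.
static std::vector<std::string> event_log;

// Mimics the RAII shape of a scoped profiling region:
// push a named region on construction, pop it on destruction.
class MockScopedRegion {
 public:
  explicit MockScopedRegion(const std::string& name) {
    region_stack.push_back(name);
    event_log.push_back("push " + name);
  }
  ~MockScopedRegion() {
    // A pop without a matching push would indicate unbalanced regions.
    assert(!region_stack.empty());
    event_log.push_back("pop " + region_stack.back());
    region_stack.pop_back();
  }
  // Non-copyable so each region is popped exactly once.
  MockScopedRegion(const MockScopedRegion&) = delete;
  MockScopedRegion& operator=(const MockScopedRegion&) = delete;
};

// Opens an outer region and a nested inner one, then returns the event log.
std::vector<std::string> run_demo() {
  region_stack.clear();
  event_log.clear();
  {
    MockScopedRegion outer("HorizInterpRemapperBase::HorizInterpRemapperBase");
    {
      MockScopedRegion inner("inner_step");  // hypothetical nested region
    }  // inner is popped here
  }  // outer is popped here
  return event_log;
}
```

Calling `run_demo()` records the inner region's pop before the outer region's pop and leaves `region_stack` empty, which is the balanced nesting that a scoped region is meant to guarantee.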
Then, in the sbatch (Slurm) script, I set the tool library and launch the application with:
export KOKKOS_TOOLS_LIBS=/global/homes/m/mihelog/ThrustE/mihelog/kokkos-tools/profiling/nvtx-connector/kp_nvtx_connector.so
export KOKKOS_PROFILE_LIBRARY=/global/homes/m/mihelog/ThrustE/mihelog/kokkos-tools/profiling/nvtx-connector/kp_nvtx_connector.so
srun --label -n 4 -N 1 -c 32 --cpu_bind=cores -m plane=4 <executable> <application command-line arguments>
The problem is that I get runtime errors shown at the end of this post. I've isolated the issue to these calls because when I comment them out, the application completes successfully.
Any thoughts would be great. Thanks in advance!
number of MPI processes per node: min,max= 4 4
0:
2: var,nvars: 1 1
2: 3 **ABORTING WITH ERROR: NaNs detected in repro sum input**
0: Note: nsplit=-1, while nsplit must be >=1. We know SCREAM does not know nsplit until runtime, so this is fine.
0: Make sure nsplit is set to a valid value before calling prim_advance_subcycle!
0: gfr> nelemd 1350 qsize 10
0: compose> nelemd 1350 qsize 10 hv_q 1 hv_subcycle_q 6 lim 9 independent_time_steps 1
0: var,nvars: 1 1
1: var,nvars: 1 1
1: 2 **ABORTING WITH ERROR: NaNs detected in repro sum input**
2: MPICH ERROR [Rank 2] [job id 37885683.0] [Sat Apr 19 09:02:20 2025] [nid001413] - Abort(128) (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 2
2:
2: aborting job:
2: application called MPI_Abort(MPI_COMM_WORLD, 128) - process 2
3: var,nvars: 1 1
3: 4 **ABORTING WITH ERROR: NaNs detected in repro sum input**
0: 1 **ABORTING WITH ERROR: NaNs detected in repro sum input**
1: MPICH ERROR [Rank 1] [job id 37885683.0] [Sat Apr 19 09:02:20 2025] [nid001413] - Abort(128) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 1
1:
0: MPICH ERROR [Rank 0] [job id 37885683.0] [Sat Apr 19 09:02:20 2025] [nid001413] - Abort(128) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 0
1: aborting job:
1: application called MPI_Abort(MPI_COMM_WORLD, 128) - process 1
0:
0: aborting job:
0: application called MPI_Abort(MPI_COMM_WORLD, 128) - process 0
3: MPICH ERROR [Rank 3] [job id 37885683.0] [Sat Apr 19 09:02:20 2025] [nid001413] - Abort(128) (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 3
3:
3: aborting job:
3: application called MPI_Abort(MPI_COMM_WORLD, 128) - process 3
1: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
1:
1: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
1:
1: Backtrace for this error:
2: Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
2:
2: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.