Commit 2a6b783

Author: Spencer Bryngelson
fix: use --nodes/--ntasks-per-node for srun GPU dispatch
The Frontier templates use '--nodes 1 --ntasks-per-node N' for srun, not '--ntasks N'. With bare --ntasks, --gpu-bind closest lacks the node topology needed for correct GPU binding, causing Bus errors on multi-rank GPU tests (3D RDMA MPI).
1 parent 0194492 commit 2a6b783

1 file changed: +1 -1

toolchain/mfc/test/case.py

Lines changed: 1 addition & 1 deletion
@@ -119,7 +119,7 @@ def _mpi_cmd(cfg: MPIConfig, ppn: int, exe: str, gpu: bool = False) -> List[str]
         return [binary, "-np", str(ppn), *cfg.flags, exe]
     if binary == "srun":
         gpu_flags = ["--gpus-per-task", "1", "--gpu-bind", "closest"] if gpu else []
-        return [binary, "--ntasks", str(ppn), *gpu_flags, *cfg.flags, exe]
+        return [binary, "--nodes", "1", "--ntasks-per-node", str(ppn), *gpu_flags, *cfg.flags, exe]
     if binary == "jsrun":
         gpu_per_rs = "1" if gpu else "0"
         return [binary, "--nrs", str(ppn), "--cpu_per_rs", "1", "--gpu_per_rs", gpu_per_rs, "--tasks_per_rs", "1", *cfg.flags, exe]
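To make the effect of the change concrete, here is a minimal stand-alone sketch of the srun branch as it reads after this commit. It is not the actual `_mpi_cmd` function: the name `srun_cmd_sketch`, the `extra_flags` parameter (standing in for `cfg.flags`), and the `./simulation` executable are hypothetical and used only for illustration.

```python
from typing import List, Optional


def srun_cmd_sketch(ppn: int, exe: str, gpu: bool = False,
                    extra_flags: Optional[List[str]] = None) -> List[str]:
    """Hypothetical stand-alone version of the srun branch after this commit."""
    extra_flags = extra_flags or []
    # Request one GPU per rank and bind each rank to its closest GPU.
    gpu_flags = ["--gpus-per-task", "1", "--gpu-bind", "closest"] if gpu else []
    # State the node count and ranks per node explicitly, rather than
    # passing only a total task count with --ntasks.
    return ["srun", "--nodes", "1", "--ntasks-per-node", str(ppn),
            *gpu_flags, *extra_flags, exe]


if __name__ == "__main__":
    # "./simulation" is a placeholder executable name.
    print(" ".join(srun_cmd_sketch(2, "./simulation", gpu=True)))
    # prints: srun --nodes 1 --ntasks-per-node 2 --gpus-per-task 1 --gpu-bind closest ./simulation
```

With `gpu=False` the GPU flags are omitted and only the node and per-node task counts are passed, so CPU-only tests also use the single-node layout.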
