Commit f6918fa

sbryngelson and claude committed
Route Phoenix case-opt pre-build through SLURM to avoid 4GB cgroup limit
Phoenix login nodes have a 4GB per-user cgroup memory limit that OOM-kills case-optimized GPU builds (confirmed via dmesg: CONSTRAINT_MEMCG). Route the pre-build through submit.sh on Phoenix so it runs on a compute node with full memory. Frontier continues to pre-build on the login node.

Reverts retry/parallelism changes (max_attempts back to 3, -j back to 8) since the root cause was the cgroup, not parallelism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 881160d commit f6918fa
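
The commit message attributes the OOM kills to a per-user cgroup memory limit. As a minimal sketch of how such a limit might be inspected, the snippet below reads a cgroup v2 `memory.max` file. It assumes a systemd host that exposes the per-user slice at `/sys/fs/cgroup/user.slice/user-<uid>.slice/memory.max`; whether Phoenix uses this layout is an assumption, and `show_mem_limit` is a hypothetical helper that is not part of the repository.

```shell
#!/bin/bash
# Hypothetical helper: print the limit stored in a cgroup v2 memory.max file.
# The file holds either a byte count or the literal word "max" (no limit).
show_mem_limit() {
    local file="$1" raw
    raw=$(cat "$file") || return 1
    if [ "$raw" = "max" ]; then
        echo "unlimited"
    else
        # Integer division; a 4 GiB limit is stored as 4294967296 bytes.
        echo "$((raw / 1024 / 1024)) MiB"
    fi
}

# Assumed location of the per-user slice (systemd, cgroup v2); adjust per host.
limit_file="/sys/fs/cgroup/user.slice/user-$(id -u).slice/memory.max"
if [ -r "$limit_file" ]; then
    show_mem_limit "$limit_file"
fi
```

If the file is absent (for example on a cgroup v1 host), the snippet prints nothing. Kernel OOM reports that name `CONSTRAINT_MEMCG`, as cited in the commit message, indicate that a cgroup limit rather than total system memory triggered the kill.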

4 files changed: 28 additions, 17 deletions
.github/scripts/prebuild-case-optimization.sh

Lines changed: 10 additions & 10 deletions
@@ -1,14 +1,17 @@
 #!/bin/bash
 
-# Pre-builds all benchmark cases with --case-optimization on the login node.
-# Called by the case-optimization CI job before SLURM submission.
-# Usage: bash .github/scripts/prebuild-case-optimization.sh <cluster> <device> <interface>
+# Pre-builds all benchmark cases with --case-optimization.
+# Can run in two modes:
+# 1. Direct (Frontier login nodes): pass cluster/device/interface as args
+# 2. Inside SLURM (Phoenix): uses $job_device/$job_interface from submit.sh
+# Usage: bash prebuild-case-optimization.sh [<cluster> <device> <interface>]
 
 set -e
 
-cluster=$1
-job_device=$2
-job_interface=$3
+# Support both positional args (direct invocation) and env vars (SLURM via submit.sh)
+cluster="${1:-${job_cluster:-phoenix}}"
+job_device="${2:-$job_device}"
+job_interface="${3:-$job_interface}"
 
 # Derive module flag from cluster name
 case "$cluster" in
@@ -21,10 +24,7 @@ esac
 . ./mfc.sh load -c "$flag" -m g
 source .github/scripts/gpu-opts.sh
 
-# Case-optimized GPU builds are memory-intensive (nvfortran/CCE + target offload).
-# Login nodes have per-user cgroup memory limits (e.g., 4GB on Phoenix) that
-# cause OOM kills at higher parallelism.
 for case in benchmarks/*/case.py; do
     echo "=== Pre-building: $case ==="
-    ./mfc.sh build -i "$case" --case-optimization $gpu_opts -j 2
+    ./mfc.sh build -i "$case" --case-optimization $gpu_opts -j 8
 done
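
The new argument handling relies on nested `${var:-default}` expansion: a positional argument wins, then an environment variable, then a hard-coded default. A minimal sketch of that fallback chain, wrapped in a hypothetical `resolve_cluster` function for illustration (the script in the diff performs the expansion inline):

```shell
#!/bin/bash
# Hypothetical wrapper around the expansion used in the diff.
# Precedence: first positional argument, then $job_cluster, then "phoenix".
resolve_cluster() {
    # ${1:-...} falls through when $1 is unset OR empty.
    echo "${1:-${job_cluster:-phoenix}}"
}

resolve_cluster frontier   # argument given: prints "frontier"
```

Because `:-` treats an empty string the same as an unset variable, passing `""` as the first argument also falls through to the environment variable or the default. Note that `job_device` and `job_interface` have no final default in the diff, so they end up empty if neither the argument nor the variable is set.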

.github/scripts/retry-build.sh

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ _retry_clean() {
 retry_build() {
     local clean_cmd="${RETRY_CLEAN_CMD:-rm -rf build/staging build/install build/lock.yaml}"
     local validate_cmd="${RETRY_VALIDATE_CMD:-}"
-    local max_attempts=1
+    local max_attempts=3
     local attempt=1
     while [ $attempt -le $max_attempts ]; do
         echo "Build attempt $attempt of $max_attempts..."
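
The hunk shows only the head of `retry_build`, so the rest of its body is not visible here. To illustrate why `max_attempts=1` effectively disables retrying, here is a minimal sketch of a bounded retry loop. `retry_sketch` is a hypothetical function and deliberately omits the clean and validate steps that the real script configures through `RETRY_CLEAN_CMD` and `RETRY_VALIDATE_CMD`.

```shell
#!/bin/bash
# Hypothetical, simplified retry loop; not the repository's retry_build.
# Usage: retry_sketch <max_attempts> <command> [args...]
retry_sketch() {
    local max_attempts="$1"
    shift
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        echo "Build attempt $attempt of $max_attempts..."
        if "$@"; then
            return 0          # command succeeded, stop retrying
        fi
        attempt=$((attempt + 1))
    done
    return 1                  # every attempt failed
}

retry_sketch 3 true
```

With a limit of 1 the loop body runs once and any failure is final, which is the behavior this commit reverts.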

.github/workflows/bench.yml

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ jobs:
         if: matrix.build_script != ''
         uses: nick-fields/retry@v3
         with:
-          max_attempts: 1
+          max_attempts: 3
           retry_wait_seconds: 60
           timeout_minutes: 150
           command: |
112112
command: |

.github/workflows/test.yml

Lines changed: 16 additions & 5 deletions
@@ -249,7 +249,7 @@ jobs:
         if: matrix.cluster != 'phoenix'
         uses: nick-fields/retry@ce71cc2ab81d554ebbe88c79ab5975992d79ba08 # v3
         with:
-          max_attempts: 1
+          max_attempts: 3
           retry_wait_seconds: 60
           timeout_minutes: 60
           command: bash .github/workflows/${{ matrix.cluster }}/build.sh ${{ matrix.device }} ${{ matrix.interface }}
@@ -323,10 +323,15 @@ jobs:
         with:
           clean: false
 
-      - name: Pre-Build
+      - name: Pre-Build (SLURM)
+        if: matrix.cluster == 'phoenix'
+        run: bash .github/workflows/phoenix/submit.sh .github/scripts/prebuild-case-optimization.sh ${{ matrix.device }} ${{ matrix.interface }}
+
+      - name: Pre-Build (login node)
+        if: matrix.cluster != 'phoenix'
         uses: nick-fields/retry@ce71cc2ab81d554ebbe88c79ab5975992d79ba08 # v3
         with:
-          max_attempts: 1
+          max_attempts: 3
           retry_wait_seconds: 60
           timeout_minutes: 120
           command: bash .github/scripts/prebuild-case-optimization.sh ${{ matrix.cluster }} ${{ matrix.device }} ${{ matrix.interface }}
@@ -337,11 +342,17 @@ jobs:
 
       - name: Print Logs
         if: always()
-        run: cat run-case-optimization-${{ matrix.device }}-${{ matrix.interface }}.out
+        run: |
+          for f in prebuild-case-optimization-${{ matrix.device }}-${{ matrix.interface }}.out \
+                   run-case-optimization-${{ matrix.device }}-${{ matrix.interface }}.out; do
+            [ -f "$f" ] && echo "=== $f ===" && cat "$f"
+          done
 
       - name: Archive Logs
         uses: actions/upload-artifact@v4
         if: always()
         with:
           name: case-opt-${{ strategy.job-index }}-${{ matrix.cluster }}-${{ matrix.interface }}
-          path: run-case-optimization-${{ matrix.device }}-${{ matrix.interface }}.out
+          path: |
+            prebuild-case-optimization-${{ matrix.device }}-${{ matrix.interface }}.out
+            run-case-optimization-${{ matrix.device }}-${{ matrix.interface }}.out
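
One detail of the Print Logs loop may be worth a closer look. A `for` loop returns the status of the last command it ran, so with the `[ -f "$f" ] && ...` form the loop exits non-zero whenever the last listed file is missing, and a `run` step ends with the status of its last command. The sketch below is a possible variant using `if`, which returns 0 when the file is absent. `print_logs` and the file names in the usage line are hypothetical; this is not a change made by the commit.

```shell
#!/bin/bash
# Hypothetical variant of the Print Logs loop: print each log that exists
# and skip missing ones without turning them into a failure status.
print_logs() {
    local f
    for f in "$@"; do
        if [ -f "$f" ]; then
            echo "=== $f ==="
            cat "$f"
        fi
    done
}

# Example file names are placeholders for the matrix-expanded names.
print_logs prebuild-case-optimization-gpu-acc.out run-case-optimization-gpu-acc.out
```

Whether the non-zero status matters in practice depends on which log files exist when the step runs; on Phoenix a failed pre-build would presumably leave the `run-case-optimization-*` file absent.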
