Fix SIGILL crashes on GitHub runners via CPU-aware build cache keys#1278
Conversation
GitHub-hosted runners have heterogeneous CPU architectures. When a build cached on a runner with AVX-512 is restored on a runner without it, the binary crashes with SIGILL (illegal instruction). This affected all 9 Chemistry tests on ubuntu-mpi-no-debug. Add MFC_NATIVE_CPU CMake option (default ON) gated by MFC_NO_MARCH_NATIVE env var. Set the env var in GitHub-hosted CI jobs only — self-hosted HPC runners (Phoenix, Frontier) continue using -march=native. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GitHub-hosted runners have heterogeneous CPU architectures. When a build cached on a runner with AVX-512 is restored on a runner without it, the binary crashes with SIGILL (illegal instruction). This affected Chemistry tests on ubuntu-mpi-no-debug. Include a hash of CPU model and compiler versions in the GitHub Actions build cache key so that builds compiled with -march=native are only restored on runners with matching hardware and toolchain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cache key hash must reflect the actual compiler (e.g. ifx for Intel jobs, gfortran-15 for macOS). Moving the sys-info step after all setup steps ensures $FC and $CC are set correctly before hashing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…precision
When matrix.precision is empty, --${{ matrix.precision }} expands to bare --,
which is the POSIX end-of-options marker. This causes --test-all after it to
be treated as a positional argument, so post_process is never built in the
Build (dry-run) step. It then gets built in the Test step instead, including
its fftw/hdf5/silo dependencies.
Move precision to an env var that only emits --single when non-empty.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ler versions The Intel setvars.sh sets PATH/LD_LIBRARY_PATH but never exports FC or CC, so the sys-info cache key step was falling back to gfortran/gcc versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…the C compiler The Intel CI only installs the Fortran compiler (ifx), not the C/C++ compiler (icx). Setting CC=icx causes CMake to fail with "Could not find compiler". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Setting FC=ifx in GITHUB_ENV caused CMake's lapack Fortran/C interop check to fail (ifx + gcc linking). The Intel setvars.sh already puts ifx on PATH, so CMake finds it without FC being set. Instead, detect ifx on PATH in the sys-info step to include its version in the cache hash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1278 +/- ##
=======================================
Coverage 44.04% 44.04%
=======================================
Files 70 70
Lines 20499 20499
Branches 1993 1993
=======================================
Hits 9028 9028
Misses 10330 10330
Partials 1141 1141 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hub.com/sbryngelson/mfc into fix/disable-march-native-github-runners
Claude Code ReviewHead SHA: 10a10e9 Summary
Findings1. PR description mentions Intel 2. Inconsistency between 3. 4. ARM Linux: No blocking issuesThe core logic is correct — the |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review infoConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe pull request modifies two GitHub Actions workflows to make build cache keys system-dependent. Both Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR fixes intermittent SIGILL crashes on GitHub-hosted runners (issue #1277) caused by -march=native binaries being cached on one CPU architecture and restored on another. It also fixes a related build-step bug where --test-all was silently dropped.
Changes:
- Add a CPU + compiler version hash (
sys-hash) to GitHub Actions build cache keys, preventing cross-CPU cache collisions - Fix
--test-allbeing swallowed by movingmatrix.precisionfrom an inline CLI expansion (--${{ matrix.precision }}) to an env var ($PRECISION), preventing a bare--(POSIX end-of-options) when precision is empty - Apply the same CPU-aware cache key to
coverage.yml
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
.github/workflows/test.yml |
Adds Get system info for cache key step (CPU model + Fortran/C compiler version hash); updates cache key to include sys-hash; moves precision flag to $PRECISION env var in Build step |
.github/workflows/coverage.yml |
Adds the same Get system info for cache key step and incorporates sys-hash into the coverage cache key |
Two issues were found:
-
test.ymlline 141 —CCstill hashesgccon Intel builds: The PR description states "FC=ifxandCC=icxexplicitly for Intel CI jobs," but the Intel setup step never writesCC=icxto$GITHUB_ENV(sincesetvars.shdoesn't setCC). Theifxdetection on line 140 works correctly via PATH, but the C compiler hash always resolves togcc. This is partially mitigated bymatrix.intelalready being part of the cache key, but it's an inaccuracy and a discrepancy with the stated fix. -
coverage.ymllines 38–47 —Get system inforuns beforeSetup Ubuntu: The compiler hash is computed beforeapt install gcc g++runs. If the installed version differs from the pre-installed default, the hash won't reflect the actual build compiler. Intest.ymlthis step is correctly placed after all OS setup steps.
| uname -m | ||
| cat /proc/cpuinfo 2>/dev/null | grep 'model name' | head -1 || sysctl -n machdep.cpu.brand_string 2>/dev/null || true | ||
| if command -v ifx &>/dev/null; then ifx --version 2>/dev/null | head -1; else ${FC:-gfortran} --version 2>/dev/null | head -1 || true; fi | ||
| ${CC:-gcc} --version 2>/dev/null | head -1 || true |
There was a problem hiding this comment.
The PR description states "Set FC=ifx and CC=icx explicitly for Intel CI jobs", but the code does not explicitly write FC=ifx or CC=icx to $GITHUB_ENV in the Intel setup step. The ifx Fortran compiler is detected correctly on line 140 via command -v ifx (since setvars.sh updates PATH which is exported via printenv >> $GITHUB_ENV). However, CC is not set by setvars.sh, so line 141 always hashes gcc's version rather than icx's on Intel runs. While this is mitigated by ${{ matrix.intel }} already being part of the cache key, the hash does not accurately reflect the Intel C compiler version. Adding echo "CC=icx" >> $GITHUB_ENV in the Intel setup step (alongside the printenv call) would fix this inconsistency.
| - name: Get system info for cache key | ||
| id: sys-info | ||
| run: | | ||
| { | ||
| uname -m | ||
| cat /proc/cpuinfo 2>/dev/null | grep 'model name' | head -1 || sysctl -n machdep.cpu.brand_string 2>/dev/null || true | ||
| ${FC:-gfortran} --version 2>/dev/null | head -1 || true | ||
| ${CC:-gcc} --version 2>/dev/null | head -1 || true | ||
| } | (sha256sum 2>/dev/null || shasum -a 256) | cut -c1-16 > /tmp/sys-hash | ||
| echo "sys-hash=$(cat /tmp/sys-hash)" >> "$GITHUB_OUTPUT" |
There was a problem hiding this comment.
The "Get system info for cache key" step runs before "Setup Ubuntu" (which installs gcc, g++, etc.). If apt install gcc installs or upgrades to a different compiler version than what is pre-installed on ubuntu-latest, the hash will reflect the pre-installation version rather than the version used for the build. For correctness, this step should be placed after "Setup Ubuntu", consistent with the ordering in test.yml.
Summary
-march=nativebinaries are restored on runners with different instruction sets--test-allbeing silently ignored in the Build step, causing post_process (+ hdf5/silo/fftw) to rebuild during the Test stepFC=ifxandCC=icxexplicitly for Intel CI jobsFixes #1277
Details
CPU-aware cache keys
GitHub-hosted runners have heterogeneous CPU architectures (different Intel/AMD generations). MFC builds with
-march=native, so a binary cached on a runner with AVX-512 will crash withSIGILLwhen restored on a runner without it.The fix hashes
uname -m, CPU model name, and Fortran/C compiler versions into the cache key. Different hardware or compiler → different key → no cache reuse.Fix
--test-allin Build stepWhen
matrix.precisionis empty,--${{ matrix.precision }}expands to bare--(the POSIX end-of-options marker), causing--test-allafter it to be treated as a positional argument. This meantpost_processwas never built in the dry-run Build step, so the Test step had to build it from scratch — including fftw, hdf5, and silo dependencies.Fixed by moving precision to an env var that only emits
--singlewhen non-empty.Intel compiler env vars
The Intel
setvars.shsets PATH/LD_LIBRARY_PATH but never exportsFCorCC. Without these, the sys-info cache key step was hashing gfortran/gcc versions instead of ifx/icx, and MFC's build relied on CMake finding compilers via PATH alone.Test plan
ubuntu, mpi, no-debug, trueChemistry tests)ubuntu, no-mpi, single, no-debug, falsepasses (was broken by previous approach)shasumfallback +sysctlfor CPU info)--test-allis active