
IBM Feature Improvements and Speedup#1157

Open
danieljvickers wants to merge 80 commits into MFlowCode:master from danieljvickers:gpu-optimizations

Conversation

@danieljvickers (Member) commented Feb 17, 2026

User description

Description

Following the refactor of the levelset, several performance optimizations remained to be made to the code. This PR introduces optimizations that make multi-particle MIBM code viable. It also raises the upper bound on the number of allowed immersed boundaries to 1000. Performance was measured on 1-4 ranks of ACC GPU compute using A100 GPUs.

This PR also extends these optimizations to STL IBs, which should significantly improve accuracy, performance, and code cleanliness. The primary optimizations are as follows:

  • IB compute is now fully supported on the GPU. Processing of the mesh into memory has been parallelized, as well as IB marker generation and levelset compute.
  • Interpolation of IB markers was removed in favor of a projection-based distance search. This means that we no longer need to interpolate (saving significant pre-processing time and memory), we iterate over significantly fewer vertices (by a factor of 50 for the coarsest grids), and we get exact levelset values instead of approximations limited by how fine the interpolated mesh was.
  • Unified distance and normal compute. We no longer perform a distance search followed by a separate normal-vector search; both are handled in a single subroutine, cutting search time in half.
  • Deleted about 1000 lines of code related to the maintenance and compute of STL models, leaving a cleaner base to build on.
  • Modified STL marker generation to be cell-center based instead of volume-fraction based. This prevents edge cases where the cell center is outside the model but the point is labeled as inside the model. That error produced levelsets pointing into the immersed boundaries and caused instabilities.
  • Periodic Immersed Boundary conditions now support immersed boundaries being periodic as well as the fluid.
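The projection-based distance search above reduces, per cell center, to a closest-point-on-triangle query over the STL facets. As a rough illustration only (the PR's implementation is Fortran; the function name and numpy dependency here are assumptions), the standard region-classification algorithm looks like this:

```python
import numpy as np

def closest_point_on_triangle(p, a, b, c):
    # Classify p against the triangle's vertex, edge, and face Voronoi
    # regions, then project onto the nearest feature.
    ab, ac, ap = b - a, c - a, p - a
    d1, d2 = ab @ ap, ac @ ap
    if d1 <= 0 and d2 <= 0:
        return a                               # vertex region A
    bp = p - b
    d3, d4 = ab @ bp, ac @ bp
    if d3 >= 0 and d4 <= d3:
        return b                               # vertex region B
    vc = d1 * d4 - d3 * d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        return a + (d1 / (d1 - d3)) * ab       # edge AB
    cp = p - c
    d5, d6 = ab @ cp, ac @ cp
    if d6 >= 0 and d5 <= d6:
        return c                               # vertex region C
    vb = d5 * d2 - d1 * d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        return a + (d2 / (d2 - d6)) * ac       # edge AC
    va = d3 * d6 - d5 * d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))
        return b + w * (c - b)                 # edge BC
    denom = 1.0 / (va + vb + vc)
    return a + (vb * denom) * ab + (vc * denom) * ac  # face interior
```

Taking the minimum distance over all triangles, with the sign from the facet normal, yields an exact levelset value rather than one limited by interpolation resolution.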

Type of change

  • New feature
  • Refactor
  • Other: Performance Tuning

Testing

All changes pass the IBM section of the test suite on GPUs with the NVHPC compiler. Performance was measured with a case of 1000 particles with viscosity enabled. The particles are all resolved 3D spheres given random non-overlapping positions generated by the following case file:

import json

num_cells = [240, 240, 240]
dim = [8., 8., 8.]
num_particles = 1000

# create a stationary fluid
case = {
    "run_time_info": "T",
    "parallel_io": "T",
    "m": num_cells[0]-1,
    "n": num_cells[1]-1,
    "p": num_cells[2]-1,
    "dt": 0.005,
    "t_step_start": 0,
    "t_step_stop": 2,
    "t_step_save": 1,
    "num_patches": 1,
    "model_eqns": 2,
    "alt_soundspeed": "F",
    "num_fluids": 1,
    "mpp_lim": "F",
    "mixture_err": "T",
    "time_stepper": 3,
    "recon_type": 1,
    "weno_eps": 1e-16,
    "riemann_solver": 2,
    "wave_speeds": 1,
    "avg_state": 2,
    "precision": 2,
    "format": 1,
    "prim_vars_wrt": "T",
    "E_wrt": "T",
    "viscous": "T",
    "x_domain%beg": -0.5*dim[0],
    "x_domain%end": 0.5*dim[0],
    "y_domain%beg": -0.5*dim[1],
    "y_domain%end": 0.5*dim[1],
    "z_domain%beg": -0.5*dim[2],
    "z_domain%end": 0.5*dim[2],
    "bc_x%beg": -3,
    "bc_x%end": -3,
    "bc_y%beg": -3,
    "bc_y%end": -3,
    "bc_z%beg": -3,
    "bc_z%end": -3,
    "patch_icpp(1)%geometry": 9,
    "patch_icpp(1)%z_centroid": 0.0,
    "patch_icpp(1)%length_z": dim[2],
    "patch_icpp(1)%y_centroid": 0.0,
    "patch_icpp(1)%length_y": dim[1],
    "patch_icpp(1)%x_centroid": 0.0,
    "patch_icpp(1)%length_x": dim[0],
    "weno_order": 5,
    "patch_icpp(1)%pres": 1.0,
    "patch_icpp(1)%alpha_rho(1)": 1.0,
    "patch_icpp(1)%alpha(1)": 1.0,
    "patch_icpp(1)%vel(1)": 0.0,
    "patch_icpp(1)%vel(2)": 0.0,
    "patch_icpp(1)%vel(3)": 0.0,
    "fluid_pp(1)%gamma": 2.5000000000000004,
    "fluid_pp(1)%pi_inf": 0.0,
    "fluid_pp(1)%Re(1)": 2500000,
}

import random
random.seed(42)

dx = [float(dim[i]) / float(num_cells[i]) for i in range(3)]

# set particle properties
particle_radius_cells = 5 # particle radius in grid cells
particle_cell_spacing = particle_radius_cells*2 + 5 # spacing of double the radius plus 5 cells to guarantee no image points land in other IBs
mpi_cell_spacing = particle_radius_cells + 5 # keep particles safely away from MPI halo regions to prevent out-of-bounds errors
generation_bounds = [6., 6., 6.] # generate particles in this box, safely away from the boundary
velocity_magnitude = 1.

# convert non-dimensional grid-cell units to the units of the simulation
radius_units = float(particle_radius_cells) * dx[0]
particle_units_spacing = float(particle_cell_spacing) * dx[0]
particle_units_spacing_squared = particle_units_spacing**2
mpi_units_spacing = [float(mpi_cell_spacing) * dx[i] for i in range(3)]

# generate an array of xyz values that guarantees non-overlapping particles
particles = []
while len(particles) < num_particles:
  # generate a completely random position
  position = [random.random() for i in range(3)]
  position = [(position[i] - 0.5) * generation_bounds[i] for i in range(3)]

  # optionally check that the particle is not too close to the MPI halo regions
  valid = True
  # for i in range(3):
  #   valid = valid and abs(position[i]) >= mpi_units_spacing[i]

  # check the minimum spacing against all accepted particles, exiting early on a violation
  for particle in particles:
    distance_squared = sum([(particle[i] - position[i])**2 for i in range(3)])
    valid = valid and distance_squared >= particle_units_spacing_squared
    if not valid:
      break

  if valid:
    particles.append(position)
    # print(f"\rProgress: {100.*float(len(particles))/float(num_particles)}%", end="", flush=True)

# print()
# convert out array of positions to valid JSON for the immersed boundary
ib_properties = {"ib": "T", "num_ibs": num_particles,}
for i in range(len(particles)):
  ib_properties[f'patch_ib({i+1})%radius'] = radius_units
  ib_properties[f'patch_ib({i+1})%slip'] = 'F'
  ib_properties[f'patch_ib({i+1})%geometry'] = 8
  ib_properties[f'patch_ib({i+1})%moving_ibm'] = 2
  ib_properties[f'patch_ib({i+1})%mass'] = 1.
  ib_properties[f'patch_ib({i+1})%x_centroid'] = particles[i][0]
  ib_properties[f'patch_ib({i+1})%y_centroid'] = particles[i][1]
  ib_properties[f'patch_ib({i+1})%z_centroid'] = particles[i][2]
  # move the particle away radially to guarantee they never touch during the simulation
  position_mag = (sum([(particles[i][j])**2 for j in range(3)]))**0.5
  for j in range(3):
    ib_properties[f'patch_ib({i+1})%vel({j+1})'] = particles[i][j] * velocity_magnitude / position_mag

print(json.dumps({**case, **ib_properties}))

These optimizations yield nearly a 1000x speedup in the moving IBM propagation and generation code. Prior to these optimizations, the benchmark case profiled with the NVIDIA Nsight profiler took 45 seconds per RK substep:

Screenshot 2026-02-16 at 2 57 19 PM

Following these optimizations, the same profile achieves almost 50 ms per RK substep:
Screenshot 2026-02-17 at 10 51 00 AM

For STLs, the optimizations were tested on a 822,000 vertex mesh of a Mach 0.4 corgi, given by this STL:
https://www.thingiverse.com/thing:4721563/files

The final simulation finished in a total of 25 minutes on a 200^3 grid for 4k time steps on a single A100 GPU. All of the code related to the STL model (file reading, preprocessing, IB marker generation, and levelset compute) took only 20 seconds of the run time. The result of that simulation can be viewed here:
https://www.youtube.com/watch?v=h44BNCKo0Hs

Checklist

  • I updated documentation if user-facing behavior changed

See the developer guide for full coding standards.

GPU changes (expand if you modified src/simulation/)
  • GPU results match CPU results
  • Tested on NVIDIA GPU or AMD GPU

CodeAnt-AI Description

GPU-accelerate STL immersed-boundary compute and support up to 1000 IBs

What Changed

  • STL models are read, transformed, packed into GPU-friendly flat arrays and uploaded so immersed-boundary (IB) marker and levelset checks run on the device instead of on CPU-side model structures
  • Levelset/inside-model queries now use GPU-ready routines (including a GPU-safe random-number step and a flattened triangle intersection path) so per-cell STL inside/distance tests execute on the GPU
  • IB patch routines and levelset application limit loops to grid regions that overlap each patch (binary-searched bounding indices), reducing the number of cells inspected for every patch
  • API and bookkeeping now handle many more IBs: the global patch limit is raised to 1000, and model metadata (counts, bounding boxes) is kept in GPU-updatable arrays
  • 2D/3D model distance and normal computation changed to projection-based nearest-point checks (exact triangle/edge projections) and triangle normals are normalized when reading STLs
  • Minor runtime observability: NVTX profiling ranges added around immersed-boundary propagation
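The binary-searched bounding indices mentioned above amount to searching the monotonically increasing cell-center coordinates for the index range that brackets a patch's bounding box. A minimal Python sketch (the actual routine is the Fortran get_bounding_indices; these names and the bisect-based form are illustrative assumptions):

```python
from bisect import bisect_left, bisect_right

def get_bounding_indices(cell_centers, lo, hi):
    """Return the index range [il, ir] of cell centers inside [lo, hi]."""
    il = bisect_left(cell_centers, lo)        # first center >= lo
    ir = bisect_right(cell_centers, hi) - 1   # last center <= hi
    return il, ir
```

Each patch loop then runs over il:ir instead of the full grid, so the per-patch cost scales with the patch footprint rather than the domain size, which is what makes the num_ibs = 1000 case tractable.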

Impact

✅ Faster IB marker generation
✅ Lower CPU usage during IB setup and levelset evaluation
✅ Support up to 1000 immersed boundaries


@danieljvickers danieljvickers marked this pull request as ready for review February 19, 2026 18:39
danieljvickers and others added 7 commits February 27, 2026 21:57
Take master's compiler_flag detection (for frontier_amd support) and
the PR's dynamic mode selection (gpu → "g", cpu → "c").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
@sbryngelson (Member) commented Feb 28, 2026

Hey @danieljvickers — I went through the code closely and found three bugs that should be fixed before merge. The rest of the PR looks good structurally (the MPI IB marker exchange removal is a valid simplification since each rank now independently computes markers over its full local domain including ghost cells).


Bugs (must fix)

Bug 1: ray_dirs incorrectly includes cell position — src/common/m_model.fpp:558

In f_model_is_inside, the ray direction computation adds point(k) to the random value:

ray_dirs(i, k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp

The intent is a random direction vector centered around zero (random - 0.5), but point(k) (the cell coordinate) is added, which biases the direction toward the cell's absolute position. After normalization on line 560, for a cell at e.g. (5.0, 3.0, 2.0), the "random" direction will be dominated by (5.0, 3.0, 2.0) with only a ±0.5 perturbation — so all spc rays point roughly the same way. The ray origin line immediately above (line 556) correctly uses point(k) for spatial jitter, but the direction should just be:

ray_dirs(i, k) = f_model_random_number(rand_seed) - 0.5_wp

This only affects f_model_is_inside (used in pre_process for IC generation, line 1703 of m_icpp_patches.fpp). The GPU simulation path f_model_is_inside_flat uses fixed deterministic ±1/0 directions and is correct. So the impact is incorrect volume-fraction estimates in pre_process for STL models, especially for cells far from the origin.
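To make the bias concrete, the fix simply drops the positional term so the direction is drawn from a cube centered on the origin. A hedged Python analogue of the corrected direction sampling (the real code is Fortran and uses its own RNG; this function name is hypothetical):

```python
import random

def random_unit_direction(rng):
    # Sample a direction uniformly from [-0.5, 0.5)^3 centered on zero,
    # then normalize. Note it does NOT depend on the query point's
    # coordinates, unlike the buggy version that added point(k).
    d = [rng.random() - 0.5 for _ in range(3)]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]
```

With the buggy form, a cell at (5.0, 3.0, 2.0) would see every "random" direction clustered around that vector, so the ray-crossing count loses its statistical independence.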


Bug 2: stl_bounding_boxes re-allocated inside patch loop — src/common/m_model.fpp:1138

In s_instantiate_STL_models, inside the do patch_id = 1, num_ibs loop:

allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))

stl_bounding_boxes is a single 3D allocatable (real(wp), allocatable :: stl_bounding_boxes(:, :, :) on line 42). On the first STL patch (say patch_id=1), this allocates shape (1, 3, 3). On a second STL patch (patch_id=2), Fortran will hit "already allocated" since the array is still allocated from the first iteration — runtime crash.

The fix is to allocate once before the loop with the full first dimension:

allocate(stl_bounding_boxes(1:num_ibs, 1:3, 1:3))

This only triggers with multiple STL IB patches (geometry 5 or 12), so single-STL cases work fine, which is why tests pass today.


Bug 3: Off-by-one in s_ib_3d_model upper x-bound — src/simulation/m_ib_patches.fpp:1005

ir = m + gp_layers - 1    ! line 1005 — BUG
jr = n + gp_layers + 1    ! line 1006 — correct
kr = p + gp_layers + 1    ! line 1007 — correct

Every other geometry routine in the file (circle, rectangle, sphere, cuboid, cylinder, airfoil, ellipse, 2D STL) consistently uses ir = m + gp_layers + 1. This is a typo: -1 instead of +1. The ir value is the initial upper bound before get_bounding_indices tightens it on line 1036, so it matters when the 3D STL bounding box extends to the rightmost x-cells — those last 2 cells would be excluded from IB marker checking, potentially missing ghost points near the upper x-boundary.

Fix: change line 1005 to ir = m + gp_layers + 1.


Should fix (easy, real risk)

4. Ellipse bounding indices unused — m_ib_patches.fpp:862-870

get_bounding_indices computes il/ir/jl/jr but the GPU loop on lines 869-870 uses hardcoded full-domain bounds (-gp_layers:m+gp_layers). Not a correctness bug — the ellipse condition still works — but the bounding box optimization is bypassed, which matters for the num_ibs=1000 use case. Easy fix: change the loop to do j = jl, jr / do i = il, ir like every other geometry routine.

5. Validator error message — case_validator.py:689

Still says "num_patches_max (10)" but the limit is now 1000.

6. GPU bounds check silenced — m_ibm.fpp (in s_compute_image_points)

The image point bounds check (index < -buff_size .or. index > bound) is wrapped in #if !defined(MFC_OpenACC) && !defined(MFC_OpenMP), so on GPU builds it's completely removed. If an image point lands outside the domain on GPU, the do while loop will silently run off the end of the array — potential memory corruption with no diagnostic. On CPU the code prints a detailed error and aborts. At minimum add a loop iteration cap or index clamp so it doesn't silently corrupt memory on GPU.


Test coverage (ideally before merge, or as fast follow-up)

7. Add a periodic IBM test case

The entire encode/decode/wrapping system (s_encode_patch_periodicity, s_decode_patch_periodicity, s_wrap_periodic_ibs, s_get_periodicities) is new and has zero test coverage. No test case uses bc_x%beg = -1 (periodic) with IBM enabled. A 2D case with a circle near a periodic boundary would exercise this.

8. Add a multi-rank (ppn=2) IBM test

The MPI IB marker exchange removal is architecturally sound — each rank now independently computes markers over its full local domain — but there's zero CI coverage for multi-rank IBM. The alter_ib function in cases.py doesn't push any IBM test with ppn=2. At least one multi-rank IBM test would verify the new approach.

@codecov codecov bot commented Feb 28, 2026

Codecov Report

❌ Patch coverage is 59.44541% with 234 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.94%. Comparing base (ab5082e) to head (03a5ce1).

Files with missing lines Patch % Lines
src/common/m_model.fpp 45.27% 98 Missing and 12 partials ⚠️
src/simulation/m_ib_patches.fpp 72.52% 70 Missing and 5 partials ⚠️
src/simulation/m_ibm.fpp 48.91% 32 Missing and 15 partials ⚠️
src/pre_process/m_icpp_patches.fpp 0.00% 1 Missing ⚠️
src/simulation/m_compute_levelset.fpp 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1157      +/-   ##
==========================================
+ Coverage   44.04%   44.94%   +0.90%     
==========================================
  Files          70       70              
  Lines       20499    20498       -1     
  Branches     1993     1943      -50     
==========================================
+ Hits         9028     9213     +185     
+ Misses      10330    10162     -168     
+ Partials     1141     1123      -18     

☔ View full report in Codecov by Sentry.

@github-actions

Claude Code Review

Head SHA: 3d6dea46696013cece198ea7042d409b6981e120
Files changed: 26
Key files: src/common/m_model.fpp, src/simulation/m_ib_patches.fpp, src/simulation/m_ibm.fpp, src/simulation/m_compute_levelset.fpp, src/common/include/parallel_macros.fpp, src/common/m_derived_types.fpp, src/simulation/m_mpi_proxy.fpp


Summary

  • GPU-accelerates STL IB marker generation and levelset compute, replacing interpolation-based distance search with exact triangle projection (significant correctness and performance improvement).
  • Lifts the IB patch count limit from 10 → 1000 and replaces MPI halo exchange of IB markers with per-rank independent computation into ghost cells.
  • Introduces periodicity encoding in the IB marker value to support periodic IBs and periodic ghost points in levelset subroutines.
  • Adds bounding-index narrowing to avoid looping over the full grid for each IB patch (O(patch footprint) instead of O(grid)).
  • Removes ~1000 lines of interpolation code; restructures s_instantiate_STL_models into m_model.fpp.

Findings

🔴 Critical

1. error stop inside $:GPU_PARALLEL_LOOP — will crash or silently fail on GPU builds
src/simulation/m_ibm.fpp, s_compute_image_points

The print statements are correctly guarded by #if !defined('MFC_OpenACC') && !defined('MFC_OpenMP'), but the error stop statement immediately after is outside that guard:

#endif
    error stop "Ghost Point and Image Point on Different Processors"   ! ← always present

This is inside a $:GPU_PARALLEL_LOOP. On any GPU build (--gpu acc or --gpu mp), this either causes a compile error (nvfortran/ifx reject error stop in device code) or silent GPU kernel abort. Additionally, error stop is forbidden by the coding rules — use call s_mpi_abort() (also guarded by the same #if).


🟠 High

2. Raw allocate in s_pack_model_for_gpu — memory leak, no @:DEALLOCATE pair
src/common/m_model.fpp, s_pack_model_for_gpu (new subroutine, near bottom of file)

allocate(ma%trs_v(1:3, 1:3, 1:ma%ntrs))
allocate(ma%trs_n(1:3, 1:ma%ntrs))

Per project rules, every @:ALLOCATE must have a matching @:DEALLOCATE in finalization. Here raw allocate is used (bypassing the GPU-aware macro) and there is no corresponding deallocation anywhere in the finalization path. These CPU-side flat arrays are also never freed. Use @:ALLOCATE/@:DEALLOCATE and add deallocation to the module finalization subroutine.

3. f_model_is_inside_flat references global p inside a $:GPU_ROUTINE
src/common/m_model.fpp, f_model_is_inside_flat

if (p == 0 .and. k == 0) cycle          ! line ~332
...
if (p == 0) then                         ! line ~353
    fraction = real(nInOrOut)/18._wp

p is a module-level global from m_global_parameters. For this GPU routine to be correct, p must be present on the device. The diff adds $:GPU_DECLARE(create='[x_domain, y_domain, z_domain]') but not p. If p is not already declared as device-resident in an existing GPU_DECLARE, the 2D/3D branch logic will silently use stale CPU data. Verify that p is on-device, or pass it as an explicit argument.

4. Removed MPI halo exchange (s_populate_ib_buffers) — correctness risk in multi-rank runs
src/simulation/m_mpi_proxy.fpp (−187 lines), src/simulation/m_ibm.fpp

The old code called s_mpi_sendrecv_ib_buffers to ensure ghost-cell IB markers were consistent across MPI ranks. The replacement strategy — each rank independently computes IB markers including into its own ghost cells via the extended loop bounds (il = -gp_layers-1, ...) — is correct in principle, provided ib_markers%sf is allocated over the full ghost-cell region (i.e., -buff_size:m+buff_size, not just 0:m). The type integer_field definition and its allocation bounds need confirming; if the array is only 0:m,..., the writes to negative-index cells are out-of-bounds.


🟡 Medium

5. s_check_boundary: GPU_ATOMIC used correctly, but edge_occurrence copy clause overwrites device updates
src/common/m_model.fpp, s_check_boundary

$:GPU_PARALLEL_LOOP(private='[i,j]', copy='[temp_boundary_v,edge_occurrence]', collapse=2)

The copy clause will copy edge_occurrence to device (all zeros) and back after the loop — this is correct. However, temp_boundary_v appears to be read-only inside the loop body. Using copyin instead of copy for temp_boundary_v would avoid unnecessary device→host copy overhead.

6. get_bounding_indices: second binary search reuses itr_left from first search without documenting the intent
src/simulation/m_ib_patches.fpp, get_bounding_indices (~line 795 in diff)

After the first loop sets left_index = itr_left, the second loop begins with itr_right = right_index but itr_left still holds its post-first-loop value. This is intentional (right bound ≥ left bound, so starting from the found left is valid), but it is subtle and has no comment. A correctness edge case: if right_bound < cell_centers(itr_left) the second loop terminates immediately with right_index = right_index (unchanged, full extent), which would be wrong. Please add an assertion or comment.

7. s_get_periodicities: uninitialized output when called with optional zp args while p == 0
src/simulation/m_ib_patches.fpp, s_get_periodicities

if (present(zp_lower) .and. p /= 0) then
    ...
end if

If zp_lower/zp_upper are present but p == 0, both remain uninitialized. The current caller in the 3D branch is only reached when p > 0, so this is safe today. Add an else; zp_lower = 0; zp_upper = 0; end if to make this unconditionally correct.


🔵 Minor / Nit

  • s_pack_model_for_gpu ends with end subroutine without a trailing name (convention in this codebase is end subroutine s_pack_model_for_gpu).
  • Docstring typo in f_model_is_inside_flat: "instide" → "inside", "perfentage" → "percentage".
  • s_ib_model loop order swapped: the GPU loop body has do i = il, ir / do j = jl, jr (i outermost), while all other 2D patches use do j / do i. Should be consistent to match collapse semantics and cache access patterns.
  • Floating-point literals 1e12 and -1e12 in s_ib_model/s_ib_3d_model should be 1.e12_wp and -1.e12_wp for precision safety (though they pass the precheck since they are real default literals, not d-exponent).

Review covers 26 changed files (+1662 / −1615 lines). IBM test suite is confirmed to pass on NVHPC GPU.

@github-actions

Claude Code Review

Head SHA: 1883b9b
Files changed: 26

Files

src/common/m_model.fpp, src/simulation/m_ib_patches.fpp, src/simulation/m_ibm.fpp, src/simulation/m_compute_levelset.fpp, src/simulation/m_mpi_proxy.fpp, src/common/include/parallel_macros.fpp, src/common/m_derived_types.fpp, src/common/m_constants.fpp, src/common/m_helper.fpp, src/simulation/m_global_parameters.fpp, src/simulation/m_data_output.fpp, src/simulation/m_time_steppers.fpp, src/pre_process/m_icpp_patches.fpp, +13 more (golden files, toolchain, docs, CI scripts)

Summary

  • Full GPU offload of STL IB marker generation and levelset compute via flat array packing (gpu_trs_v, gpu_trs_n) uploaded to device; replaces the old interpolation-based approach with exact projection-based distance search.
  • Ghost-point and image-point loops (s_find_num_ghost_points, s_find_ghost_points, s_compute_interpolation_coeffs) are now GPU-parallelized with atomic capture for unique index assignment.
  • IB marker halo MPI exchange (s_mpi_sendrecv_ib_buffers, s_populate_ib_buffers) is removed; each rank now independently fills its own ghost-cell region via extended bounding-box loops.
  • Periodic IB support extended to STL models; periodicity encoded in the marker integer and decoded to offset centroid at levelset compute time.
  • num_patches_max raised from 10 → 1000; x_domain/y_domain/z_domain added to GPU-resident data.

Findings

1. error stop violates project rule — must use call s_mpi_abort()

File: src/simulation/m_ibm.fpp (diff line ~3027)

Project rules forbid error stop (and stop) everywhere — use call s_mpi_abort() or @:PROHIBIT(). precheck will reject this. The change to a post-loop CPU check (replacing the in-loop error stop) is the right GPU-safe pattern; just change the final abort to call s_mpi_abort().


2. Memory leak in s_pack_model_for_gpu

File: src/common/m_model.fpp (new subroutine near end of file)

ma%trs_v and ma%trs_n are allocated here but never freed — neither in this subroutine nor after the flat GPU arrays are populated in s_instantiate_STL_models. After the GPU upload loop these per-model staging buffers are unused. They should be deallocated (with plain deallocate since these are CPU staging arrays, not GPU-resident) to avoid memory pressure proportional to the total number of STL triangles × number of IBs.


3. Cylindrical-geometry STL levelset silently dropped

File: src/simulation/m_compute_levelset.fpp (diff lines ~664–667)

The old code in s_model_levelset handled cylindrical grid coordinates before the distance query; that handling is removed. STL models in cylindrical domains (grid_geometry == 3) will now compute distances in the wrong coordinate system. If this combination was never supported/tested, a @:PROHIBIT(patch_ib(patch_id)%geometry == 5 .or. patch_ib(patch_id)%geometry == 12, "STL IBs are not supported in cylindrical geometry.") guard would prevent silent incorrect results.


4. s_check_boundary GPU loop over locally-allocated CPU arrays — verify memory safety

File: src/common/m_model.fpp (diff line ~446)

temp_boundary_v and edge_occurrence are local allocatables. OpenACC's copy clause on local allocatables is handled correctly by nvfortran but can be problematic with Cray ftn. Since this subroutine is also called from pre_process (CPU only), the macro correctly expands to a no-op there. However, the edge-occurrence parallelism is O(N²) in boundary edges; for large 2D models this may still be slow on GPU. The $:GPU_ATOMIC(atomic='update') on edge_occurrence(i) is correct since multiple threads share the same i bin.


5. END_GPU_ATOMIC_CAPTURE is correct but the capture block spans two statements — verify compiler support

File: src/common/include/parallel_macros.fpp and src/simulation/m_ibm.fpp

OpenACC requires !$acc atomic capture / !$acc end atomic to wrap exactly the increment and read as a two-statement block. The macro expansion for OpenACC and OpenMP target offload appears correct. This pattern works on NVIDIA nvfortran; please confirm it compiles and produces correct results with Cray ftn (OpenMP path) before merge, as some versions of CCE have had issues with capture atomics on two-statement blocks.


Improvement Opportunities

  • stl_bounding_boxes is allocated and populated in s_instantiate_STL_models but does not appear in any GPU kernel or IB loop visible in the diff. If it is used downstream in a per-patch bounding-index optimization, the intent should be documented; if it is not yet used, it could be deferred to avoid unnecessary allocation.

  • x_domain/y_domain/z_domain are now GPU-declared in m_global_parameters.fpp, but there is no corresponding GPU_UPDATE(device=...) at initialization. Confirm these structs are populated before the first GPU kernel that reads them (likely fine since they are set once at startup before any GPU work, but worth an explicit GPU_UPDATE call for safety).

@github-actions

Claude Code Review

Head SHA: 5b1aa96147fdec20dc1c53fe395a68bfae6f064d

Files changed: 26

File list
  • .github/workflows/frontier/build.sh
  • .github/workflows/frontier_amd/build.sh
  • docs/documentation/case.md
  • src/common/include/parallel_macros.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_helper.fpp
  • src/common/m_model.fpp
  • src/pre_process/m_icpp_patches.fpp
  • src/simulation/m_compute_levelset.fpp (+ 9 more)

Summary

  • Replaces interpolation-based STL distance search with exact projection-based nearest-point computation, eliminating ~1000 lines of vertex-generation code and enabling full GPU offload.
  • Packs STL triangle data into flat 4D arrays (gpu_trs_v, gpu_trs_n, gpu_boundary_v) so IB marker and levelset routines run inside GPU_PARALLEL_LOOP regions.
  • Adds a periodicity encoding/decoding scheme so periodic copies of each IB are swept simultaneously in the marker loop instead of via MPI exchange.
  • Increases num_ibs limit to 1000 (Python toolchain) and raises num_patches_max to 1000 (Fortran constant).
  • Parallelises s_find_ghost_points, s_find_num_ghost_points, and s_compute_image_points onto the GPU.
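The review does not show the actual periodicity encoding, but one plausible scheme packs the patch id together with a per-axis periodic shift (-1, 0, +1) into a single integer marker, e.g. three base-3 digits stacked above the patch id. The sketch below is purely illustrative — the names, layout, and num_ibs bound are assumptions, not the PR's code:

```python
def encode_marker(patch_id, shifts, num_ibs=1000):
    # Pack three per-axis shifts (-1, 0, +1) as base-3 digits,
    # then stack that code above the patch id.
    code = 0
    for s in shifts:
        code = code * 3 + (s + 1)
    return code * (num_ibs + 1) + patch_id

def decode_marker(marker, num_ibs=1000):
    # Invert encode_marker: peel off the patch id, then the shifts.
    patch_id = marker % (num_ibs + 1)
    code = marker // (num_ibs + 1)
    shifts = []
    for _ in range(3):
        shifts.append(code % 3 - 1)
        code //= 3
    shifts.reverse()
    return patch_id, shifts
```

The appeal of such a packing is that the marker field stays a single integer array on the GPU, and the levelset routine can decode the shift to offset the patch centroid without any extra storage or MPI traffic.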

Findings

1. Forbidden error stop — CRITICAL

File: src/simulation/m_ibm.fpp (diff line ~3027)

error stop is explicitly forbidden (CLAUDE.md, fortran-conventions.md). Must be replaced with call s_mpi_abort('Ghost Point and Image Point on Different Processors.'). Using error stop in an MPI simulation leaves other ranks hanging without finalizing MPI, which corrupts job schedulers and output files.


2. Named exit from an inner loop inside a GPU parallel region — HIGH

File: src/simulation/m_ibm.fpp (both s_find_ghost_points and s_find_num_ghost_points)

A named exit from a multi-level inner loop construct is valid Fortran and works on NVIDIA nvfortran, but Cray CCE and Intel ifx have known limitations with named construct exits inside target regions. This will compile but may silently execute without the early-exit optimisation on Cray/AMD flang, or fail to compile at all. Recommend replacing with the established idiom of a boolean flag and if (.not. found) guards on each level, which is portable across all four CI-gated compilers.
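The recommended flag-based idiom can be sketched language-neutrally in Python (hypothetical find_marker helper; the actual fix would use Fortran logicals with cycle/exit guards at each loop level):

```python
def find_marker(grid, target):
    """Early-exit search over nested loops using a boolean flag
    instead of a named multi-level exit (portability sketch)."""
    found = False
    hit = None
    for i in range(len(grid)):
        if found:          # guard on the outer level replaces the named exit
            break
        for j in range(len(grid[i])):
            if grid[i][j] == target:
                hit = (i, j)
                found = True
                break      # plain single-level exit is portable everywhere
    return hit
```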


3. num_patches_max raised to 1000 inflates IC patch arrays in all three executables — MEDIUM

File: src/common/m_constants.fpp line 26

num_patches_max controls the size of patch_icpp(1:num_patches_max) arrays in pre_process, simulation, and post_process. These are initial-condition patch arrays, not IB patch arrays. The IB count was correctly bumped via NI = 1000 in definitions.py and num_ibs max: 1000 in case_validator.py.

Unless patch_icpp arrays are actually dimensioned by num_patches_max and expected to hold 1000 entries, this 100× increase is unintentional bloat in all three executables. Please verify the intended scope: if this constant only guards IB patches, rename it num_ibs_max or use the existing num_ibs runtime value.


4. Unused leftover variables in m_model.fpp — MEDIUM

File: src/common/m_model.fpp (diff line ~243)

These are leftovers from a commented-out fibonacci-sphere ray-direction generator. They are never assigned or read, producing compiler warnings on gfortran/ifx (and potential errors with -Werror or strict flags). Remove them.


5. end subroutine without name for s_pack_model_for_gpu — MINOR

File: src/common/m_model.fpp (diff line ~1332)

All other subroutines in the file use end subroutine <name>. This should be end subroutine s_pack_model_for_gpu to satisfy the convention and improve readability.


6. Leaves zp_lower/zp_upper uninitialised when p == 0 and z args are passed — MINOR

File: src/simulation/m_ib_patches.fpp

In practice the 3D call path is guarded by if (p > 0) so this cannot trigger today, but defensive initialisation (zp_lower = 0; zp_upper = 0 before the guard) would prevent a latent bug if the call sites change.


Positive observations

  • The projection-based distance/normal algorithm (s_distance_normals_3D, s_distance_normals_2D) is mathematically clean and handles the edge/vertex fallback correctly.
  • The periodicity encoding scheme (base-num_ibs+1 mixed-radix encoding for 3×3×3 periodicity combinations) is numerically correct and efficiently decoded in the GPU routine s_decode_patch_periodicity.
  • Bounding-index binary search (get_bounding_indices) correctly narrows the loop range per patch, which is the primary driver of the reported ~1000× speedup.
  • Adding model_threshold, model_spc, model_translate, model_scale, and model_filepath to the MPI broadcast list in m_mpi_proxy.fpp fixes a pre-existing multi-rank correctness gap.
  • The frontier build.sh fix (selecting GPU vs CPU module mode based on job_device) is correct.
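The mixed-radix periodicity encoding praised above can be illustrated with a round-trip sketch (Python; the digit layout here — low digit for the patch id, high digits for the 3×3×3 shift combination — is an assumption about the PR's exact scheme):

```python
def encode(patch_id, sx, sy, sz, num_ibs):
    """Pack a patch id plus a periodic shift (sx, sy, sz in {-1, 0, 1})
    into one integer, base-(num_ibs + 1) mixed radix."""
    combo = (sx + 1) + 3 * (sy + 1) + 9 * (sz + 1)   # 0..26
    return patch_id + (num_ibs + 1) * combo

def decode(code, num_ibs):
    """Recover (patch_id, sx, sy, sz) from the encoded marker."""
    patch_id = code % (num_ibs + 1)
    combo = code // (num_ibs + 1)
    return patch_id, combo % 3 - 1, (combo // 3) % 3 - 1, combo // 9 - 1
```

Because patch ids never exceed num_ibs, the low digit is unambiguous, so a single integer marker carries both the IB identity and which periodic image produced it.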

🤖 Generated with Claude Code

@github-actions

Claude Code Review

Head SHA: 53ebfb7
Files changed: 30

Changed files
  • src/common/m_model.fpp (+475/-487)
  • src/simulation/m_ib_patches.fpp (+540/-304)
  • src/simulation/m_ibm.fpp (+271/-281)
  • src/simulation/m_mpi_proxy.fpp (+5/-187)
  • src/simulation/m_compute_levelset.fpp (+40/-86)
  • src/common/m_derived_types.fpp, m_constants.fpp, m_helper.fpp
  • src/pre_process/m_icpp_patches.fpp, src/simulation/m_global_parameters.fpp, etc.
  • tests golden files, toolchain/cases.py, docs

Summary

  • GPU-accelerates STL IB marker generation and levelset compute by packing triangle/boundary data into flat device-resident arrays (gpu_trs_v, gpu_trs_n, etc.)
  • Replaces interpolation-based distance search with exact projection-based nearest-point algorithm (3D: triangle projection + edge/vertex fallback; 2D: edge projection)
  • Adds per-patch bounding-box culling in all IB marker subroutines to limit GPU work to relevant grid cells
  • Extends num_patches_max from 10 → 1000 and adds periodic IB support via encoded patch IDs
  • Removes s_populate_ib_buffers MPI exchange (now redundant because each rank independently computes markers including ghost-cell-extended bounds)
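The per-patch bounding-box culling mentioned in the summary amounts to a pair of binary searches over the sorted cell-center arrays; a Python sketch (hypothetical signature, mirroring the role of get_bounding_indices):

```python
import bisect

def get_bounding_indices(lo, hi, cell_centers):
    """Return first/last indices of cells whose centers lie in [lo, hi].

    cell_centers is assumed sorted ascending (like x_cc/y_cc/z_cc).
    A sketch of the culling idea, not the PR's exact routine.
    """
    left = bisect.bisect_left(cell_centers, lo)
    right = bisect.bisect_right(cell_centers, hi) - 1
    return left, right
```

The marker loop then runs only over left..right per axis instead of the whole grid, which is where the large per-patch speedup comes from.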

Findings

1. Global p used inside GPU_ROUTINE without guaranteed device residency — f_model_is_inside_flat, ~line 332 of diff

if (p == 0 .and. k == 0) cycle
...
if (p == 0) then
    fraction = real(nInOrOut)/18._wp
else
    fraction = real(nInOrOut)/26._wp
end if

p is a global from m_global_parameters used inside $:GPU_ROUTINE(parallelism='[seq]'). If p has not been uploaded to the device (it is not in the newly added GPU_DECLARE list), GPU execution will silently read a stale or zero value. Verify that p (and m, n) are in an existing GPU_DECLARE before this PR, or add it here.


2. Raw allocate without @:ALLOCATE — memory leak and missing GPU mirroring — s_pack_model_for_gpu (~line 1318) and s_instantiate_STL_models

allocate(ma%trs_v(1:3, 1:3, 1:ma%ntrs))
allocate(ma%trs_n(1:3, 1:ma%ntrs))

and

allocate(stl_bounding_boxes(num_ibs, 1:3, 1:3))

Per project rules, every @:ALLOCATE must have a matching @:DEALLOCATE. Using raw allocate here skips the GPU_ENTER_DATA that the macro provides and leaves no pairing for finalization. The models(:) array itself is also allocated with plain allocate. stl_bounding_boxes and the trs_v/trs_n fields are allocated every call to s_ibm_setup (for moving IBs) but never freed.


3. Precision violation: bare 1e12/-1e12 literals — s_ib_model and s_ib_3d_model

bbox_min = 1e12
bbox_max = -1e12

These are default-real (likely single-precision) literals. Should be 1.e12_wp and -1.e12_wp per the precision rules. The precheck lint step will flag this.


4. Cylindrical grid conversion silently removed from STL levelset — s_model_levelset

-        if (grid_geometry == 3) then
-            xyz_local = f_convert_cyl_to_cart(xyz_local)
-        end if

This block was removed with no comment or guard. Any STL IB used with grid_geometry == 3 (cylindrical) will now apply the distance/normal calculation in the wrong coordinate space. If cylindrical STL IBs are intentionally unsupported, a @:PROHIBIT(grid_geometry == 3 .and. (patch_ib(patch_id)%geometry == 5 .or. ...)) check should guard against that.


5. 3D airfoil bounding-box z-search uses chord length instead of span — s_ib_3D_airfoil

call get_bounding_indices(center(3) - ca_in, center(3) + ca_in, z_cc, ll, lr)

ca_in is the chord length. The z-extent of a 3D airfoil is lz (span), not the chord. If lz > 2*ca_in, cells at the ends of the airfoil span will be missed by the marker loop, leaving the IB unmarked. Should be lz/2 (or z_max/z_min already computed just above in that subroutine).


6. Unused variable after interpolation removal — line ~1345

logical :: interpolate !< Logical variable to determine whether or not the model should be interpolated

All uses of interpolate were removed but the declaration was not. This will produce a compiler warning and may fail Intel ifx's pedantic mode.


7. Potential division by zero in s_distance_normals_2D

In the t < 0 branch:

dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm = norm/dist   ! <-- div-by-zero if point == v1

and the t > 1 branch likewise for v2. If point exactly equals a vertex, dist == 0 and the division produces NaN/Inf. A guard if (dist > 0._wp) norm = norm/dist is needed.
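The suggested guard can be checked with a small sketch (Python; illustrative of the fix, not the PR's Fortran):

```python
import math

def edge_distance_normal(point, v1, v2):
    """Nearest-point projection onto segment v1-v2 with a guard for
    the degenerate case where point coincides with the closest point."""
    ex, ey = v2[0] - v1[0], v2[1] - v1[1]
    t = ((point[0] - v1[0]) * ex + (point[1] - v1[1]) * ey) / (ex * ex + ey * ey)
    t = max(0.0, min(1.0, t))              # clamp covers the t < 0 / t > 1 branches
    cx, cy = v1[0] + t * ex, v1[1] + t * ey
    nx, ny = point[0] - cx, point[1] - cy
    dist = math.hypot(nx, ny)
    if dist > 0.0:                          # guard: avoids 0/0 -> NaN at a vertex
        nx, ny = nx / dist, ny / dist
    return dist, (nx, ny)
```

With the guard, a point sitting exactly on a vertex returns distance 0 and a zero normal instead of NaN/Inf.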


Minor / Suggestions

  • One newly added item does not appear to be used anywhere in this PR — it can be deferred until it is actually needed.
  • f_model_random_number has a commented-out GPU_ROUTINE directive — either enable it or remove the comment to avoid confusion for future readers.
  • One subroutine has optional intent(out) arguments zp_lower/zp_upper; if they are present but p == 0, those output values are never set (the caller would read uninitialized data). The 3D call-sites always pass them alongside a p > 0 guard, so this is safe today, but it is worth a comment or guard in the subroutine body.

@github-actions

github-actions bot commented Mar 1, 2026

Claude Code Review

Head SHA: 03e65ed

Files changed: 36

Key files:

  • src/common/m_model.fpp (+475/-487) — GPU-friendly flat arrays, new distance/normal routines
  • src/simulation/m_ib_patches.fpp (+540/-304) — periodicity encoding, bounding-index optimisation
  • src/simulation/m_ibm.fpp (+271/-281) — ghost point parallelisation
  • src/simulation/m_compute_levelset.fpp (+40/-86) — STL levelset on GPU
  • src/simulation/m_mpi_proxy.fpp (+5/-187) — MPI IB buffer exchange removed
  • src/common/include/parallel_macros.fpp — new END_GPU_ATOMIC_CAPTURE macro
  • toolchain/mfc/test/cases.py (+213/-176)

Summary

  • GPU-parallelises STL IB marker generation and levelset compute by packing triangle/edge data into flat GPU arrays (gpu_trs_v, gpu_trs_n, gpu_boundary_v).
  • Replaces interpolation-based distance search with exact projection onto triangle/edge primitives, cutting preprocessing time significantly.
  • Unifies distance + normal computation into single pass (s_distance_normals_3D/2D).
  • Raises num_patches_max from 10 → 1000 and implements a periodicity-encoding scheme in ib_markers so periodic images are handled without MPI halo exchange.
  • Removes the MPI s_mpi_sendrecv_ib_buffers path entirely; each rank now independently marks its own ghost-cell buffer region.
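The projection-based search named above is, at its core, the standard closest-point-on-triangle computation; a Python sketch of the barycentric-region form (following Ericson's Real-Time Collision Detection, not necessarily the PR's exact formulation):

```python
def _sub(u, v):  return (u[0] - v[0], u[1] - v[1], u[2] - v[2])
def _dot(u, v):  return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]
def _axpy(p, s, d):  return (p[0] + s * d[0], p[1] + s * d[1], p[2] + s * d[2])

def closest_point_on_triangle(p, a, b, c):
    """Try the face projection; fall back to edges/vertices via
    barycentric region tests (no interpolation needed -> exact result)."""
    ab, ac, ap = _sub(b, a), _sub(c, a), _sub(p, a)
    d1, d2 = _dot(ab, ap), _dot(ac, ap)
    if d1 <= 0 and d2 <= 0: return a                      # vertex a region
    bp = _sub(p, b)
    d3, d4 = _dot(ab, bp), _dot(ac, bp)
    if d3 >= 0 and d4 <= d3: return b                     # vertex b region
    vc = d1 * d4 - d3 * d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:                   # edge ab region
        return _axpy(a, d1 / (d1 - d3), ab)
    cp = _sub(p, c)
    d5, d6 = _dot(ab, cp), _dot(ac, cp)
    if d6 >= 0 and d5 <= d6: return c                     # vertex c region
    vb = d5 * d2 - d1 * d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:                   # edge ac region
        return _axpy(a, d2 / (d2 - d6), ac)
    va = d3 * d6 - d5 * d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:     # edge bc region
        return _axpy(b, (d4 - d3) / ((d4 - d3) + (d5 - d6)), _sub(c, b))
    denom = va + vb + vc                                  # interior: face projection
    return _axpy(_axpy(a, vb / denom, ab), vc / denom, ac)
```

The unsigned levelset value is then the distance from p to the returned point, with the sign taken from the stored triangle normal — one pass yields both distance and normal.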

Findings

1. Forbidden — must fix (CLAUDE.md rule)

File: src/simulation/m_ibm.fpp, s_compute_image_points

error stop is explicitly forbidden. Replace with call s_mpi_abort() (or @:PROHIBIT(bounds_error, "...")). The previous version also used error stop (removed in this PR), but the replacement still violates the rule.


2. Missing @:DEALLOCATE for GPU arrays — must fix

File: src/common/m_model.fpp, s_instantiate_STL_models

Six arrays are allocated with @:ALLOCATE:
gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_v, gpu_boundary_edge_count, gpu_total_vertices

No matching @:DEALLOCATE is visible anywhere in the PR (neither in a module finalizer nor in s_model_free). Per CLAUDE.md: Every @:ALLOCATE MUST have a matching @:DEALLOCATE.

Similarly, stl_bounding_boxes is allocated with raw allocate() (not @:ALLOCATE) and is never freed.


3. Raw allocate() in s_pack_model_for_gpu without deallocation — must fix

File: src/common/m_model.fpp, s_pack_model_for_gpu

Raw allocate instead of @:ALLOCATE. No matching deallocation visible. These arrays are intermediate host buffers packed into gpu_trs_v/gpu_trs_n, so GPU upload is correct, but the host-side memory leaks.


4. s_check_boundary GPU_PARALLEL_LOOP correctness concern

File: src/common/m_model.fpp, s_check_boundary

The copy clause on temp_boundary_v copies the full boundary vertex array to/from device for every call — acceptable for preprocessing but worth noting for large meshes. More importantly, verify that the GPU_ATOMIC(atomic='update') wraps only the increment expression (no statement before/after inside the atomic scope), as OpenACC/OpenMP atomic update requires a single assignment statement.


5. t_model_array%interpolate changed from logical to integer without comment

File: src/common/m_derived_types.fpp

The field type changed silently. If any code outside this PR still treats it as logical, this is a latent type bug. Confirm no remaining callers use it as logical, or add a comment explaining the change.


6. Removed MPI IB marker halo exchange — needs explicit justification in comments

File: src/simulation/m_mpi_proxy.fpp (−182 lines), src/simulation/m_ibm.fpp

s_populate_ib_buffers / s_mpi_sendrecv_ib_buffers was the mechanism that exchanged ib_markers across MPI subdomain boundaries. It is removed without a comment explaining why it is safe to do so. The justification (each rank now marks [-gp_layers-1, m+gp_layers+1] independently using global IB geometry) should be documented explicitly, as future contributors may otherwise reintroduce it.


7. Missing x_domain, y_domain, z_domain GPU declaration before they existed

File: src/simulation/m_global_parameters.fpp

$:GPU_DECLARE(create='[x_domain, y_domain, z_domain]') is added. These are type(bounds_info) scalars. Confirm that bounds_info is a plain derived type with no pointer/allocatable components, so it can be declared on device safely across all four CI compilers (gfortran/nvfortran/Cray/ifx).


Minor notes (no blocking issues)

  • f_model_random_number has ! $:GPU_ROUTINE(...) commented out — this is intentional (only called from CPU path in f_model_is_inside). Consider a short comment explaining why.
  • The END_GPU_ATOMIC_CAPTURE macro in parallel_macros.fpp is used only in s_find_ghost_points; it follows existing patterns and should pass the precheck lint.
  • Frontier/AMD build scripts: the load -m g → conditional fix is correct and unrelated to the IBM changes.

@github-actions

github-actions bot commented Mar 1, 2026

Claude Code Review

Head SHA: 2ace4c0
Files changed: 36 | Key source files: m_ibm.fpp, m_ib_patches.fpp, m_model.fpp, m_compute_levelset.fpp, m_mpi_proxy.fpp, parallel_macros.fpp, m_constants.fpp

Summary

  • Fully GPU-offloads STL IB marker generation and levelset compute — a major architecture improvement
  • Replaces interpolation-based distance search with projection-based nearest-point search, reducing search vertices ~50× and providing exact levelset values
  • Raises num_patches_max 10→1000, enabling 1000-particle MIBM runs
  • Removes ~1000 lines of MPI-based IB marker exchange in favor of on-device compute
  • New periodic IB boundary support encoded via signed ib_markers%sf values

Findings

🔴 Critical — Must Fix

1. error stop is forbidden — src/simulation/m_ibm.fpp, s_compute_image_points

if (bounds_error) error stop "Ghost Point and Image Point on Different Processors. Exiting"

error stop is explicitly forbidden (CLAUDE.md). It does not guarantee clean MPI finalization and will hang GPU jobs. Replace with:

@:PROHIBIT(bounds_error, "Ghost Point and Image Point on Different Processors.")

or call s_mpi_abort().


2. @:ALLOCATE without matching @:DEALLOCATE — src/common/m_model.fpp, src/simulation/m_ibm.fpp

s_instantiate_STL_models allocates (via @:ALLOCATE, which also calls GPU_ENTER_DATA) the following module-level arrays, but s_finalize_ibm_module has no matching @:DEALLOCATE for any of them:

  • gpu_ntrs(1:num_ibs)
  • gpu_trs_v(1:3, 1:3, 1:max_ntrs, 1:num_ibs)
  • gpu_trs_n(1:3, 1:max_ntrs, 1:num_ibs)
  • gpu_boundary_edge_count(1:num_ibs)
  • gpu_total_vertices(1:num_ibs)
  • gpu_boundary_v(...) (conditionally allocated)

Additionally, stl_bounding_boxes and the ma%trs_v / ma%trs_n arrays inside s_pack_model_for_gpu are allocated via plain allocate (bypassing @:ALLOCATE) and are never freed. GPU device memory for the global arrays is permanently leaked every run.


🟠 High — Should Fix

3. 1e12 / -1e12 literals missing _wp kind — src/simulation/m_ib_patches.fpp

Four occurrences in s_ib_model (2D) and s_ib_3d_model (3D):

bbox_min = 1e12    ! should be 1.e12_wp
bbox_max = -1e12   ! should be -1.e12_wp

Default-kind float literals violate precision discipline. Also in src/common/m_model.fpp, f_model_is_inside_flat:

fraction = real(nInOrOut)/18._wp   ! real(nInOrOut) → default real; should be real(nInOrOut, wp)
fraction = real(nInOrOut)/26._wp

4. Cylinder bounding box uses full axial length instead of half-length — src/simulation/m_ib_patches.fpp, s_ib_cylinder

corner_distance = sqrt(radius**2 + maxval(length)**2)

maxval(length) is the full cylinder length L, but the distance from the centroid to the rim is sqrt(r² + (L/2)²). Using L instead of L/2 doubles the axial bounding box extent, causing GPU_PARALLEL_LOOP to iterate over up to 4× more cells than necessary — directly harming the per-patch performance that this PR sets out to improve.

Fix: corner_distance = sqrt(radius**2 + (maxval(length)/2._wp)**2)
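A quick numeric check of the half-length fix (Python, example values):

```python
import math

radius, length = 1.0, 4.0  # example cylinder: r = 1, full axial length L = 4

# current code: uses the full length L in the corner distance
wrong = math.sqrt(radius**2 + length**2)
# centroid-to-rim distance uses the half-length L/2
right = math.sqrt(radius**2 + (length / 2.0)**2)
```

For a long cylinder the erroneous radius is nearly double the correct one, so the bounding box sweeps far more cells than the patch actually touches.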


5. num_patches_max = 1000 creates O(N²) static memory — src/common/m_constants.fpp

ic_patch_parameters contains logical, dimension(0:num_patches_max-1) :: alter_patch. With num_patches_max = 1000, the static module array patch_icpp(1:1000) in pre_process has 1000 structs each with a 1000-element logical array, consuming ~4 MB of static data. All startup init loops over num_patches_max are now 100× more iterations. Please confirm this O(N²) memory growth is acceptable or address with dynamic allocation.
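The ~4 MB figure follows from a one-line estimate (Python; the 4-byte default LOGICAL size is a typical-compiler assumption):

```python
num_patches_max = 1000
bytes_per_logical = 4  # typical default LOGICAL storage size (assumption)

# patch_icpp(1:1000), each struct carrying alter_patch(0:999)
alter_patch_bytes = num_patches_max * num_patches_max * bytes_per_logical
# 4_000_000 bytes, roughly 4 MB of static data per executable
```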


6. Data race on ib_markers%sf in s_find_ghost_points GPU_PARALLEL_LOOP — src/simulation/m_ibm.fpp

The GPU parallel loop simultaneously reads ib_markers%sf(ii, jj, kk) (neighbors) and writes ib_markers%sf(i, j, k) = patch_id. The read checks == 0 (fluid cell) while both encoded_patch_id and patch_id are nonzero for IB cells, so the race appears benign in practice — but it is undocumented and fragile. A comment explaining why the race is safe (or a two-pass implementation) would prevent future regressions.


🟡 Medium

7. Unguarded print * prints from all MPI ranks — src/common/m_model.fpp, s_instantiate_STL_models

print *, " * Reading model: "//trim(patch_ib(patch_id)%model_filepath)

Not guarded by if (proc_rank == 0). On a 1000-rank run with STL patches, this floods stdout. Other prints nearby are correctly guarded. Add the rank guard.


🔵 Low / Informational

8. END_GPU_ATOMIC_CAPTURE new macro — src/common/include/parallel_macros.fpp

The new END_GPU_ATOMIC_CAPTURE() macro correctly provides the end atomic terminator required by OpenACC/OpenMP atomic capture blocks. The usage in s_find_ghost_points (incrementing count and capturing local_idx) is semantically correct and GPU-safe.
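The capture semantics can be mimicked on the host with a lock (Python sketch; the lock stands in for the hardware atomic, and claim_slot is a hypothetical stand-in for the ghost-point compaction):

```python
import threading

count = 0                       # shared counter, as in s_find_ghost_points
lock = threading.Lock()
slots = [None] * 64

def claim_slot(gid):
    """Atomically capture the old counter value as this thread's private
    slot index, then increment — the atomic-capture compaction pattern."""
    global count
    with lock:                  # emulates the atomic capture block
        local_idx = count       # capture the old value...
        count += 1              # ...then increment, as one indivisible step
    slots[local_idx] = gid      # each thread writes a unique slot

threads = [threading.Thread(target=claim_slot, args=(g,)) for g in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the read and the increment happen as one step, no two threads can receive the same slot index, which is exactly why the capture form (rather than a plain atomic update) is needed here.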

9. get_bounding_indices naming convention — src/simulation/m_ib_patches.fpp

Module-private subroutine lacks the s_ prefix per project convention. Minor.

10. Frontier/Frontier-AMD build.sh CPU mode fix — .github/workflows/frontier*/build.sh

The change from hardcoded -m g to -m $([ "$job_device" = "gpu" ] && echo "g" || echo "c") is correct and fixes CPU-only CI builds on Frontier.


Required Before Merge

  • Replace error stop with @:PROHIBIT() or call s_mpi_abort()
  • Add @:DEALLOCATE for all new GPU arrays in s_finalize_ibm_module
  • Fix 1e12/real(nInOrOut) precision literals
  • Fix cylinder half-length bounding box
  • Guard print * with proc_rank == 0


Labels

size:XXL This PR changes 1000+ lines, ignoring generated files
