What happened?
Running ectrans on GPU with default arguments works fine.
But as soon as we scale up, we run into memory-transfer problems, with the following error:
ACC: ERROR non contiguous transfer from ../../../pfs/lustrep2/scratch/project_462000713/ectrans-main-acc/src/ectrans-HEAD^1.6.0.src/build/src/trans/gpu/generated/ectrans_gpu_dp/internal/ltinv_mod.F90:366
This was the error message for double precision, but we observe the same behavior for single precision:
ACC: ERROR non contiguous transfer from ../../../pfs/lustrep2/scratch/project_462000713/ectrans-main-acc/src/ectrans-HEAD^1.6.0.src/build/src/trans/gpu/generated/ectrans_gpu_sp/internal/ltinv_mod.F90:366
This line corresponds to the following call to PRFI1B:
364 IF(PRESENT(PSPSC3A) .AND. NF_SC3A > 0) THEN
365 DO J3=1,UBOUND(PSPSC3A,3)
366 CALL PRFI1B(PSCALARS(IFIRST:IFIRST+2*NF_SC3A-1,:,:),PSPSC3A(:,:,J3),NF_SC3A,UBOUND(PSPSC3A,2))
367 IFIRST = IFIRST+2*NF_SC3A
368 ENDDO
369 ENDIF
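The error is consistent with the array section passed as the first argument: PSCALARS(IFIRST:IFIRST+2*NF_SC3A-1,:,:) takes a partial range in the *first* dimension, and because Fortran arrays are column-major, such a section is strided rather than contiguous in memory, which the OpenACC runtime refuses to move in a single transfer. A minimal sketch of the column-major offset arithmetic (our illustration; the array shape is made up) shows the difference:

```python
# Column-major (Fortran-order) linear offset for a (NI, NJ, NK) array:
# the first index varies fastest in memory.
NI, NJ, NK = 8, 4, 3

def offset(i, j, k):
    return i + NI * (j + NJ * k)

# Full first dimension, one trailing slab: offsets are consecutive.
slab = [offset(i, j, 0) for j in range(NJ) for i in range(NI)]
print(slab == list(range(NI * NJ)))   # True: contiguous

# Partial first-dimension range (like IFIRST:IFIRST+2*NF_SC3A-1, :, :):
# offsets jump at every column boundary, so the section is strided.
part = [offset(i, j, k) for k in range(NK) for j in range(NJ)
        for i in range(2, 6)]
gaps = [b - a for a, b in zip(part, part[1:])]
print(all(g == 1 for g in gaps))      # False: non-contiguous
```

A common workaround for this class of error (not verified against this code) is to copy such a section into a contiguous temporary before the device transfer.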
What are the steps to reproduce the bug?
Environment
We are running on the LUMI dev-g partition using 1 node, and load the following modules before installation and execution:
module load LUMI/24.03 partition/G PrgEnv-cray
module load cpe/24.03 craype-x86-trento craype-accel-amd-gfx90a
module load cray-mpich cray-libsci cray-fftw cray-python
module load buildtools
module load rocm/6.0.3
module load cce/17.0.1
These differ from the modules we found in the GitHub Actions workflow, so we also tried those, but unfortunately both sets led to the same error.
The ones from GitHub actions for reference:
module load CrayEnv
module load PrgEnv-cray
module load cce/17.0.1
module load craype-accel-amd-gfx90a
module load rocm/6.0.3
module load cray-fftw
module load buildtools
We also set the following environment variables:
export FC=ftn
export FC90=ftn
export CXX=CC
export CC=cc
Installation
The subsections below contain the installation instructions for each module.
Note that the environment variables for linking OpenMP (LDFLAGS) are set separately for each module.
Ecbuild
Version: 3.8.5
Installation through git pull, followed by:
mkdir -p build
cd build
cmake ..
make clean
make -j16
make install
Fiat
Version: 1.4.1
Installation through git pull followed by:
export LDFLAGS="-fopenmp"
mkdir -p build
cd build
$DNB_INSTALL_DIR/ecbuild.bin/bin/ecbuild \
-DCMAKE_BUILD_TYPE=Release \
-DENABLE_MPI=ON \
-DENABLE_OMP=ON \
-DENABLE_TESTS=OFF \
..
make clean
make -j16
make install
Ectrans
Version: 1.6.1
Installation through git pull followed by:
export LDFLAGS="-fopenmp -lcraymp"
mkdir -p build
cd build
$DNB_INSTALL_DIR/ecbuild.bin/bin/ecbuild \
-DCMAKE_BUILD_TYPE=Release \
-Dfiat_ROOT=$DNB_INSTALL_DIR/fiat.bin \
-DENABLE_TESTS=OFF \
-DENABLE_SINGLE_PRECISION=ON \
-DENABLE_MKL=OFF \
-DENABLE_FFTW=OFF \
-DENABLE_OMP=OFF \
-DENABLE_ACC=ON \
-DENABLE_GPU_AWARE_MPI=ON \
..
make clean
make -j16
make install
Execution
These are examples of our executions using double precision; the behavior is the same for single precision.
All executions used 1 node, and the environment described above is loaded beforehand.
When we run with default arguments, the following line is executed:
numactl -l --all --physcpubind=49-55 -- ./bin/ectrans-benchmark-gpu-dp --norms --nlev 137 --vordiv --scders -t 600 --niter 20
The numactl invocation is required on LUMI for optimal CPU bindings.
For our "scaled-up" executions, we run the following:
numactl -l --all --physcpubind=49-55 -- ./bin/ectrans-benchmark-gpu-dp --vordiv --scders --uvders --nfld 1 --norms --niter 10 --nlev 79 --truncation 1279
Version
v1.6.0
Platform (OS and architecture)
LUMI, Cray-OS, running kernel 5.14.21-150500.55.49_13.0.56-cray_shasta_c
Relevant log output
ACC: ERROR non contiguous transfer from ../../../pfs/lustrep2/scratch/project_462000713/ectrans-main-acc/src/ectrans-HEAD^1.6.0.src/build/src/trans/gpu/generated/ectrans_gpu_dp/internal/ltinv_mod.F90:366
Accompanying data
No response
Organisation
Barcelona Supercomputing Center