
Add Subgroup/Warp Shuffle Support to OpenCL and PTX Backends #812

@mikepapadim

Description


Summary

TornadoVM currently exposes SIMD subgroup operations through KernelContext, enabling efficient warp-level reductions using Metal's simd_sum and simd_shuffle_down.

Example (already working in the Metal backend, #796):

float val = input.get(globalIdx);
val += ctx.simdShuffleDown(val, 16);
val += ctx.simdShuffleDown(val, 8);
val += ctx.simdShuffleDown(val, 4);
val += ctx.simdShuffleDown(val, 2);
val += ctx.simdShuffleDown(val, 1);

These operations map to Metal simdgroup intrinsics and allow register-level communication across lanes, avoiding shared/threadgroup memory.
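As a sanity check of the pattern, the shuffle-down ladder above can be emulated on the CPU. The helpers below are illustrative only (not TornadoVM API); they model the convention that a lane whose source is out of range keeps its own value:

```java
public class WarpSumEmulation {
    // Models simdShuffleDown over a subgroup: lane i reads lane i + delta.
    // Out-of-range reads keep the lane's own value (real hardware leaves
    // them undefined; this convention is harmless for the ladder below).
    static float[] shuffleDown(float[] lanes, int delta) {
        float[] out = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = (i + delta < lanes.length) ? lanes[i + delta] : lanes[i];
        }
        return out;
    }

    // Applies the delta ladder 16, 8, 4, 2, 1; lane 0 ends up with the sum.
    static float warpSum(float[] lanes) {
        float[] val = lanes.clone();
        for (int delta : new int[] {16, 8, 4, 2, 1}) {
            float[] shifted = shuffleDown(val, delta);
            for (int i = 0; i < val.length; i++) {
                val[i] += shifted[i];
            }
        }
        return val[0];
    }
}
```

For 32 lanes holding the values 0..31, lane 0 ends with 496, the full sum, without any lane ever touching shared memory.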

Goal: Implement equivalent functionality in the OpenCL and PTX backends.


Motivation

Warp/subgroup primitives are essential for:

  • Reductions
  • Scans
  • Warp-aggregated atomics
  • Attention kernels
  • Matrix tiling
  • Tensor core pipelines

They eliminate shared memory traffic and synchronization barriers.

Benchmark results show a ~1.6× speedup over threadgroup-memory reductions.


Proposed TornadoVM API (Already Implemented)

Current exposed primitives:

float simdSum(float value)
float simdShuffleDown(float value, int delta)

Used via KernelContext ctx.


Required Backend Implementations

1. OpenCL Backend

OpenCL provides equivalent functionality through subgroup operations, available via the cl_khr_subgroups extension or as (optional) core functionality in OpenCL 3.0. Note that the relative shuffles (sub_group_shuffle_up/down) additionally require the cl_khr_subgroup_shuffle_relative extension or a vendor equivalent such as the Intel subgroup extensions.

Required Mappings

| TornadoVM | OpenCL |
|---|---|
| simdSum(x) | sub_group_reduce_add(x) |
| simdShuffleDown(x, d) | sub_group_shuffle_down(x, d) |

Example OpenCL Kernel

float val = input[gid];

val += sub_group_shuffle_down(val, 16);
val += sub_group_shuffle_down(val, 8);
val += sub_group_shuffle_down(val, 4);
val += sub_group_shuffle_down(val, 2);
val += sub_group_shuffle_down(val, 1);

OpenCL subgroup extensions allow data exchange between work-items in a subgroup without using local memory, improving performance compared to manual reductions.

Backend Tasks

Implement in:

  • OpenCLIntrinsicNode
  • OpenCLCodeGenerator

Add mappings:

  • KernelContext.simdShuffleDown → sub_group_shuffle_down
  • KernelContext.simdSum → sub_group_reduce_add

Required Capability Detection

At runtime, query:

  • CL_DEVICE_SUB_GROUP_SIZES_INTEL (from the Intel cl_intel_required_subgroup_size extension)
  • CL_DEVICE_MAX_NUM_SUB_GROUPS
  • CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS

Or check for:

  • __opencl_c_subgroups
  • cl_khr_subgroups
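A backend could gate lowering on a check of the device's extension string, along these lines (the method name is a hypothetical sketch, not existing TornadoVM code):

```java
public class SubgroupCapability {
    // Returns true if the CL_DEVICE_EXTENSIONS string advertises subgroup
    // support via the Khronos or Intel extension. A complete check would
    // also accept OpenCL 3.0 devices reporting __opencl_c_subgroups.
    static boolean supportsSubgroups(String extensions) {
        return extensions != null
                && (extensions.contains("cl_khr_subgroups")
                        || extensions.contains("cl_intel_subgroups"));
    }
}
```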

2. PTX Backend

CUDA PTX Equivalent Instructions

| TornadoVM | PTX |
|---|---|
| simdShuffleDown | shfl.sync.down.b32 |
| simdSum | warp reduction via shuffle |

Example PTX Lowering

shfl.sync.down.b32 r1, r0, 16, 0x1f, 0xffffffff;
add.f32 r0, r0, r1;

shfl.sync.down.b32 r1, r0, 8, 0x1f, 0xffffffff;
add.f32 r0, r0, r1;

(Operand order: destination, source, shift amount, clamp value 0x1f, member mask.)

Equivalent CUDA intrinsic:

__shfl_down_sync(mask, value, delta)

PTX Backend Tasks

Add lowering in:

  • PTXIntrinsicNode
  • PTXCodeGenerator

Mappings:

  • simdShuffleDown → shfl.sync.down
  • simdSum → warp reduction pattern
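One common lowering for simdSum is a butterfly reduction over shuffle-xor (shfl.sync.bfly in PTX), which leaves the total in every lane rather than only lane 0. A CPU emulation of that pattern, with illustrative helper names:

```java
public class ButterflySumEmulation {
    // Models a shuffle-xor: lane i exchanges values with lane i ^ mask.
    // Always in range when the lane count is a power of two.
    static float[] shuffleXor(float[] lanes, int mask) {
        float[] out = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = lanes[i ^ mask];
        }
        return out;
    }

    // Butterfly reduction: after log2(width) steps, every lane holds the sum.
    static float[] allReduceSum(float[] lanes) {
        float[] val = lanes.clone();
        for (int mask = lanes.length / 2; mask > 0; mask >>= 1) {
            float[] partner = shuffleXor(val, mask);
            for (int i = 0; i < val.length; i++) {
                val[i] += partner[i];
            }
        }
        return val;
    }
}
```

The all-lanes result is convenient when the sum feeds further per-lane work, avoiding a broadcast step.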

Important Design Considerations

1. Warp/Subgroup Size

| Backend | Subgroup Size |
|---|---|
| Metal | simdgroup_size = 32 |
| CUDA | warp_size = 32 |
| OpenCL | sub_group_size = implementation-defined (commonly 8–64) |

Portable code therefore cannot hard-code a size of 32: the assumption LOCAL_SIZE == subgroup_size must be validated at runtime.

Add runtime query: ctx.simdGroupSize()
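A minimal sketch of the validation the runtime could perform before dispatch, assuming ctx.simdGroupSize() returns the device's subgroup size (the helper below is hypothetical; it accepts any local size that is a positive multiple of the subgroup size):

```java
public class SubgroupSizeCheck {
    // A work-group that is not a whole number of subgroups would leave a
    // partial subgroup, breaking full-participation shuffle patterns.
    static boolean isValidLocalSize(int localSize, int subgroupSize) {
        return subgroupSize > 0
                && localSize > 0
                && localSize % subgroupSize == 0;
    }
}
```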

2. Active Lane Masks (Missing)

CUDA shuffle requires a mask:

__shfl_sync(mask, val, lane)

PTX: shfl.sync.* mask

The current API assumes full warp participation. Consider adding:

simdShuffleDown(float value, int delta, int mask)

Or use implicit mask = full warp.

3. Lane ID Access (Missing)

Useful primitive:

int lane = ctx.simdLaneId()

| Backend | Instruction |
|---|---|
| Metal | simd_lane_id |
| CUDA | %laneid |
| OpenCL | get_sub_group_local_id() |

Needed for warp prefix sums, warp compaction, and ballot patterns.

4. Subgroup Size Query (Missing)

Add: ctx.simdGroupSize()

| Backend | Instruction |
|---|---|
| Metal | simdgroup_size |
| CUDA | warpSize |
| OpenCL | get_sub_group_size() |

5. Missing Shuffle Variants

Currently only simdShuffleDown is exposed. Most GPU APIs also provide:

| Operation | CUDA | OpenCL | Metal |
|---|---|---|---|
| shuffle (index) | __shfl_sync | sub_group_shuffle | simd_shuffle |
| shuffle up | __shfl_up_sync | sub_group_shuffle_up | simd_shuffle_up |
| shuffle xor | __shfl_xor_sync | sub_group_shuffle_xor | simd_shuffle_xor |

Recommended API additions:

simdShuffle(value, lane)
simdShuffleUp(value, delta)
simdShuffleDown(value, delta)
simdShuffleXor(value, mask)
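The semantics of the four variants can be pinned down with a CPU model over a power-of-two-wide subgroup. The names mirror the proposed API but are illustrative only; out-of-range up/down reads keep the lane's own value, while index and xor shuffles wrap within the subgroup:

```java
public class ShuffleVariants {
    // Broadcast: every caller reads the value held by srcLane.
    static float shuffle(float[] lanes, int lane, int srcLane) {
        return lanes[srcLane & (lanes.length - 1)];
    }

    // Read from a lower lane; lanes with no source keep their own value.
    static float shuffleUp(float[] lanes, int lane, int delta) {
        return (lane - delta >= 0) ? lanes[lane - delta] : lanes[lane];
    }

    // Read from a higher lane; lanes with no source keep their own value.
    static float shuffleDown(float[] lanes, int lane, int delta) {
        return (lane + delta < lanes.length) ? lanes[lane + delta] : lanes[lane];
    }

    // Exchange with the lane whose index differs by the given XOR mask.
    static float shuffleXor(float[] lanes, int lane, int mask) {
        return lanes[(lane ^ mask) & (lanes.length - 1)];
    }
}
```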

6. SIMD Voting Operations (Missing)

Extremely useful for GPU algorithms:

| Operation | CUDA | OpenCL |
|---|---|---|
| any | __any_sync | sub_group_any |
| all | __all_sync | sub_group_all |
| ballot | __ballot_sync | sub_group_ballot |

Potential API:

simdAny(boolean predicate)
simdAll(boolean predicate)
simdBallot(boolean predicate)
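Reference semantics for the three voting operations, modeled on the CPU (illustrative names; a ballot packs one predicate bit per lane, lane 0 in the lowest bit):

```java
public class VoteEmulation {
    // True if the predicate holds on at least one lane.
    static boolean any(boolean[] pred) {
        for (boolean p : pred) {
            if (p) return true;
        }
        return false;
    }

    // True only if the predicate holds on every lane.
    static boolean all(boolean[] pred) {
        for (boolean p : pred) {
            if (!p) return false;
        }
        return true;
    }

    // Bitmask of lanes where the predicate holds (up to 32 lanes).
    static int ballot(boolean[] pred) {
        int bits = 0;
        for (int lane = 0; lane < pred.length; lane++) {
            if (pred[lane]) bits |= 1 << lane;
        }
        return bits;
    }
}
```

Ballot is the workhorse here: popcount of the mask gives the number of matching lanes, enabling warp-aggregated atomics and compaction.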

7. SIMD Prefix Operations

Useful for stream compaction, radix sort, and graph algorithms.

OpenCL already exposes:

  • sub_group_scan_inclusive_add
  • sub_group_scan_exclusive_add

Possible API:

simdPrefixSum(value)
simdPrefixExclusive(value)
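Both scans reduce to a Hillis-Steele pattern over shuffle-up: at step d, each lane adds the value it reads from lane (laneId - d), if that lane exists. A CPU sketch of the semantics (helper names are illustrative):

```java
public class PrefixSumEmulation {
    // Hillis-Steele inclusive scan using shuffle-up semantics:
    // after log2(width) doubling steps, lane i holds sum(lanes[0..i]).
    static float[] inclusiveScan(float[] lanes) {
        float[] val = lanes.clone();
        for (int d = 1; d < val.length; d <<= 1) {
            float[] prev = val.clone();
            for (int i = 0; i < val.length; i++) {
                if (i - d >= 0) {
                    val[i] = prev[i] + prev[i - d];
                }
            }
        }
        return val;
    }

    // Exclusive scan: shift the inclusive result up by one lane,
    // with the identity (0) in lane 0.
    static float[] exclusiveScan(float[] lanes) {
        float[] inc = inclusiveScan(lanes);
        float[] out = new float[lanes.length];
        for (int i = 1; i < out.length; i++) {
            out[i] = inc[i - 1];
        }
        return out;
    }
}
```

The exclusive scan is the form needed for stream compaction: each surviving lane's scan value is its output index.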

Suggested Final API

Minimal portable API surface:

// Reductions
float simdSum(float v);

// Shuffles
float simdShuffle(float v, int lane);
float simdShuffleDown(float v, int delta);
float simdShuffleUp(float v, int delta);
float simdShuffleXor(float v, int mask);

// Lane queries
int simdLaneId();
int simdGroupSize();

// Voting
boolean simdAny(boolean p);
boolean simdAll(boolean p);
int simdBallot(boolean p);

Expected Performance Impact

For reductions and warp collectives:

| Method | Approximate cost |
|---|---|
| Shared/local memory | ~50–100 cycles |
| Shuffle | ~5 cycles |

Benchmark results (~1.6× faster) align with the expected cycle-level advantage of register-to-register communication over shared memory paths.
