
Add Subgroup/Warp Shuffle Support to OpenCL and PTX Backends #812

@mikepapadim

Description


Summary

TornadoVM currently exposes SIMD subgroup operations through KernelContext, enabling efficient warp-level reductions using Metal's simd_sum and simd_shuffle_down.

Example (already working in the Metal backend, #796):

float val = input.get(globalIdx);
val += ctx.simdShuffleDown(val, 16);
val += ctx.simdShuffleDown(val, 8);
val += ctx.simdShuffleDown(val, 4);
val += ctx.simdShuffleDown(val, 2);
val += ctx.simdShuffleDown(val, 1);

These operations map to Metal simdgroup intrinsics and allow register-level communication across lanes, avoiding shared/threadgroup memory.
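As a sanity check of the pattern, the shuffle-down ladder above can be emulated on the CPU. The helpers below are illustrative only (not TornadoVM API); they model the convention that a lane whose source is out of range keeps its own value:

```java
public class WarpSumEmulation {
    // Models simdShuffleDown over a subgroup: lane i reads lane i + delta.
    // Out-of-range reads keep the lane's own value (real hardware leaves
    // them undefined; this convention is harmless for the ladder below).
    static float[] shuffleDown(float[] lanes, int delta) {
        float[] out = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = (i + delta < lanes.length) ? lanes[i + delta] : lanes[i];
        }
        return out;
    }

    // Applies the delta ladder 16, 8, 4, 2, 1; lane 0 ends up with the sum.
    static float warpSum(float[] lanes) {
        float[] val = lanes.clone();
        for (int delta : new int[] {16, 8, 4, 2, 1}) {
            float[] shifted = shuffleDown(val, delta);
            for (int i = 0; i < val.length; i++) {
                val[i] += shifted[i];
            }
        }
        return val[0];
    }
}
```

For 32 lanes holding the values 0..31, lane 0 ends with 496, the full sum, without any lane ever touching shared memory.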

Goal: Implement equivalent functionality in the OpenCL and PTX backends.


Motivation

Warp/subgroup primitives are essential for:

  • Reductions
  • Scans
  • Warp-aggregated atomics
  • Attention kernels
  • Matrix tiling
  • Tensor core pipelines

They eliminate shared memory traffic and synchronization barriers.

Benchmark results show a ~1.6× speedup over threadgroup-memory reductions.


Proposed TornadoVM API (Already Implemented)

Current exposed primitives:

float simdSum(float value)
float simdShuffleDown(float value, int delta)

Used via KernelContext ctx.


Required Backend Implementations

1. OpenCL Backend

OpenCL provides equivalent functionality through subgroup operations, available via the cl_khr_subgroups extension or as (optional) core functionality in OpenCL 3.0. Note that the relative shuffles (sub_group_shuffle_up/down) additionally require the cl_khr_subgroup_shuffle_relative extension or a vendor equivalent such as the Intel subgroup extensions.

Required Mappings

| TornadoVM | OpenCL |
|---|---|
| simdSum(x) | sub_group_reduce_add(x) |
| simdShuffleDown(x, d) | sub_group_shuffle_down(x, d) |

Example OpenCL Kernel

float val = input[gid];

val += sub_group_shuffle_down(val, 16);
val += sub_group_shuffle_down(val, 8);
val += sub_group_shuffle_down(val, 4);
val += sub_group_shuffle_down(val, 2);
val += sub_group_shuffle_down(val, 1);

OpenCL subgroup extensions allow data exchange between work-items in a subgroup without using local memory, improving performance compared to manual reductions.

Backend Tasks

Implement in:

  • OpenCLIntrinsicNode
  • OpenCLCodeGenerator

Add mappings:

  • KernelContext.simdShuffleDown → sub_group_shuffle_down
  • KernelContext.simdSum → sub_group_reduce_add

Required Capability Detection

At runtime, query:

  • CL_DEVICE_SUB_GROUP_SIZES_INTEL (from the Intel cl_intel_required_subgroup_size extension)
  • CL_DEVICE_MAX_NUM_SUB_GROUPS
  • CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS

Or check for:

  • __opencl_c_subgroups
  • cl_khr_subgroups
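A backend could gate lowering on a check of the device's extension string, along these lines (the method name is a hypothetical sketch, not existing TornadoVM code):

```java
public class SubgroupCapability {
    // Returns true if the CL_DEVICE_EXTENSIONS string advertises subgroup
    // support via the Khronos or Intel extension. A complete check would
    // also accept OpenCL 3.0 devices reporting __opencl_c_subgroups.
    static boolean supportsSubgroups(String extensions) {
        return extensions != null
                && (extensions.contains("cl_khr_subgroups")
                        || extensions.contains("cl_intel_subgroups"));
    }
}
```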

2. PTX Backend

CUDA PTX Equivalent Instructions

| TornadoVM | PTX |
|---|---|
| simdShuffleDown | shfl.sync.down.b32 |
| simdSum | warp reduction via shuffle |

Example PTX Lowering

shfl.sync.down.b32 r1, r0, 16, 0x1f, 0xffffffff;
add.f32 r0, r0, r1;

shfl.sync.down.b32 r1, r0, 8, 0x1f, 0xffffffff;
add.f32 r0, r0, r1;

(Operand order: destination, source, shift amount, clamp value 0x1f, member mask.)

Equivalent CUDA intrinsic:

__shfl_down_sync(mask, value, delta)

PTX Backend Tasks

Add lowering in:

  • PTXIntrinsicNode
  • PTXCodeGenerator

Mappings:

  • simdShuffleDown → shfl.sync.down
  • simdSum → warp reduction pattern
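One common lowering for simdSum is a butterfly reduction over shuffle-xor (shfl.sync.bfly in PTX), which leaves the total in every lane rather than only lane 0. A CPU emulation of that pattern, with illustrative helper names:

```java
public class ButterflySumEmulation {
    // Models a shuffle-xor: lane i exchanges values with lane i ^ mask.
    // Always in range when the lane count is a power of two.
    static float[] shuffleXor(float[] lanes, int mask) {
        float[] out = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = lanes[i ^ mask];
        }
        return out;
    }

    // Butterfly reduction: after log2(width) steps, every lane holds the sum.
    static float[] allReduceSum(float[] lanes) {
        float[] val = lanes.clone();
        for (int mask = lanes.length / 2; mask > 0; mask >>= 1) {
            float[] partner = shuffleXor(val, mask);
            for (int i = 0; i < val.length; i++) {
                val[i] += partner[i];
            }
        }
        return val;
    }
}
```

The all-lanes result is convenient when the sum feeds further per-lane work, avoiding a broadcast step.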

Important Design Considerations

1. Warp/Subgroup Size

| Backend | Subgroup Size |
|---|---|
| Metal | simdgroup_size = 32 |
| CUDA | warp_size = 32 |
| OpenCL | sub_group_size = implementation-defined (commonly 8–64) |

Portable code therefore cannot hard-code a size of 32: the assumption LOCAL_SIZE == subgroup_size must be validated at runtime.

Add runtime query: ctx.simdGroupSize()
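A minimal sketch of the validation the runtime could perform before dispatch, assuming ctx.simdGroupSize() returns the device's subgroup size (the helper below is hypothetical; it accepts any local size that is a positive multiple of the subgroup size):

```java
public class SubgroupSizeCheck {
    // A work-group that is not a whole number of subgroups would leave a
    // partial subgroup, breaking full-participation shuffle patterns.
    static boolean isValidLocalSize(int localSize, int subgroupSize) {
        return subgroupSize > 0
                && localSize > 0
                && localSize % subgroupSize == 0;
    }
}
```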

2. Active Lane Masks (Missing)

CUDA shuffle requires a mask:

__shfl_sync(mask, val, lane)

PTX: shfl.sync.* mask

The current API assumes full warp participation. Consider adding:

simdShuffleDown(float value, int delta, int mask)

Or use implicit mask = full warp.

3. Lane ID Access (Missing)

Useful primitive:

int lane = ctx.simdLaneId()

| Backend | Instruction |
|---|---|
| Metal | simd_lane_id |
| CUDA | %laneid |
| OpenCL | get_sub_group_local_id() |

Needed for warp prefix sums, warp compaction, and ballot patterns.

4. Subgroup Size Query (Missing)

Add: ctx.simdGroupSize()

| Backend | Instruction |
|---|---|
| Metal | simdgroup_size |
| CUDA | warpSize |
| OpenCL | get_sub_group_size() |

5. Missing Shuffle Variants

Currently only simdShuffleDown is exposed. Most GPU APIs also provide:

| Operation | CUDA | OpenCL | Metal |
|---|---|---|---|
| shuffle (index) | __shfl_sync | sub_group_shuffle | simd_shuffle |
| shuffle up | __shfl_up_sync | sub_group_shuffle_up | simd_shuffle_up |
| shuffle xor | __shfl_xor_sync | sub_group_shuffle_xor | simd_shuffle_xor |

Recommended API additions:

simdShuffle(value, lane)
simdShuffleUp(value, delta)
simdShuffleDown(value, delta)
simdShuffleXor(value, mask)
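The semantics of the four variants can be pinned down with a CPU model over a power-of-two-wide subgroup. The names mirror the proposed API but are illustrative only; out-of-range up/down reads keep the lane's own value, while index and xor shuffles wrap within the subgroup:

```java
public class ShuffleVariants {
    // Broadcast: every caller reads the value held by srcLane.
    static float shuffle(float[] lanes, int lane, int srcLane) {
        return lanes[srcLane & (lanes.length - 1)];
    }

    // Read from a lower lane; lanes with no source keep their own value.
    static float shuffleUp(float[] lanes, int lane, int delta) {
        return (lane - delta >= 0) ? lanes[lane - delta] : lanes[lane];
    }

    // Read from a higher lane; lanes with no source keep their own value.
    static float shuffleDown(float[] lanes, int lane, int delta) {
        return (lane + delta < lanes.length) ? lanes[lane + delta] : lanes[lane];
    }

    // Exchange with the lane whose index differs by the given XOR mask.
    static float shuffleXor(float[] lanes, int lane, int mask) {
        return lanes[(lane ^ mask) & (lanes.length - 1)];
    }
}
```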

6. SIMD Voting Operations (Missing)

Extremely useful for GPU algorithms:

| Operation | CUDA | OpenCL |
|---|---|---|
| any | __any_sync | sub_group_any |
| all | __all_sync | sub_group_all |
| ballot | __ballot_sync | sub_group_ballot |

Potential API:

simdAny(boolean predicate)
simdAll(boolean predicate)
simdBallot(boolean predicate)
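Reference semantics for the three voting operations, modeled on the CPU (illustrative names; a ballot packs one predicate bit per lane, lane 0 in the lowest bit):

```java
public class VoteEmulation {
    // True if the predicate holds on at least one lane.
    static boolean any(boolean[] pred) {
        for (boolean p : pred) {
            if (p) return true;
        }
        return false;
    }

    // True only if the predicate holds on every lane.
    static boolean all(boolean[] pred) {
        for (boolean p : pred) {
            if (!p) return false;
        }
        return true;
    }

    // Bitmask of lanes where the predicate holds (up to 32 lanes).
    static int ballot(boolean[] pred) {
        int bits = 0;
        for (int lane = 0; lane < pred.length; lane++) {
            if (pred[lane]) bits |= 1 << lane;
        }
        return bits;
    }
}
```

Ballot is the workhorse here: popcount of the mask gives the number of matching lanes, enabling warp-aggregated atomics and compaction.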

7. SIMD Prefix Operations

Useful for stream compaction, radix sort, and graph algorithms.

OpenCL already exposes:

  • sub_group_scan_inclusive_add
  • sub_group_scan_exclusive_add

Possible API:

simdPrefixSum(value)
simdPrefixExclusive(value)
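Both scans reduce to a Hillis-Steele pattern over shuffle-up: at step d, each lane adds the value it reads from lane (laneId - d), if that lane exists. A CPU sketch of the semantics (helper names are illustrative):

```java
public class PrefixSumEmulation {
    // Hillis-Steele inclusive scan using shuffle-up semantics:
    // after log2(width) doubling steps, lane i holds sum(lanes[0..i]).
    static float[] inclusiveScan(float[] lanes) {
        float[] val = lanes.clone();
        for (int d = 1; d < val.length; d <<= 1) {
            float[] prev = val.clone();
            for (int i = 0; i < val.length; i++) {
                if (i - d >= 0) {
                    val[i] = prev[i] + prev[i - d];
                }
            }
        }
        return val;
    }

    // Exclusive scan: shift the inclusive result up by one lane,
    // with the identity (0) in lane 0.
    static float[] exclusiveScan(float[] lanes) {
        float[] inc = inclusiveScan(lanes);
        float[] out = new float[lanes.length];
        for (int i = 1; i < out.length; i++) {
            out[i] = inc[i - 1];
        }
        return out;
    }
}
```

The exclusive scan is the form needed for stream compaction: each surviving lane's scan value is its output index.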

Suggested Final API

Minimal portable API surface:

// Reductions
float simdSum(float v);

// Shuffles
float simdShuffle(float v, int lane);
float simdShuffleDown(float v, int delta);
float simdShuffleUp(float v, int delta);
float simdShuffleXor(float v, int mask);

// Lane queries
int simdLaneId();
int simdGroupSize();

// Voting
boolean simdAny(boolean p);
boolean simdAll(boolean p);
int simdBallot(boolean p);

Expected Performance Impact

For reductions and warp collectives:

| Method | Approximate cost |
|---|---|
| Shared/local memory | ~50–100 cycles |
| Shuffle | ~5 cycles |

Benchmark results (~1.6× faster) align with the expected cycle-level advantage of register-to-register communication over shared memory paths.
