Summary
TornadoVM currently exposes SIMD subgroup operations through KernelContext, enabling efficient warp-level reductions using Metal's simd_sum and simd_shuffle_down.
Example (already working in the Metal backend, #796):

```java
float val = input.get(globalIdx);
val += ctx.simdShuffleDown(val, 16);
val += ctx.simdShuffleDown(val, 8);
val += ctx.simdShuffleDown(val, 4);
val += ctx.simdShuffleDown(val, 2);
val += ctx.simdShuffleDown(val, 1);
```

These operations map to Metal simdgroup intrinsics and allow register-level communication across lanes, avoiding shared/threadgroup memory.
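To make the reduction pattern concrete, here is a CPU-side emulation of the shuffle-down semantics (illustrative only, not TornadoVM code; lane behavior follows the usual CUDA/Metal convention that out-of-range reads leave a lane's value unchanged):

```python
def simd_shuffle_down(lanes, delta):
    """Each lane i receives the value held by lane i + delta; out-of-range
    lanes keep their own value, mirroring hardware shuffle-down edges."""
    n = len(lanes)
    return [lanes[i + delta] if i + delta < n else lanes[i] for i in range(n)]

def simd_sum_via_shuffle(lanes):
    """Tree reduction: after log2(width) shuffle+add steps, lane 0 holds
    the sum of all lanes (upper lanes hold partial garbage, as on hardware)."""
    delta = len(lanes) // 2
    while delta >= 1:
        shifted = simd_shuffle_down(lanes, delta)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        delta //= 2
    return lanes[0]

values = [float(i) for i in range(32)]  # one value per lane of a 32-wide warp
print(simd_sum_via_shuffle(values))     # sum of 0..31 = 496.0
```

Only lane 0's result is meaningful after the final step, which is why shuffle reductions typically end with a single lane writing the result.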
Goal: Implement equivalent functionality in the OpenCL and PTX backends.
Motivation
Warp/subgroup primitives are essential for:
- Reductions
- Scans
- Warp-aggregated atomics
- Attention kernels
- Matrix tiling
- Tensor core pipelines
They eliminate shared memory traffic and synchronization barriers.
Benchmark results show ~1.6× speedup vs threadgroup memory reductions.
Proposed TornadoVM API (Already Implemented)
Current exposed primitives:
```java
float simdSum(float value)
float simdShuffleDown(float value, int delta)
```

Both are used via the KernelContext instance ctx.
Required Backend Implementations
1. OpenCL Backend
OpenCL provides equivalent functionality through subgroup operations, available via the cl_khr_subgroups extension or OpenCL 3.0 core.
Required Mappings
| TornadoVM | OpenCL |
|---|---|
| simdSum(x) | sub_group_reduce_add(x) |
| simdShuffleDown(x, d) | sub_group_shuffle_down(x, d) |
Example OpenCL Kernel
```c
float val = input[gid];
val += sub_group_shuffle_down(val, 16);
val += sub_group_shuffle_down(val, 8);
val += sub_group_shuffle_down(val, 4);
val += sub_group_shuffle_down(val, 2);
val += sub_group_shuffle_down(val, 1);
```

OpenCL subgroup extensions allow work-items in a subgroup to exchange data without using local memory, improving performance compared to manual reductions.
Backend Tasks
Implement in:
- OpenCLIntrinsicNode
- OpenCLCodeGenerator
Add mappings:
- KernelContext.simdShuffleDown → sub_group_shuffle_down
- KernelContext.simdSum → sub_group_reduce_add
Required Capability Detection
At runtime, query:
- CL_DEVICE_SUB_GROUP_SIZES_INTEL
- CL_DEVICE_MAX_NUM_SUB_GROUPS
- CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS
Or check for:
- __opencl_c_subgroups
- cl_khr_subgroups
2. PTX Backend
CUDA PTX Equivalent Instructions
| TornadoVM | PTX |
|---|---|
| simdShuffleDown | shfl.sync.down |
| simdSum | warp reduction via shuffle |
Example PTX Lowering
```
shfl.sync.down.b32 r1, r0, 16, 31, 0xffffffff;
add.f32 r0, r0, r1;
shfl.sync.down.b32 r1, r0, 8, 31, 0xffffffff;
add.f32 r0, r0, r1;
```

(Per the PTX ISA, the operand order is `shfl.sync.down.b32 d, a, b, c, membermask;` — the clamp value precedes the membermask.)

Equivalent CUDA intrinsic:

```c
__shfl_down_sync(mask, value, delta)
```

PTX Backend Tasks
Add lowering in:
- PTXIntrinsicNode
- PTXCodeGenerator
Mappings:
- simdShuffleDown → shfl.sync.down
- simdSum → warp reduction pattern (shuffle + add sequence)
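Since simdSum has no single PTX instruction, the code generator must emit a shuffle-based sequence. The following sketch (illustrative only; register names r0/r1 and the full-warp mask are assumptions, not TornadoVM internals) shows the instruction sequence such a lowering could produce:

```python
def lower_simd_sum(warp_size=32, src="r0", tmp="r1", mask="0xffffffff"):
    """Generate the PTX instruction sequence for a shuffle-based warp sum.
    Operand order per the PTX ISA: shfl.sync.down.b32 d, a, b, c, membermask;
    where b is the lane delta and c is the clamp (31 for a full warp)."""
    instructions = []
    delta = warp_size // 2
    while delta >= 1:
        instructions.append(f"shfl.sync.down.b32 {tmp}, {src}, {delta}, 31, {mask};")
        instructions.append(f"add.f32 {src}, {src}, {tmp};")
        delta //= 2
    return instructions

for insn in lower_simd_sum():
    print(insn)
```

For a 32-wide warp this yields five shuffle/add pairs (deltas 16, 8, 4, 2, 1), matching the hand-written examples above.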
Important Design Considerations
1. Warp/Subgroup Size
| Backend | Subgroup Size |
|---|---|
| Metal | simdgroup_size = 32 |
| CUDA | warp_size = 32 |
| OpenCL | sub_group_size = implementation-defined (8–64) |
Therefore LOCAL_SIZE == subgroup_size must be validated.
Add runtime query: ctx.simdGroupSize()
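The table above is why hard-coded deltas (16, 8, 4, 2, 1) are not portable: the delta sequence must be derived from the queried subgroup size. A minimal sketch, with the proposed ctx.simdGroupSize() modeled as a plain parameter:

```python
def shuffle_deltas(subgroup_size):
    """Delta sequence for a full shuffle-down tree reduction over one
    subgroup. Assumes a power-of-two subgroup size, as on current GPUs."""
    assert subgroup_size & (subgroup_size - 1) == 0, "expects a power of two"
    deltas = []
    d = subgroup_size // 2
    while d >= 1:
        deltas.append(d)
        d //= 2
    return deltas

print(shuffle_deltas(32))  # [16, 8, 4, 2, 1] — matches the Metal/CUDA examples
print(shuffle_deltas(8))   # [4, 2, 1] — e.g. an OpenCL device with 8-wide subgroups
```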
2. Active Lane Masks (Missing)
CUDA shuffle requires a mask:
```c
__shfl_sync(mask, val, lane)
```

PTX: the shfl.sync.* instructions take a membermask operand.
The current API assumes full warp participation. Consider adding:
```java
simdShuffleDown(float value, int delta, int mask)
```

Or use an implicit mask equal to the full warp.
3. Lane ID Access (Missing)
Useful primitive:
```java
int lane = ctx.simdLaneId()
```

| Backend | Instruction |
|---|---|
| Metal | simd_lane_id |
| CUDA | %laneid |
| OpenCL | get_sub_group_local_id() |
Needed for warp prefix sums, warp compaction, and ballot patterns.
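As an illustration of why lane IDs matter, here is a CPU-side emulation (not TornadoVM code) of a warp inclusive prefix sum: a Hillis–Steele scan built from shuffle-up, where each lane uses its lane ID to decide whether to accumulate:

```python
def simd_shuffle_up(lanes, delta):
    """Each lane i receives the value held by lane i - delta; lanes with
    lane_id < delta keep their own value, mirroring shuffle-up edges."""
    return [lanes[i - delta] if i - delta >= 0 else lanes[i] for i in range(len(lanes))]

def simd_prefix_sum(lanes):
    """Hillis-Steele inclusive scan across one subgroup: log2(width) steps,
    each lane adding the value shuffled up by the current delta."""
    width = len(lanes)
    delta = 1
    while delta < width:
        shifted = simd_shuffle_up(lanes, delta)
        # Only lanes with lane_id >= delta accumulate — this is the laneId test.
        lanes = [lanes[i] + shifted[i] if i >= delta else lanes[i] for i in range(width)]
        delta *= 2
    return lanes

print(simd_prefix_sum([1.0] * 8))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

On a GPU the `i >= delta` predicate becomes `ctx.simdLaneId() >= delta` inside each lane.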
4. Subgroup Size Query (Missing)
Add: ctx.simdGroupSize()
| Backend | Instruction |
|---|---|
| Metal | simdgroup_size |
| CUDA | warpSize |
| OpenCL | get_sub_group_size() |
5. Missing Shuffle Variants
Currently only simdShuffleDown is exposed. Most GPU APIs also provide:
| Operation | CUDA | OpenCL | Metal |
|---|---|---|---|
| shuffle index | shfl_sync | sub_group_shuffle | simd_shuffle |
| shuffle up | shfl_up | sub_group_shuffle_up | simd_shuffle_up |
| shuffle xor | shfl_xor | sub_group_shuffle_xor | simd_shuffle_xor |
Recommended API additions:
```java
simdShuffle(value, lane)
simdShuffleUp(value, delta)
simdShuffleDown(value, delta)
simdShuffleXor(value, mask)
```

6. SIMD Voting Operations (Missing)
Extremely useful for GPU algorithms:
| Operation | CUDA | OpenCL |
|---|---|---|
| any | __any_sync | sub_group_any |
| all | __all_sync | sub_group_all |
| ballot | __ballot_sync | sub_group_ballot |
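The voting semantics in the table can be sketched with a CPU-side emulation (illustrative only): each lane contributes a predicate, and ballot packs the per-lane bits into an integer.

```python
def simd_any(predicates):
    """True if any lane's predicate is true."""
    return any(predicates)

def simd_all(predicates):
    """True only if every lane's predicate is true."""
    return all(predicates)

def simd_ballot(predicates):
    """Bit i of the result is set iff lane i's predicate is true."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

preds = [v > 0 for v in [3, -1, 4, 0]]  # predicates for a 4-wide example
print(simd_any(preds))          # True
print(simd_all(preds))          # False
print(bin(simd_ballot(preds)))  # 0b101 — lanes 0 and 2 voted true
```

The ballot mask is what enables warp compaction: popcount of the bits below a lane's ID gives that lane's output slot.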
Potential API:

```java
simdAny(boolean predicate)
simdAll(boolean predicate)
simdBallot(boolean predicate)
```

7. SIMD Prefix Operations
Useful for stream compaction, radix sort, and graph algorithms.
OpenCL already exposes:
- sub_group_scan_inclusive_add
- sub_group_scan_exclusive_add
Possible API:

```java
simdPrefixSum(value)
simdPrefixExclusive(value)
```

Suggested Final API
Minimal portable API surface:

```java
// Reductions
float simdSum(float v);

// Shuffles
float simdShuffle(float v, int lane);
float simdShuffleDown(float v, int delta);
float simdShuffleUp(float v, int delta);
float simdShuffleXor(float v, int mask);

// Lane queries
int simdLaneId();
int simdGroupSize();

// Voting
boolean simdAny(boolean p);
boolean simdAll(boolean p);
int simdBallot(boolean p);
```

Expected Performance Impact
For reductions and warp collectives:
| Method | Cost |
|---|---|
| Shared memory | ~50–100 cycles |
| Shuffle | ~5 cycles |
Benchmark results (~1.6× faster) align with the expected cycle-level advantage of register-to-register communication over shared memory paths.