Issue
I observed a performance regression in a kernel I am working on after switching to the latest Helion main branch. The configs were autotuned with the official Helion 1.0.0 release. Given the root cause below, this regression could affect other Helion users as well.
Root cause
The config used for this test kernel is:
{
"block_sizes": [
8,
16,
256
],
"loop_orders": [
[
1,
0
]
],
"l2_groupings": [
4
],
"range_unroll_factors": [
0,
2
],
"range_warp_specializes": [],
"range_num_stages": [],
"range_multi_buffers": [
null,
null
],
"range_flattens": [
null,
true
],
"load_eviction_policies": [
"first",
"last",
"last",
"first",
"last",
"last",
"last"
],
"num_warps": 2,
"num_stages": 4,
"indexing": [
"tensor_descriptor",
"pointer",
"tensor_descriptor",
"tensor_descriptor",
"tensor_descriptor",
"tensor_descriptor",
"tensor_descriptor",
"pointer"
],
"pid_type": "flat"
}
Diff between the lowered Triton code from the two Helion versions:
- for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_2, loop_unroll_factor=2, flatten=True): # from 1.0.0
+ for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_2, flatten=True): # from latest main
In other words, range_unroll_factors=[0, 2] is no longer respected on latest main: the loop_unroll_factor=2 argument is missing from the generated tl.range call. This points to a recent change in this if condition, where
if config.indexing == "tensor_descriptor":
is changed to
if "tensor_descriptor" in config.indexing:
Reverting this line restores the original performance.
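To illustrate why the two conditions behave differently for this config, here is a minimal plain-Python sketch. It does not use Helion. The helper names `old_check` and `new_check` are hypothetical, and the body of the guarded branch is not shown in this issue, so the sketch only evaluates the conditions themselves. When `indexing` is a per-access list, as in the config above, comparing the list to a string is always False, while the membership test is True as soon as any entry is "tensor_descriptor".

```python
# Hypothetical helpers that mirror only the two conditions quoted above.
def old_check(indexing):
    # A list never compares equal to a str, so this is False for list-valued indexing.
    return indexing == "tensor_descriptor"


def new_check(indexing):
    # For a list this is a membership test; for a str it is a substring test.
    return "tensor_descriptor" in indexing


# The "indexing" value from the config in this issue.
mixed = [
    "tensor_descriptor", "pointer", "tensor_descriptor", "tensor_descriptor",
    "tensor_descriptor", "tensor_descriptor", "tensor_descriptor", "pointer",
]

print(old_check(mixed), new_check(mixed))    # False True
print(old_check("tensor_descriptor"), new_check("tensor_descriptor"))  # True True
print(old_check(["pointer"]), new_check(["pointer"]))  # False False
```

So for a config autotuned on 1.0.0 with list-valued indexing, the old condition never fired, whereas the new one fires whenever at least one access uses a tensor descriptor. That is consistent with the unroll factor disappearing from the lowered code, assuming the guarded branch is what drops it.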
Impact
This means Helion users have to rerun autotuning and update their configs to avoid the potential regression.
On B200, I also noticed that a config autotuned with Helion 1.0.0 fails at kernel execution after switching to latest main. I will confirm whether this is related to the same change and follow up with more data points and an example to reproduce it.
Action items
The commit that causes this regression is actually legitimate. It fixes the original code, which was meant to skip configs that can lead to a CUDA misaligned address error. I also hit this error on B200 when autotuning a scaled_mm kernel, and including this commit avoids it.
So reverting the commit is probably not reasonable. Instead, I would suggest making the autotuner able to recover from unrecoverable CUDA errors. I see some ongoing work here already.
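One common way to survive such errors is to evaluate each candidate in a separate process, so that a failed candidate cannot poison the parent. The sketch below only illustrates that idea and makes no claim about how Helion's autotuner works or about the ongoing work mentioned above. `run_isolated` and the `candidates` dict are hypothetical, and the "bad" snippet simply exits with a nonzero status to stand in for a process whose CUDA context became unusable.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 30.0) -> bool:
    """Run a Python snippet in a fresh interpreter; True if it exits cleanly."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hung candidate is treated the same as a failed one.
        return False
    return proc.returncode == 0


# Hypothetical candidates: each snippet stands in for "benchmark one config".
candidates = {
    "good": "print('ok')",
    "bad": "import os; os._exit(1)",  # stand-in for a fatal error in the child
}

# The parent process keeps running no matter how a child ends.
surviving = [name for name, code in candidates.items() if run_isolated(code)]
print(surviving)
```

The trade-off is per-candidate process startup cost, which an autotuner would need to weigh against the benefit of not losing the whole search to one bad config.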