Kernel perf regression using latest main branch with configs autotuned from v1.0.0

### Issue
I observed a performance regression in a [kernel](https://github.com/vllm-project/vllm/blob/03f8a48aea181680f6ef5d6c9d0be8a6bb4912ae/vllm/kernels/helion/ops/scaled_mm.py#L158) I am working on after switching to the latest Helion main branch. The configs were autotuned with the official Helion 1.0.0 release. Given the root cause below, this regression could affect other Helion users as well.

### Root cause
The config used for this test [kernel](https://github.com/vllm-project/vllm/blob/03f8a48aea181680f6ef5d6c9d0be8a6bb4912ae/vllm/kernels/helion/ops/scaled_mm.py#L158) is:
```
{
    "block_sizes": [
      8,
      16,
      256
    ],
    "loop_orders": [
      [
        1,
        0
      ]
    ],
    "l2_groupings": [
      4
    ],
    "range_unroll_factors": [
      0,
      2
    ],
    "range_warp_specializes": [],
    "range_num_stages": [],
    "range_multi_buffers": [
      null,
      null
    ],
    "range_flattens": [
      null,
      true
    ],
    "load_eviction_policies": [
      "first",
      "last",
      "last",
      "first",
      "last",
      "last",
      "last"
    ],
    "num_warps": 2,
    "num_stages": 4,
    "indexing": [
      "tensor_descriptor",
      "pointer",
      "tensor_descriptor",
      "tensor_descriptor",
      "tensor_descriptor",
      "tensor_descriptor",
      "tensor_descriptor",
      "pointer"
    ],
    "pid_type": "flat"
  }
```
Diff between the lowered Triton code from the two Helion versions:
```
- for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_2, loop_unroll_factor=2, flatten=True): # from 1.0.0
+ for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_2, flatten=True): # from latest main
```
So, essentially, `range_unroll_factors=[0, 2]` is not respected on the latest main: the `loop_unroll_factor=2` argument is missing from the generated `tl.range` call. This points to a recent change in this `if` [condition](https://github.com/pytorch/helion/blob/4292b2a0e2c6756fa0a0ef0b8679e92bd13559e0/helion/_compiler/tile_strategy.py#L272-L278), where
```
if config.indexing == "tensor_descriptor":
```
is changed to
```
if "tensor_descriptor" in config.indexing:
```

Reverting this line restores performance to the original level.
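A minimal sketch of why the two conditions differ for this config. It relies only on plain Python semantics and does not reproduce Helion's actual code path: since `indexing` in the config above is a list, comparing it to a string with `==` is always false, while the `in` check is true as soon as one entry is `"tensor_descriptor"`. The assumption here, inferred from the Triton diff above, is that the branch guarded by this condition is what drops the unroll factor.

```python
# The "indexing" value from the config above (a per-load list, not a single string).
indexing = [
    "tensor_descriptor",
    "pointer",
    "tensor_descriptor",
    "tensor_descriptor",
    "tensor_descriptor",
    "tensor_descriptor",
    "tensor_descriptor",
    "pointer",
]

# Form of the condition before the change: a list never equals a string,
# so the guarded branch is skipped for list-valued indexing.
old_guard = indexing == "tensor_descriptor"

# Form of the condition after the change: true if any entry uses tensor descriptors.
new_guard = "tensor_descriptor" in indexing

print(old_guard, new_guard)
```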


### Impact
This essentially means Helion users have to rerun autotuning to update their configs and avoid the potential regression.
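To decide which saved configs are worth re-autotuning first, a rough heuristic can flag the ones that match the pattern seen in this issue. The sketch below is based on this single observation only; `may_be_affected` is a hypothetical helper, not part of Helion, and it assumes configs are available as plain dicts shaped like the JSON above.

```python
def may_be_affected(config: dict) -> bool:
    """Heuristic: list-valued indexing with a tensor_descriptor entry,
    combined with at least one non-zero range unroll factor."""
    indexing = config.get("indexing", "pointer")
    unroll_factors = config.get("range_unroll_factors", [])
    # Only list-valued indexing changes behavior between the `==` and `in` checks.
    mixed_list = isinstance(indexing, list) and "tensor_descriptor" in indexing
    has_unroll = any(f is not None and f > 0 for f in unroll_factors)
    return mixed_list and has_unroll


# Relevant fields from the config in this issue.
issue_config = {
    "range_unroll_factors": [0, 2],
    "indexing": ["tensor_descriptor", "pointer", "tensor_descriptor"],
}
print(may_be_affected(issue_config))
```

A config flagged by this check is only a candidate; comparing the lowered Triton code between versions, as in the diff above, is the reliable confirmation.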

On B200, I also noticed that a config autotuned with Helion 1.0.0 fails kernel execution after switching to the latest main. I will confirm whether this is related to the same change and provide more data points and an example to reproduce it.

### Action items
The [commit](https://github.com/pytorch/helion/commit/4292b2a0e2c6756fa0a0ef0b8679e92bd13559e0#diff-077cbec3a4e27038c21b68b401a68f3fcf1227e75d00fa666064f321973052cf) causing this regression is actually legitimate. It fixes the original code, which aims to skip configs that can potentially lead to a CUDA misaligned address error. I also hit this CUDA misaligned address error on B200 when autotuning the scaled_mm kernel, and including this commit avoids the issue.

So it is probably unreasonable to revert this commit. Instead, I would suggest adding support for the autotuner to recover from unrecoverable CUDA errors. I see some ongoing work [here](https://github.com/pytorch/helion/pull/1921) already.
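One generic way to survive errors that poison the process is to run each candidate in a child process and treat a non-zero exit as a failed config. The sketch below shows only that general technique with the standard library; it is not a description of what the linked PR does, and `run_isolated` plus the snippets passed to it are hypothetical stand-ins for a real benchmark.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout_s: float = 60.0) -> bool:
    """Run a Python snippet in a child process; return True if it exited cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # A hung candidate is treated the same as a crashed one.
        return False
    return result.returncode == 0


# Stand-ins for benchmarking a healthy config and one that kills its process.
ok = run_isolated("print('benchmark ok')")
crashed = run_isolated("import os; os._exit(139)")
print(ok, crashed)
```

The parent process stays usable after the child dies, so the search can continue with the next config.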