Kernel perf regression using latest main branch with configs autotuned from v1.0.0 #2044

@xiaohongchen1991

Issue

I observed a perf regression in a kernel I am working on after switching to the latest Helion main branch. The config used was autotuned with the official Helion 1.0.0 release. Given the root cause below, this regression could affect other Helion users as well.

Root cause

The config used for the test kernel is:

{
    "block_sizes": [
      8,
      16,
      256
    ],
    "loop_orders": [
      [
        1,
        0
      ]
    ],
    "l2_groupings": [
      4
    ],
    "range_unroll_factors": [
      0,
      2
    ],
    "range_warp_specializes": [],
    "range_num_stages": [],
    "range_multi_buffers": [
      null,
      null
    ],
    "range_flattens": [
      null,
      true
    ],
    "load_eviction_policies": [
      "first",
      "last",
      "last",
      "first",
      "last",
      "last",
      "last"
    ],
    "num_warps": 2,
    "num_stages": 4,
    "indexing": [
      "tensor_descriptor",
      "pointer",
      "tensor_descriptor",
      "tensor_descriptor",
      "tensor_descriptor",
      "tensor_descriptor",
      "tensor_descriptor",
      "pointer"
    ],
    "pid_type": "flat"
  }

Diff between the lowered Triton code from the two Helion versions:

- for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_2, loop_unroll_factor=2, flatten=True): # from 1.0.0
+ for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_2, flatten=True): # from latest main

So essentially, range_unroll_factors=[0, 2] is not respected on latest main: the loop_unroll_factor=2 argument is dropped from the generated tl.range call. This points to a recent change in this if condition, where

if config.indexing == "tensor_descriptor":

is changed to

if "tensor_descriptor" in config.indexing:

Reverting this line brings the perf numbers back to the original level.
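To illustrate why the two conditions behave differently for this config, here is a minimal sketch in plain Python. It assumes `config.indexing` holds the list shown in the config above; the helper names `old_condition` and `new_condition` are hypothetical and only mirror the two expressions quoted in this issue. It does not reproduce what Helion does inside the guarded branch.

```python
# Values copied from the "indexing" entry of the config above.
indexing = [
    "tensor_descriptor", "pointer", "tensor_descriptor", "tensor_descriptor",
    "tensor_descriptor", "tensor_descriptor", "tensor_descriptor", "pointer",
]


def old_condition(indexing):
    # Hypothetical helper mirroring the 1.0.0 expression.
    # Equality against a string: a list never compares equal to a str.
    return indexing == "tensor_descriptor"


def new_condition(indexing):
    # Hypothetical helper mirroring the expression on latest main.
    # Membership test: true as soon as any list entry matches.
    return "tensor_descriptor" in indexing


print(old_condition(indexing))  # False: branch not taken for a list-valued setting
print(new_condition(indexing))  # True: branch taken for the same config
```

With a per-load list like the one above, the old comparison is always false, so the branch only ever fired for a single string-valued `indexing`. The new membership test fires for any config that uses `tensor_descriptor` for at least one load, which would be consistent with the dropped `loop_unroll_factor` seen in the diff.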

Impact

This essentially means Helion users have to rerun autotuning and update their configs to avoid the potential regression.

On B200, I also noticed that a config autotuned with Helion 1.0.0 fails kernel execution after switching to latest main. I will confirm whether this is related to the same change and provide more data points and an example to reproduce it.

Action items

The commit causing this regression is actually legitimate. It fixes the original code, which was meant to skip configs that can lead to a CUDA misaligned address error. I also hit this CUDA misaligned address error on B200 when autotuning a scaled_mm kernel, and including this commit avoids the issue.

So it is probably unreasonable to revert this commit. Instead, I would suggest adding support for the autotuner to recover from unrecoverable CUDA errors. I see some ongoing work here already.
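As a rough sketch of what "recovering" could look like: errors such as a misaligned address typically leave the process's CUDA context unusable, so one common approach is to benchmark each candidate in a child process and treat a crashed child as a failed config. The code below is not Helion's autotuner or its API. `run_isolated` and the `candidates` snippets are hypothetical stand-ins, and the "crash" is simulated with a non-zero exit rather than a real CUDA fault.

```python
import subprocess
import sys


def run_isolated(snippet, timeout_s=60.0):
    """Run `snippet` in a fresh Python process.

    Returns the child's stdout (stripped) on success, or None if the child
    exited with a non-zero status or exceeded the timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.strip()


# Hypothetical candidates: each snippet stands in for "compile and time one config".
candidates = {
    "good": "print(1.25)",
    # Stands in for a child process brought down by a fatal CUDA error.
    "crashes": "import os; os._exit(1)",
}

# The parent survives the crashing candidate and keeps going.
results = {name: run_isolated(code) for name, code in candidates.items()}
print(results)
```

The trade-off is process startup and re-initialization cost per candidate, so in practice one might isolate only configs flagged as risky, or reuse a worker process until it dies.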
