
Fix testSolutionStructsUtilities Unit Test#5350

Closed
talumbau wants to merge 71 commits into develop from users/talumbau/unit_test_macdatatypeA_fix

wen-des and others added 30 commits January 26, 2026 16:25
## Motivation

Standardize kernel names with MX types.

## Technical Details

Removes redundant underscores, and makes some minor adjustments to make
MX kernel names consistent with the rest of the library.

## Test Plan

Checked kernel names with the mx test.
## Motivation

Fix some errors in codegen that break non-mx tests.

## Technical Details

calcLdsPad was refactored to include LRVWA/B. The argument list and call sites
needed to be updated to match.
bpeGR now contains a floating-point value. Offsets must be converted
to int before being written to assembly, since offsets cannot be floats.
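
The int conversion can be sketched as follows. This is a minimal illustration, not the Tensile code: `byte_offset` is a hypothetical helper, and the 0.5 bytes-per-element value assumes a 4-bit type such as FP4.

```python
def byte_offset(num_elements: int, bpe: float) -> int:
    """Return an integer byte offset for num_elements elements.

    bpe is the (possibly fractional) bytes-per-element value; it is a
    hypothetical stand-in for bpeGR, e.g. 0.5 for a 4-bit type.
    """
    offset = num_elements * bpe
    # Refuse to silently truncate: an offset must be a whole number of bytes.
    if offset != int(offset):
        raise ValueError(f"offset {offset} is not a whole number of bytes")
    return int(offset)


print(byte_offset(8, 0.5))
```

Raising on a fractional result, rather than truncating, is a design choice of this sketch; the commit itself only states that offsets are converted to int.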

## Test Plan

Only tested with stream-k unit test so far (non-mx). That one now
passes.
## Motivation

Enable DTL for gfx950 mxfp4

## Technical Details

- Fixed inconsistent conditions for MXSA/B vreg allocation and local
write code generation
- Temporarily disabled UseGeneralizedNLCOneA/B (causes an issue with
MXSA/B)

## Test Plan

Single test with tensilelite yaml

## Test Result

Local test passed

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…lling Tensile MX kernel (#4599)

## Motivation

This PR enables using mxDataGenerator when Tensile is the host and
supports calling FP4 kernels generated from Tensile.

## Technical Details

- Add a FP4 library (yaml) generated by Tensile under GridBased
category:
[YAML](https://github.com/ROCm/rocm-libraries/blob/1848898c099b127bf77f8306678150953f563648/projects/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/gfx950/GridBased/aquavanjaram_Cijk_Alik_Bljk___F_4_S_S___M_X_A_3_2___M_X_B_3_2___BH_Bias_S_HA_S_SAV_UserArgs.yaml)
- Remove macros so that mxDataGenerator is used regardless of which host
is used. **The default C++ standard is now set to C++20**, as required by
mxDataGenerator.
- Support calling Tensile FP4 solutions


## Test Plan

Use the cmake preset build with the rocRoller host **off** (i.e., use Tensile as
host), the GPU target set to gfx950, and `-DBUILD_TESTING:BOOL=OFF` (turns
off the tensileLite test, which would otherwise error out during the build)

- Use `hipblaslt-test`
`./clients/hipblaslt-test --gtest_filter=*matmul_tensile_fp4*`

- Use `hipblaslt-bench`
`./clients/hipblaslt-bench --iters 0 --cold_iters 0 --transA T --transB
N --a_type f4_r --b_type f4_r --c_type f32_r --d_type f32_r -m 256 -n
256 -k 256 --alpha 2.1 --beta 0.7 --scaleA 3 --scaleB 3 --scale_type
f32_r --verify`

## Test Result


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…ion (#4702)

## Motivation

This PR updates the predicate to check the block size of MX data types.
If the block size is not considered, the wrong solution might be selected.

## Technical Details

- Add MX block size A and B into predicate and serialization
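
A minimal sketch of why the block size belongs in the predicate. The names `MXKey`, `types_only_match`, and `block_size_aware_match` are hypothetical and do not correspond to the Tensile predicate classes; the block size of 16 is only an illustrative value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MXKey:
    """Hypothetical subset of the fields a solution predicate compares."""
    type_a: str
    type_b: str
    mx_block_a: int
    mx_block_b: int


def types_only_match(solution: MXKey, problem: MXKey) -> bool:
    # Predicate without block sizes: only the data types are compared.
    return (solution.type_a, solution.type_b) == (problem.type_a, problem.type_b)


def block_size_aware_match(solution: MXKey, problem: MXKey) -> bool:
    # Predicate with block sizes: types and MX block sizes must both agree.
    return types_only_match(solution, problem) and (
        (solution.mx_block_a, solution.mx_block_b)
        == (problem.mx_block_a, problem.mx_block_b)
    )


solution = MXKey("f4", "f4", 32, 32)
problem = MXKey("f4", "f4", 16, 16)  # same types, different block size
print(types_only_match(solution, problem), block_size_aware_match(solution, problem))
```

With only the types compared, the block-32 solution would be accepted for a block-16 problem; the block-size-aware check rejects it.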

## Test Plan

Manually tried:

`./clients/hipblaslt-bench --iters 0 --cold_iters 0 --transA T --transB
N --a_type f8_r --b_type f8_r --c_type f32_r --d_type f32_r -m 256 -n
256 --alpha 2.1 --beta 0.7 --scaleA 3 --scaleB 3 --scale_type f32_r
--verify`

No solution was found (currently there is no solution for mx f8).

## Test Result


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Add MXFP4 data generator for Tensile

## Technical Details

Integrate hipBLASLt’s MXFP4 data generator into TensileLite.

The goal is to decrease the amount of zero-valued data so that it better
resembles production workloads, thereby improving real‑time performance
accuracy.

After integration, zero frequency drops by 0.5%, and stays in the range
of ~12.5%-13%.
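
For reference, a zero frequency of this kind could be measured as sketched below. This is not the code in MXDataGen_test.cpp: `fp4_zero_frequency` is a hypothetical helper, and it assumes two FP4 (E2M1) values packed per byte, where the codes `0x0` (+0) and `0x8` (-0) both represent zero.

```python
def fp4_zero_frequency(packed: bytes) -> float:
    """Fraction of 4-bit values in `packed` that encode +0 or -0.

    Assumes two FP4 (E2M1) values per byte; the packing order does not
    matter for counting.
    """
    zero_codes = {0x0, 0x8}  # +0 and -0 in E2M1
    total = 2 * len(packed)
    if total == 0:
        return 0.0
    zeros = 0
    for byte in packed:
        zeros += (byte & 0xF) in zero_codes  # low nibble
        zeros += (byte >> 4) in zero_codes   # high nibble
    return zeros / total


# Every 4-bit code equally likely: 2 of 16 codes are zero.
print(fp4_zero_frequency(bytes(range(256))))
```

Under these assumptions, uniformly distributed 4-bit codes give 2/16 = 12.5% zeros, which is the lower end of the range reported above.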

Note: Upgrades to C++20 as mxDataGenerator requires it.

## Test Plan

MXDataGen_test.cpp - verify the integration works & that the zero
frequency is below a fixed threshold.

## Test Result

TBD

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-796
…tToLds (#4683)

## Motivation

Improve MXFP4 performance with DirectToLds

## Technical Details

- Enable DirectToLds for MXSA/B
- Re-enable LdsPad for MXFP4 + DirectToLds
- Added test cases to mx32f4_tn.yaml

## Test Plan

Added test case to tensilelite yaml

## Test Result

Local test passed 

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Builds didn't work if hipBLASLt wasn't already installed.

## Technical Details

Changed the include paths so MXDataGen didn't need hipBLASLT.hpp

## Test Plan

Build locally without hipBLASLt already installed.

## Test Result

Build worked.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Fix a verification failure with MXFP4 + non-DTL

## Technical Details

Fixed an incorrect waitcnt for prefetch local read in the SubIter case.
Also fixed the location of the second waitcnt for the TN case (use MIWaveTileA
// 2 for the SubIter=2 case).

## Test Plan

Local test

## Test Result

Local test passed

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
archana-ramalingam and others added 23 commits March 4, 2026 14:30
…#4959)

## Motivation

Write test to catch post-generation overwrites of `MXSA`/`MXSB` buffers
that desync the CPU reference from GPU data, causing intermittent MX FP4
validation failures.

## Technical Details

Added `MXScaleDeterminismTest/ScaleBufferIsDeterministic` to
`MXDataGen_test.cpp` - calls `generateMXInput` twice with
sentinel-initialized scale buffers (`0x00` and `0xFF`) and asserts
equality - the differing sentinels prevent a silent no-write from
passing.

Includes non-32-aligned cases (M=204, M=213) that failed due to
`initializeCPUInputs` overwriting `MXSA`/`MXSB` after `generateMXInput`
populated them.
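
The sentinel idea can be sketched in a few lines. This is not the C++ test itself: `generate_scales` is a hypothetical stand-in for `generateMXInput`, and `is_deterministic` is an illustrative helper.

```python
import random


def generate_scales(buf: bytearray, seed: int) -> None:
    """Hypothetical generator: fills every byte deterministically from seed."""
    rng = random.Random(seed)
    for i in range(len(buf)):
        buf[i] = rng.randrange(256)


def is_deterministic(generate, size: int, seed: int) -> bool:
    # Different sentinels: if the generator writes nothing, the buffers
    # keep their distinct fill values and the comparison fails.
    first = bytearray([0x00]) * size
    second = bytearray([0xFF]) * size
    generate(first, seed)
    generate(second, seed)
    return first == second


print(is_deterministic(generate_scales, 64, seed=7))
print(is_deterministic(lambda buf, seed: None, 64, seed=7))
```

If both buffers started from the same sentinel, a generator that never wrote to them would pass the equality check; the differing sentinels rule that case out.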

## Test Plan

- `MXScaleDeterminismTest/ScaleBufferIsDeterministic` — 4 cases

## Test Result

- [x] All tests passing

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Summary

- When `gfx950_mx_rebase` split `forceLrvwTile1` into per-tensor
definitions (using `MacDataTypeA`/`MacDataTypeB` instead of `DataType`),
the B tensor copy lost the `not UseF32XEmulation` guard present on
`develop`
- This caused TF32 kernels to get `lrvwTileB=1` instead of
`VectorWidthB=2`, halving the local read block width and doubling the
number of LRB instructions (8 → 16), breaking CMS schedules
- One-line fix: add `and (not kernel["UseF32XEmulation"])` to the B
tensor `forceLrvwTile1` definition
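
A minimal sketch of the effect of the guard. The rest of the `forceLrvwTile1` condition is not shown in this summary, so it is represented by the hypothetical parameter `base_condition`; `lrvw_tile_b` is likewise an illustrative helper, not a Tensile function.

```python
def lrvw_tile_b(kernel: dict, base_condition: bool, guarded: bool) -> int:
    """Return the local read vector width tile for B in this toy model.

    base_condition stands in for the unshown part of the definition.
    guarded selects whether the UseF32XEmulation guard is applied.
    """
    force_lrvw_tile1 = base_condition
    if guarded:
        force_lrvw_tile1 = force_lrvw_tile1 and (not kernel["UseF32XEmulation"])
    return 1 if force_lrvw_tile1 else kernel["VectorWidthB"]


tf32_kernel = {"UseF32XEmulation": True, "VectorWidthB": 2}
print(lrvw_tile_b(tf32_kernel, base_condition=True, guarded=False))
print(lrvw_tile_b(tf32_kernel, base_condition=True, guarded=True))
```

Without the guard the toy TF32 kernel falls back to a width of 1; with the guard it keeps `VectorWidthB`.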

## Test plan

- [x] `xfp32.yaml` passes
- [x] `cms_tf32_nt.yaml` passes (extracted from
`custom_mainloop_scheduling_tf32.yaml`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

AIHPBLAS-1221
AIHPBLAS-1224

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…ment for DTL (#5117)

## Motivation

Enable MT256x256x256 for MXFP4. This requires StoreSwapAddr and a GRVWMXSA/B
adjustment for DTL.

## Technical Details

- Enable StoreSwapAddr for MXFP4
- Added GRVWMXSA/B adjustment logic for DTL (64bit to 32bit)

## Test Plan

CI test

## Test Result

CI test

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…5129)

## Motivation

Grouped GEMM with `UseUserArgs: True` crashes with
`hipErrorIllegalAddress` because the kernel's `userArgsInfo.totalSize`
doesn't match `sizeof(DeviceUserArguments)`.

## Technical Details

`DeviceUserArguments` is a fixed 196-byte packed struct, but
`userArgsInfo` field sizes in `Signature.py` were only set inside
feature-gated blocks, leaving them at 0 when features were disabled.

This made `totalSize` (stride between per-problem entries) and
`extArgOffset` (epilogue field offsets) smaller than the struct, causing
the kernel to read from wrong offsets for multi-problem grouped GEMMs.

Fix: move size assignments outside conditional blocks to always match
the struct layout.
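
The stride mismatch can be illustrated with a toy layout. The field names and sizes below are hypothetical and much smaller than the real 196-byte `DeviceUserArguments`; the functions are illustrative, not the code in `Signature.py`.

```python
# Hypothetical fixed layout: every field exists in the struct regardless
# of which features a kernel enables.
TOY_LAYOUT = {"core": 16, "bias": 8, "scaleAlphaVec": 8}
TOY_STRUCT_SIZE = sum(TOY_LAYOUT.values())  # 32 bytes in this toy model


def sizes_gated(enabled: set) -> dict:
    """Buggy pattern: optional field sizes stay 0 when the feature is off."""
    return {
        name: (size if name == "core" or name in enabled else 0)
        for name, size in TOY_LAYOUT.items()
    }


def sizes_fixed(enabled: set) -> dict:
    """Fixed pattern: sizes always match the struct layout."""
    return dict(TOY_LAYOUT)


def entry_offset(sizes: dict, problem_index: int) -> int:
    # totalSize acts as the stride between per-problem entries.
    return problem_index * sum(sizes.values())


print(entry_offset(sizes_gated(set()), 1), entry_offset(sizes_fixed(set()), 1))
```

With all optional features disabled, the gated sizes place the second problem at offset 16 while the host struct puts it at offset 32, so every problem after the first is read from the wrong address.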

## Test Plan

Existing test: `grouped_gemm_userargs.yaml` — validates grouped GEMM
with `UseUserArgs: True` across ScaleAlphaVec, Bias+Activation, and
plain configurations.

## Test Result

- [x] Test passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-1220
## Motivation

Comment out two problem sizes of the new mx f8 tests that currently fail.

## Technical Details



## Test Plan

Manually tested other problem sizes to verify they pass.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Improve MXFP4 instruction scheduling and reduce sgpr usage for
MXFP4 + StreamK

## Technical Details

- Added if condition for MXFP4 in getMFMAIssueLatency
- Fixed numGRIncInst calculation for MXFP4
- Use MinGRIncPerMfma=3 for MXFP4
- Disable staggerU code for MXFP4 + StreamK

## Test Plan

CI test

## Test Result

Check CI test

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Generate pre-shuffled scaling data that TensileLite can use for MX
datatypes

## Technical Details

Added optional `--enable-mx-preswizzle` flag to control MX FP4 scale
pre-swizzle.

When enabled, the flag allows scale tensors to be rearranged into gfx950
GPU kernel memory access patterns using parameters `{swizzleTileMN=32,
tileK=8, subTileK=4}` and `{tileK=8, swizzleTileMN=32}`.

Pre-swizzle requirements: flag enabled, solution available, `useScaleAB`
populated, and dimension alignment satisfied (rows%256==0, cols%32==0
for mxBlock=32).
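
The requirements above can be read as a single eligibility check, sketched below. `can_preswizzle` and its parameter names are hypothetical; only the mxBlock=32 alignment rule is stated here, so other block sizes are rejected in this sketch.

```python
def can_preswizzle(
    flag_enabled: bool,
    has_solution: bool,
    use_scale_ab: bool,
    rows: int,
    cols: int,
    mx_block: int = 32,
) -> bool:
    """Return True when all pre-swizzle requirements listed above hold."""
    if mx_block != 32:
        # Alignment rules for other block sizes are not given; be conservative.
        return False
    aligned = rows % 256 == 0 and cols % 32 == 0
    return bool(flag_enabled and has_solution and use_scale_ab and aligned)


print(can_preswizzle(True, True, True, rows=512, cols=64))
print(can_preswizzle(True, True, True, rows=204, cols=64))
```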

## Test Plan

`MXPreSwizzleTest` validates scale permutation behavior and data buffer
invariance across multiple matrix dimensions and transpose modes.

## Test Result

Tests pass successfully

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-789
## Motivation

Fix failure with kringshift.yaml on gfx950_mx_rebase branch

## Technical Details

Swapped the location of graAddresses and graFinalOffsets to align with
the develop branch.
Before this fix, s[sgprSrdB+0] was referenced before initialization in
the kringshift case.

## Test Plan

Run kringshift.yaml

## Test Result

kringshift.yaml passed.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Optimize sgpr allocation for StoreSwapAddr

## Technical Details

- use 1 common sgpr (swapCommon) for all (A,B,MXSA,MXSB)
- use s_add m0, swapCommon, LocalWriteAddr for m0 initialization

## Test Plan

Run existing dtl test

## Test Result

dtl.yaml passed

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

MXFP4 + DtlPlusLdsBuf was broken and needs to be enabled

## Technical Details

Added missing code for MXFP4 + DtlPlusLdsBuf

## Test Plan

Local test with MXFP4 + DtlPlusLdsBuf

## Test Result

Local test passed

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Fix scheduling issues with MXFP4 + SIA3

## Technical Details

- fix incorrect local read count for MXSA/B in getLocalWriteMFMAStart
- fix incorrect global read count for MXSA/B in getNumLocalWritePerMfma
- optimize PointerLRCode scheduling for MXFP4

## Test Plan

tensilelite common test

## Test Result

tensilelite common test passed

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

This PR fixes failed gfx942 tests in CI.

## Technical Details

For fnuz/non-fnuz f8 types, the `ComputeInputType` might appear in a
concatenated form: `ComputeInputTypeAComputeInputTypeB` (e.g.,
`BFloat8Float8_fnuz`). This leads to type mismatches when selecting
solutions. This PR adds logic to correctly interpret concatenated types
by checking the type at the corresponding position.
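
One way to interpret such a name by position is sketched below. This is an assumption-heavy illustration, not the logic added by the PR: `component_type` is a hypothetical helper, `BASE_TYPES` is a hypothetical subset of type names, and the sketch assumes that a trailing `_fnuz` applies to both components.

```python
BASE_TYPES = ("BFloat8", "Float8")  # hypothetical subset of base type names


def component_type(name: str, position: int) -> str:
    """Return the type at position 0 (A) or 1 (B) of a possibly concatenated name.

    Assumes a trailing "_fnuz" suffix is shared by both components.
    """
    suffix = "_fnuz" if name.endswith("_fnuz") else ""
    core = name[: len(name) - len(suffix)]
    if core in BASE_TYPES:
        return name  # not concatenated: A and B share the same type
    for type_a in BASE_TYPES:
        for type_b in BASE_TYPES:
            if core == type_a + type_b:
                return (type_a, type_b)[position] + suffix
    raise ValueError(f"unrecognized compute input type: {name}")


print(component_type("BFloat8Float8_fnuz", 0))
print(component_type("BFloat8Float8_fnuz", 1))
```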

## Test Plan

Manually run these tests:

gfx942:
- `./clients/hipblaslt-test
--gtest_filter=*pre_checkin_matmul_real_1b_fnuz_dst_1b_fnuz_smallsize_bf8_fnuz_rf8_fnuz_rbf8_fnuz_rbf8_fnuz_rf32_r_NT_3_128_128_1_3_128_2_3_3_1_SA_SB*`

gfx950:
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_dst_bf16*`
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_bf16*`
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_fp32*`


## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…5273)

## Motivation
- The error:
[math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628](https://math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628)
- Fix the inconsistency between LocalRead.py and KernelWriterAssembly.py
- Change `setComputeInputType` to `setComputeInputTypeA` and
`setComputeInputTypeB`, because `setComputeInputType` doesn't exist
anymore.

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

- On the LocalRead side, the code referenced the wrong names,
`sgprPackKForV2` and `sgprPackKForV3`, which are sgprs that are never set.
It now references `sgprPackKForMV2`/`sgprPackKForMV3`.

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

- Please use the attached file `setup_hipsparselt_build` to set the
packages and env variables the same as the CI.

[setup_hipsparselt_build.sh](https://github.com/user-attachments/files/25870143/setup_hipsparselt_build.sh)
- `./install.sh -c`
<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

- first build with errors: 

[build_hipsparselt_1st.log](https://github.com/user-attachments/files/25870212/build_hipsparselt_1st.log)

- final build WITHOUT errors: 

[build_hipsparselt_final.log](https://github.com/user-attachments/files/25870348/build_hipsparselt_final.log)


<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: tomchengchitang <tom.tang@amd.com>
- Add MacDataTypeA to mock kernel since it is required for getting the
  MI data type now from the utility function.
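
A minimal sketch of why the mock needs the extra key. `get_mi_data_type` is a hypothetical stand-in for the utility function, and the `"Half"` values are placeholders; the real function and the real mock kernel contain more than is shown here.

```python
def get_mi_data_type(kernel: dict) -> str:
    """Hypothetical stand-in: the MI data type is read from MacDataTypeA."""
    return kernel["MacDataTypeA"]


# Mock kernel as it might have looked before the fix (placeholder values).
old_mock = {"DataType": "Half"}
# Mock kernel after the fix: MacDataTypeA is present.
new_mock = {"DataType": "Half", "MacDataTypeA": "Half"}

print(get_mi_data_type(new_mock))
try:
    get_mi_data_type(old_mock)
except KeyError as missing:
    print(f"missing key: {missing}")
```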