Fix testSolutionStructsUtilities Unit Test by talumbau · Pull Request #5350 · ROCm/rocm-libraries

talumbau · 2026-03-11T18:34:25Z

Solution and kernel writer rebased
Contraction and DataType rebase
client modification
Parameters and LocalRead rebased
rocisa support and other changes
Components rebased
writer related change and yaml
Conversion rebased
Fixed errors in compiling
Fixed python space issues
Bugfixed in python files and generated kernel successfully
Disable swap address for mxsa/mxsb
Committed some missing fixes
Fixed mxsa/mxsb address offset
Added TODO memo for later consideration
bpe function fix
Bugfixed for the wrong address offset calculation
MX F8 functional tests passed in tensilelite
Updated f8 yaml file
Removed the mx f6 yaml files for mx f6 is not ready by now
Updated f4 yaml file for test coverage
Standardize kernel names with MX types (#4363)
Fix some errors breaking non-mx tests on mx branch (#4616)
Fix for gfx950 mxfp4 DirectToLds (#4644)
[hipBLASLt] Enable MX data generation for Tensile host and support calling Tensile MX kernel (#4599)
[hipBLASLt] Add block size into predicate for correct solution selection (#4702)
[Tensilelite] Add MXFP4 data generator for Tensile (#4597)
Enable DirectToLds for MXSA/B and re-enable LdsPad for MXFP4 + DirectToLds (#4683)
Fix data initialization (#4827)
Fix a verification fail with MXFP4 + non DTL (#4715)
[hipblaslt] Fixing build issues for gfx_950_mx_rebase (#4465)
[TensileLite] Fix MX FP4 scale data overwrite in initializeCPUInputs ([TensileLite] Fix MX FP4 scale data initialization & argument scoping #4917)
Fix stream-k with mx scaling (#4388)
[hipblaslt] Fix fails with dtl.yaml and xfp32.yaml on gfx950_mx_rebase (#4906)
add kernel["ProblemType"]["Sparse"] to condition
fix dependency issues for tensilelite clients
fix computeInputType in tensilelite
fix computeInputType issue in ReferenceValidator.cpp
[hipblaslt] fix unit tests for gfx950_mx_rebase (#4912)
[hipblaslt] Fix a verification fail with spmm_i8hs.yaml (#5034)
initial set of testcase for MXFP4 (#4739)
[Tensilelite] Add regression test for MX FP4 scale buffer determinism (#4959)
UseF32XEmulation in forceLrvwTile1 for B tensor (Fix bad merge: restore UseF32XEmulation in forceLrvwTile1 for B tensor #5143)
[hipblaslt] Enable StoreSwapAddr for MXFP4, plus add GRVWMXSA/B adjustment for DTL (#5117)
[Tensilelite] Fix UserArgs struct stride mismatch in grouped GEMM (#5129)
[hipBLASLt] Disable failed mx f8 problem sizes (#5105)
[hipblaslt] Scheduling related fixes for MXFP4 (#5169)
remove explicit constructor from variable_value class
fix return statement in hipDataType_to_tensile_type and add type check for activation inputs
[Tensilelite] Shuffle mx scaling data in Tensile (#4864)
[hipblaslt] Fix fail with kringshift.yaml (#5228)
[hipblaslt] Optimize StoreSwapAddr (#5217)
[hipblaslt] Enable MXFP4 + DtlPlusLdsBuf (#5251)
Fix gfx12 build error with integer cast
[hipblaslt] Fix SIA3 issues with MXFP4 (#5245)
[hipBLASLt] Fix CI failures for gfx942 (#5216)
Make the usage side’s logic consistent with allocation side (tPackM) (#5273)
Fix test_SolutionStructsUtilities

## Motivation
Standardize kernel names with MX types.

## Technical Details
Removes redundant underscores and makes some minor adjustments to make MX kernel names consistent with the rest of the library.

## Test Plan
Checked kernel names with the mx test.

## Motivation
Fix some errors in codegen that break non-mx tests.

## Technical Details
calcLdsPad was refactored to include LRVWA/B, so its argument list and call sites needed to be updated to match. bpeGR now contains a floating-point value, so offsets must be converted to int before they are written to assembly, since offsets cannot be floats.

## Test Plan
Only tested with the stream-k unit test so far (non-mx). That test now passes.
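The float-to-int conversion described above can be sketched as follows. This is a minimal illustration rather than the Tensile code: `byte_offset` is a hypothetical helper, and the value 0.5 bytes per element for FP4 is an assumption based on two 4-bit elements sharing one byte.

```python
def byte_offset(num_elements: int, bytes_per_element: float) -> int:
    """Convert an element count into an integer byte offset.

    Hypothetical helper: bytes_per_element may be fractional (assumed 0.5
    for FP4), but the value emitted into assembly must be an integer.
    """
    offset = num_elements * bytes_per_element
    if not float(offset).is_integer():
        # A fractional byte offset cannot be encoded, so fail loudly
        # instead of silently truncating.
        raise ValueError(f"non-integer byte offset: {offset}")
    return int(offset)


# Formatting the offset into text only works cleanly once it is an int.
line = f"offset:{byte_offset(64, 0.5)}"
```

Rejecting fractional results instead of truncating them is a design choice of this sketch; it surfaces odd element counts early rather than emitting a wrong address.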

## Motivation
Enable DTL for gfx950 mxfp4

## Technical Details
- Fixed inconsistent conditions for MXSA/B vreg allocation and local write code generation
- Temporarily disabled UseGeneralizedNLCOneA/B (causes issue with MXSA/B)

## Test Plan
Single test with tensilelite yaml

## Test Result
Local test passed

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…lling Tensile MX kernel (#4599)

## Motivation
This PR enables using mxDataGenerator when Tensile is the host and supports calling FP4 kernels generated from Tensile.

## Technical Details
- Add an FP4 library (yaml) generated by Tensile under the GridBased category: [YAML](https://github.com/ROCm/rocm-libraries/blob/1848898c099b127bf77f8306678150953f563648/projects/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/gfx950/GridBased/aquavanjaram_Cijk_Alik_Bljk___F_4_S_S___M_X_A_3_2___M_X_B_3_2___BH_Bias_S_HA_S_SAV_UserArgs.yaml)
- Remove macros so that mxDataGenerator is used regardless of which host is used. **The default C++ standard is now set to C++20**, as mxDataGenerator requires it.
- Support calling Tensile FP4 solutions

## Test Plan
Use the cmake preset build with the rocRoller host **off** (i.e., use Tensile as host), the gpu target set to gfx950, and `-DBUILD_TESTING:BOOL=OFF` (turns off the tensileLite test, which would otherwise error out during the build).
- Use `hipblaslt-test`: `./clients/hipblaslt-test --gtest_filter=*matmul_tensile_fp4*`
- Use `hipblaslt-bench`: `./clients/hipblaslt-bench --iters 0 --cold_iters 0 --transA T --transB N --a_type f4_r --b_type f4_r --c_type f32_r --d_type f32_r -m 256 -n 256 -k 256 --alpha 2.1 --beta 0.7 --scaleA 3 --scaleB 3 --scale_type f32_r --verify`

## Test Result

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…ion (#4702)

## Motivation
This PR updates the predicate to check the block size of mx data types. If the block size is not considered, the wrong solution might be selected.

## Technical Details
- Add MX block size A and B into predicate and serialization

## Test Plan
Manually tried: `./clients/hipblaslt-bench --iters 0 --cold_iters 0 --transA T --transB N --a_type f8_r --b_type f8_r --c_type f32_r --d_type f32_r -m 256 -n 256 --alpha 2.1 --beta 0.7 --scaleA 3 --scaleB 3 --scale_type f32_r --verify`

No solution was found, as expected (there is currently no solution for mx f8).

## Test Result

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation
Add MXFP4 data generator for Tensile

## Technical Details
Integrate hipBLASLt's MXFP4 data generator into TensileLite. The goal is to decrease the amount of zero-valued data so that it better resembles production workloads, thereby improving real-time performance accuracy. After integration, zero frequency drops by 0.5% and stays in the range of ~12.5%-13%.

Note: Upgrades to C++20, as mxDataGenerator requires it.

## Test Plan
MXDataGen_test.cpp: verify that the integration works and that the zero frequency is below a fixed threshold.

## Test Result
TBD

## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-796
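One way to measure the zero frequency mentioned above is sketched below. This is an illustration only and not the code in `MXDataGen_test.cpp`: `fp4_zero_frequency` is a hypothetical name, and the sketch assumes E2M1-encoded FP4 values packed two per byte with the sign in the top bit of each nibble, so that both `0x0` (+0) and `0x8` (-0) count as zero.

```python
def fp4_zero_frequency(packed: bytes) -> float:
    """Return the fraction of 4-bit elements that encode +0 or -0.

    Assumes two FP4 (E2M1) elements per byte; the order of the two
    nibbles within a byte does not matter for counting.
    """
    if not packed:
        return 0.0
    zeros = 0
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Masking off the sign bit (0x8) leaves exponent and mantissa;
            # if those three bits are all clear, the element is +0 or -0.
            if nibble & 0x7 == 0:
                zeros += 1
    return zeros / (2 * len(packed))


sample = bytes([0x00, 0x18, 0x7F, 0x21])
frequency = fp4_zero_frequency(sample)
```

A test in this style would compare `frequency` against a fixed threshold; the source does not state the threshold value, so none is assumed here.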

…tToLds (#4683)

## Motivation
Improve MXFP4 performance with DirectToLds

## Technical Details
- Enable DirectToLds for MXSA/B
- Re-enable LdsPad for MXFP4 + DirectToLds
- Added test cases to mx32f4_tn.yaml

## Test Plan
Added test case to tensilelite yaml

## Test Result
Local test passed

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation
Builds didn't work if hipBLASLt wasn't already installed.

## Technical Details
Changed the include paths so MXDataGen didn't need hipBLASLT.hpp

## Test Plan
Build locally without hipBLASLt already installed.

## Test Result
Build worked.

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation
Fix a verification fail with MXFP4 + non DTL

## Technical Details
- Fixed an incorrect waitcnt for prefetch local read in the SubIter case
- Fixed the location of the second waitcnt for the TN case (use MIWaveTileA // 2 for the SubIter=2 case)

## Test Plan
Local test

## Test Result
Local test passed

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…#4959)

## Motivation
Write a test to catch post-generation overwrites of `MXSA`/`MXSB` buffers that desync the CPU reference from GPU data, causing intermittent MX FP4 validation failures.

## Technical Details
Added `MXScaleDeterminismTest/ScaleBufferIsDeterministic` to `MXDataGen_test.cpp`. The test calls `generateMXInput` twice with sentinel-initialized scale buffers (`0x00` and `0xFF`) and asserts equality; the differing sentinels prevent a silent no-write from passing. It includes non-32-aligned cases (M=204, M=213) that failed because `initializeCPUInputs` overwrote `MXSA`/`MXSB` after `generateMXInput` populated them.

## Test Plan
- `MXScaleDeterminismTest/ScaleBufferIsDeterministic`: 4 cases

## Test Result
- [x] All tests passing

## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
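The sentinel technique described above can be sketched in a few lines. This is a simplified illustration, not the C++ test: `generate_scales` and `no_write` are hypothetical stand-ins for a scale generator, and only the idea of pre-filling the two buffers with different sentinel bytes is taken from the description.

```python
import random


def generate_scales(buf: bytearray, seed: int) -> None:
    """Hypothetical generator: deterministically overwrites every byte."""
    rng = random.Random(seed)
    for i in range(len(buf)):
        buf[i] = rng.randrange(256)


def no_write(buf: bytearray, seed: int) -> None:
    """Hypothetical broken generator that never touches the buffer."""


def scale_buffer_is_deterministic(generator, size: int, seed: int) -> bool:
    # Pre-fill the two buffers with different sentinels. If the generator
    # skips any byte, the leftover 0x00 and 0xFF values differ and the
    # comparison fails instead of passing by accident.
    first = bytearray([0x00]) * size
    second = bytearray([0xFF]) * size
    generator(first, seed)
    generator(second, seed)
    return first == second


ok = scale_buffer_is_deterministic(generate_scales, 204, seed=7)
caught = not scale_buffer_is_deterministic(no_write, 204, seed=7)
```

Had both buffers started from the same sentinel, `no_write` would have passed the equality check, which is exactly the silent failure the differing sentinels rule out.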


## Summary
- When `gfx950_mx_rebase` split `forceLrvwTile1` into per-tensor definitions (using `MacDataTypeA`/`MacDataTypeB` instead of `DataType`), the B tensor copy lost the `not UseF32XEmulation` guard present on `develop`
- This caused TF32 kernels to get `lrvwTileB=1` instead of `VectorWidthB=2`, halving the local read block width and doubling the number of LRB instructions (8 → 16), breaking CMS schedules
- One-line fix: add `and (not kernel["UseF32XEmulation"])` to the B tensor `forceLrvwTile1` definition

## Test plan
- [x] `xfp32.yaml` passes
- [x] `cms_tf32_nt.yaml` passes (extracted from `custom_mainloop_scheduling_tf32.yaml`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

AIHPBLAS-1221
AIHPBLAS-1224

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


…ment for DTL (#5117)

## Motivation
Enable MT256x256x256 for MXFP4. Need StoreSwapAddr and GRVWMXSA/B adjustment for DTL

## Technical Details
- Enable StoreSwapAddr for MXFP4
- Added GRVWMXSA/B adjustment logic for DTL (64bit to 32bit)

## Test Plan
CI test

## Test Result
CI test

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…5129)

## Motivation
Grouped GEMM with `UseUserArgs: True` crashes with `hipErrorIllegalAddress` because the kernel's `userArgsInfo.totalSize` doesn't match `sizeof(DeviceUserArguments)`.

## Technical Details
`DeviceUserArguments` is a fixed 196-byte packed struct, but `userArgsInfo` field sizes in `Signature.py` were only set inside feature-gated blocks, leaving them at 0 when features were disabled. This made `totalSize` (the stride between per-problem entries) and `extArgOffset` (the epilogue field offsets) smaller than the struct, causing the kernel to read from wrong offsets for multi-problem grouped GEMMs.

Fix: move size assignments outside conditional blocks so they always match the struct layout.

## Test Plan
Existing test: `grouped_gemm_userargs.yaml`, which validates grouped GEMM with `UseUserArgs: True` across ScaleAlphaVec, Bias+Activation, and plain configurations.

## Test Result
- [x] Test passed

## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-1220
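The stride mismatch described above can be reproduced with a toy model. This is a sketch under stated assumptions, not the contents of `Signature.py`: the field names and sizes in `FIELDS` are invented for illustration (they do not add up to the real 196 bytes), and `total_size_gated` / `total_size_fixed` are hypothetical functions contrasting the two approaches.

```python
# Hypothetical layout: (field name, size in bytes, gating feature or None).
FIELDS = [
    ("sizes", 16, None),
    ("alpha", 4, None),
    ("scaleAlphaVec", 8, "ScaleAlphaVec"),
    ("bias", 8, "Bias"),
    ("activationArg", 4, "Activation"),
]


def total_size_gated(enabled: set) -> int:
    """Buggy pattern: a field only contributes when its feature is on."""
    return sum(size for _, size, feat in FIELDS
               if feat is None or feat in enabled)


def total_size_fixed() -> int:
    """Fixed pattern: every field always contributes, matching the struct."""
    return sum(size for _, size, _ in FIELDS)


def entry_offset(problem_index: int, stride: int) -> int:
    """Byte offset of the arguments for one problem in a grouped GEMM."""
    return problem_index * stride


host_stride = total_size_fixed()         # what the host-side struct occupies
kernel_stride = total_size_gated(set())  # what the kernel would assume
mismatch = entry_offset(1, host_stride) - entry_offset(1, kernel_stride)
```

With a single problem both strides give offset 0, which is consistent with the failure only showing up for multi-problem grouped GEMMs.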

## Motivation
Comment out two problem sizes of the new mx f8 tests that currently fail.

## Technical Details

## Test Plan
Manually tested other problem sizes to verify they pass.

## Test Result

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation
Improve MXFP4 instruction scheduling and reduce sgpr usage for MXFP4+StreamK

## Technical Details
- Added if condition for MXFP4 in getMFMAIssueLatency
- Fixed numGRIncInst calculation for MXFP4
- Use MinGRIncPerMfma=3 for MXFP4
- Disable staggerU code for MXFP4 + StreamK

## Test Plan
CI test

## Test Result
Check CI test

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.



## Motivation
Generate pre-shuffled scaling data that TensileLite can use for MX datatypes

## Technical Details
Added optional `--enable-mx-preswizzle` flag to control MX FP4 scale pre-swizzle. When enabled, the flag allows scale tensors to be rearranged into gfx950 GPU kernel memory access patterns using parameters `{swizzleTileMN=32, tileK=8, subTileK=4}` and `{tileK=8, swizzleTileMN=32}`.

Pre-swizzle requirements: flag enabled, solution available, `useScaleAB` populated, and dimension alignment satisfied (rows%256==0, cols%32==0 for mxBlock=32).

## Test Plan
`MXPreSwizzleTest` validates scale permutation behavior and data buffer invariance across multiple matrix dimensions and transpose modes.

## Test Result
Tests pass successfully

## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-789
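The pre-swizzle requirements listed above amount to a single predicate. The sketch below is an illustration only: `can_preswizzle` and its parameter names are hypothetical, and the sketch conservatively returns `False` for any `mx_block` other than 32, since the alignment rule is only stated for mxBlock=32.

```python
def can_preswizzle(flag_enabled: bool, has_solution: bool, use_scale_ab,
                   rows: int, cols: int, mx_block: int = 32) -> bool:
    """Return True when all stated pre-swizzle requirements hold."""
    if mx_block != 32:
        # The alignment rule is only documented for mxBlock=32
        # (assumption: other block sizes are not eligible).
        return False
    aligned = rows % 256 == 0 and cols % 32 == 0
    return bool(flag_enabled and has_solution and use_scale_ab and aligned)


eligible = can_preswizzle(True, True, "Vector", rows=256, cols=64)
unaligned = can_preswizzle(True, True, "Vector", rows=204, cols=64)
```

Here `"Vector"` is just a placeholder for a populated `useScaleAB` value; any truthy value behaves the same way in this sketch.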

## Motivation
Fix fail with kringshift.yaml on gfx950_mx_rebase branch

## Technical Details
Swapped the location of graAddresses and graFinalOffsets to align with the develop branch. Before this fix, s[sgprSrdB+0] was referenced before initialization in the kringshift case.

## Test Plan
Run kringshift.yaml

## Test Result
kringshift.yaml passed.

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation
Optimize sgpr allocation for StoreSwapAddr

## Technical Details
- use 1 common sgpr (swapCommon) for all (A,B,MXSA,MXSB)
- use s_add m0, swapCommon, LocalWriteAddr for m0 initialization

## Test Plan
Run existing dtl test

## Test Result
dtl.yaml passed

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation
MXFP4 + DtlPlusLdsBuf was broken and needs to be enabled

## Technical Details
Added missing code for MXFP4 + DtlPlusLdsBuf

## Test Plan
Local test with MXFP4 + DtlPlusLdsBuf

## Test Result
Local test passed

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation
Fix scheduling issues with MXFP4 + SIA3

## Technical Details
- fix incorrect local read count for MXSA/B in getLocalWriteMFMAStart
- fix incorrect global read count for MXSA/B in getNumLocalWritePerMfma
- optimize PointerLRCode scheduling for MXFP4

## Test Plan
tensilelite common test

## Test Result
tensilelite common test passed

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation
This PR fixes failed gfx942 tests in CI.

## Technical Details
For fnuz/non-fnuz f8 types, the `ComputeInputType` might appear in a concatenated form: `ComputeInputTypeAComputeInputTypeB` (e.g., `BFloat8Float8_fnuz`). This leads to type mismatches when selecting solutions. This PR adds logic to correctly interpret concatenated types by checking the type at the corresponding position.

## Test Plan
Manually run these tests:

gfx942:
- `./clients/hipblaslt-test --gtest_filter=*pre_checkin_matmul_real_1b_fnuz_dst_1b_fnuz_smallsize_bf8_fnuz_rf8_fnuz_rbf8_fnuz_rbf8_fnuz_rf32_r_NT_3_128_128_1_3_128_2_3_3_1_SA_SB*`

gfx950:
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_dst_bf16*`
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_bf16*`
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_fp32*`

## Test Result

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
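A position-based reading of a concatenated type name, as described above, might look like the following. This is a sketch and not the logic added by the PR: `split_compute_input_type` is a hypothetical function, `BASE_TYPES` is an assumed set of base names, and treating a trailing `_fnuz` as applying to both the A and B types is also an assumption.

```python
BASE_TYPES = ("BFloat8", "Float8")  # assumed base names for f8 types


def split_compute_input_type(name: str) -> tuple:
    """Split a possibly concatenated name into (type for A, type for B)."""
    # Assumption: a trailing "_fnuz" qualifies both halves of the name.
    suffix = "_fnuz" if name.endswith("_fnuz") else ""
    core = name[: len(name) - len(suffix)]
    for a_type in BASE_TYPES:
        if core == a_type:
            # Not concatenated: A and B share the same type.
            return (a_type + suffix, a_type + suffix)
        rest = core[len(a_type):]
        if core.startswith(a_type) and rest in BASE_TYPES:
            # First position is the A type, second position is the B type.
            return (a_type + suffix, rest + suffix)
    raise ValueError(f"unrecognized compute input type: {name}")


a_type, b_type = split_compute_input_type("BFloat8Float8_fnuz")
```

Comparing each operand against the type at its own position, rather than against the whole concatenated string, is what avoids the mismatch during solution selection.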

…5273)

## Motivation
- The error: [math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628](https://math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628)
- Fix the inconsistency between LocalRead.py and KernelWriterAssembly.py
- Change `setComputeInputType` to `setComputeInputTypeA` and `setComputeInputTypeB`, because `setComputeInputType` doesn't exist anymore.

## Technical Details
- The LocalRead side references the wrong names, `sgprPackKForV2` and `sgprPackKForV3`, which are sgprs that are never set. It should reference `sgprPackKForMV2`/`sgprPackKForMV3`, so this change makes it reference the right ones.

## Test Plan
- Use the attached file `setup_hipsparselt_build` to set the packages and env variables the same as the CI. [setup_hipsparselt_build.sh](https://github.com/user-attachments/files/25870143/setup_hipsparselt_build.sh)
- `./install.sh -c`

## Test Result
- first build with errors: [build_hipsparselt_1st.log](https://github.com/user-attachments/files/25870212/build_hipsparselt_1st.log)
- final build WITHOUT errors: [build_hipsparselt_final.log](https://github.com/user-attachments/files/25870348/build_hipsparselt_final.log)

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: tomchengchitang <tom.tang@amd.com>

- Add MacDataTypeA to mock kernel since it is required for getting the MI data type now from the utility function.

wen-des and others added 30 commits January 26, 2026 16:25

Solution and kernel writer rebased

3e4e91d

Contraction and DataType rebase

148d196

client modification

3b41f20

Parameters and LocalRead rebased

1bc4230

rocisa support and other changes

32d700f

Components rebased

81c40d2

writer related change and yaml

28f2602

Conversion rebased

d89ad91

Fixed errors in compiling

2822693

Fixed python space issues

0df3258

Bugfixed in python files and generated kernel successfully

089cd5a

Disable swap address for mxsa/mxsb

74d6373

Committed some missing fixes

e7cb8f6

Fixed mxsa/mxsb address offset

f56cb78

Added TODO memo for later consideration

0649b1e

bpe function fix

b81161d

Bugfixed for the wrong address offset calculation

ee1714e

MX F8 functional tests passed in tensilelite

c89efb4

Updated f8 yaml file

cf1251a

Removed the mx f6 yaml files for mx f6 is not ready by now

8478d05

Updated f4 yaml file for test coverage

67b50bd

archana-ramalingam and others added 23 commits March 4, 2026 14:30

Merge branch 'gfx950_mx_rebase' into users/hongjche/gfx950_mx_rebase_sync

f83ef8e

Merge branch 'develop' into users/hongjche/gfx950_mx_rebase_sync

24c36e1

Merge branch 'users/hongjche/gfx950_mx_rebase_sync' into gfx950_mx_rebase

8de6b1a

Merge branch 'develop' into users/hongjche/gfx950_mx_rebase_sync

20b1923

remove explicit constructor from variable_value class

e09725d

fix return statement in hipDataType_to_tensile_type and add type check for activation inputs

faa7dc7

Merge branch 'users/hongjche/gfx950_mx_rebase_sync' into gfx950_mx_rebase

ae18131

Merge branch 'develop' into users/hongjche/gfx950_mx_rebase_sync

b967536

Fix gfx12 build error with integer cast

5babee6

Fix test_SolutionStructsUtilities

ed39e63

- Add MacDataTypeA to mock kernel since it is required for getting the MI data type now from the utility function.

talumbau requested review from a team as code owners March 11, 2026 18:34

talumbau closed this Mar 11, 2026

github-actions bot added project: hipblaslt project: hipsparselt ci:hipsparselt-fast labels Mar 11, 2026

talumbau wants to merge 71 commits into develop from users/talumbau/unit_test_macdatatypeA_fix

talumbau commented Mar 11, 2026


11 participants
