Fix testSolutionStructsUtilities Unit Test#5350
Contributor
talumbau
commented
Mar 11, 2026
- Solution and kernel writer rebased
- Contraction and DataType rebase
- client modification
- Parameters and LocalRead rebased
- rocisa support and other changes
- Components rebased
- writer related change and yaml
- Conversion rebased
- Fixed errors in compiling
- Fixed python space issues
- Fixed bugs in Python files and generated kernels successfully
- Disable swap address for mxsa/mxsb
- Committed some missing fixes
- Fixed mxsa/mxsb address offset
- Added TODO memo for later consideration
- bpe function fix
- Fixed an incorrect address offset calculation
- MX F8 functional tests passed in tensilelite
- Updated f8 yaml file
- Removed the MX F6 yaml files since MX F6 is not ready yet
- Updated f4 yaml file for test coverage
- Standardize kernel names with MX types (Standardize kernel names with MX types #4363)
- Fix some errors breaking non-mx tests on mx branch (Fix some errors breaking non-mx tests on mx branch #4616)
- Fix for gfx950 mxfp4 DirectToLds (Fix for gfx950 mxfp4 DirectToLds #4644)
- [hipBLASLt] Enable MX data generation for Tensile host and support calling Tensile MX kernel ([hipBLASLt] Enable MX data generation for Tensile host and support calling Tensile MX kernel #4599)
- [hipBLASLt] Add block size into predicate for correct solution selection ([hipBLASLt] Add block size into predicate for correct solution selection #4702)
- [Tensilelite] Add MXFP4 data generator for Tensile ([Tensilelite] Add MXFP4 data generator for Tensile #4597)
- Enable DirectToLds for MXSA/B and re-enable LdsPad for MXFP4 + DirectToLds (Enable DirectToLds for MXSA/B and re-enable LdsPad for MXFP4 + DirectToLds #4683)
- Fix data initialization (Fix data initialization #4827)
- Fix a verification fail with MXFP4 + non DTL (Fix a verification fail with MXFP4 + non DTL #4715)
- [hipblaslt] Fixing build issues for gfx_950_mx_rebase ([hipblaslt] Fixing build issues for gfx_950_mx_rebase #4465)
- [TensileLite] Fix MX FP4 scale data overwrite in initializeCPUInputs ([TensileLite] Fix MX FP4 scale data initialization & argument scoping #4917)
- Fix stream-k with mx scaling (Fix stream-k with mx scaling #4388)
- [hipblaslt] Fix fails with dtl.yaml and xfp32.yaml on gfx950_mx_rebase ([hipblaslt] Fix fails with dtl.yaml and xfp32.yaml on gfx950_mx_rebase #4906)
- add kernel["ProblemType"]["Sparse"] to condition
- fix dependency issues for tensilelite clients
- fix computeInputType in tensilelite
- fix computeInputType issue in ReferenceValidator.cpp
- [hipblaslt] fix unit tests for gfx950_mx_rebase ([hipblaslt] fix unit tests for gfx950_mx_rebase #4912)
- [hipblaslt] Fix a verification fail with spmm_i8hs.yaml ([hipblaslt] Fix a verification fail with spmm_i8hs.yaml #5034)
- initial set of testcase for MXFP4 (initial set of testcase for MXFP4 #4739)
- [Tensilelite] Add regression test for MX FP4 scale buffer determinism ([Tensilelite] Add regression test for MX FP4 scale buffer determinism #4959)
- UseF32XEmulation in forceLrvwTile1 for B tensor (Fix bad merge: restore UseF32XEmulation in forceLrvwTile1 for B tensor #5143)
- [hipblaslt] Enable StoreSwapAddr for MXFP4, plus add GRVWMXSA/B adjustment for DTL ([hipblaslt] Enable StoreSwapAddr for MXFP4, plus add GRVWMXSA/B adjustment for DTL #5117)
- [Tensilelite] Fix UserArgs struct stride mismatch in grouped GEMM ([Tensilelite] Fix UserArgs struct stride mismatch in grouped GEMM #5129)
- [hipBLASLt] Disable failed mx f8 problem sizes ([hipBLASLt] Disable failed mx f8 problem sizes #5105)
- [hipblaslt] Scheduling related fixes for MXFP4 ([hipblaslt] Scheduling related fixes for MXFP4 #5169)
- remove explicit constructor from variable_value class
- fix return statement in hipDataType_to_tensile_type and add type check for activation inputs
- [Tensilelite] Shuffle mx scaling data in Tensile ([Tensilelite] Shuffle mx scaling data in Tensile #4864)
- [hipblaslt] Fix fail with kringshift.yaml ([hipblaslt] Fix fail with kringshift.yaml #5228)
- [hipblaslt] Optimize StoreSwapAddr ([hipblaslt] Optimize StoreSwapAddr #5217)
- [hipblaslt] Enable MXFP4 + DtlPlusLdsBuf ([hipblaslt] Enable MXFP4 + DtlPlusLdsBuf #5251)
- Fix gfx12 build error with integer cast
- [hipblaslt] Fix SIA3 issues with MXFP4 ([hipblaslt] Fix SIA3 issues with MXFP4 #5245)
- [hipBLASLt] Fix CI failures for gfx942 ([hipBLASLt] Fix CI failures for gfx942 #5216)
- Make the usage side’s logic consistent with allocation side (tPackM) (Make the usage side’s logic consistent with allocation side (tPackM) #5273)
- Fix test_SolutionStructsUtilities
## Motivation
Standardize kernel names with MX types.
## Technical Details
Removes redundant underscores and makes some minor adjustments so MX kernel names are consistent with the rest of the library.
## Test Plan
Checked kernel names with the MX test.
## Motivation
Fix some errors in codegen that break non-mx tests.
## Technical Details
calcLdsPad was refactored to include LRVWA/B; its argument list and call points needed to be updated to match. bpeGR now contains a floating-point value, so offsets must be converted to int before being written to assembly, since offsets cannot be floats.
## Test Plan
Only tested with the stream-k unit test so far (non-mx). That one now passes.
## Motivation
Enable DTL for gfx950 MXFP4.
## Technical Details
- Fixed inconsistent conditions for MXSA/B vreg allocation and local write code generation
- Temporarily disabled UseGeneralizedNLCOneA/B (causes issues with MXSA/B)
## Test Plan
Single test with a tensilelite yaml
## Test Result
Local test passed
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…lling Tensile MX kernel (#4599)
## Motivation
This PR enables using mxDataGenerator when Tensile is the host and supports calling FP4 kernels generated from Tensile.
## Technical Details
- Add an FP4 library (yaml) generated by Tensile under the GridBased category: [YAML](https://github.com/ROCm/rocm-libraries/blob/1848898c099b127bf77f8306678150953f563648/projects/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/gfx950/GridBased/aquavanjaram_Cijk_Alik_Bljk___F_4_S_S___M_X_A_3_2___M_X_B_3_2___BH_Bias_S_HA_S_SAV_UserArgs.yaml)
- Remove macros so mxDataGenerator is used regardless of which host is used. Note that **the default C++ standard is now set to C++20**, as required by mxDataGenerator.
- Support calling Tensile FP4 solutions
## Test Plan
Use a cmake preset build with the rocRoller host **off** (i.e., use Tensile as the host), the GPU target set to gfx950, and `-DBUILD_TESTING:BOOL=OFF` (turns off the tensileLite tests, which would error out during the build).
- Use `hipblaslt-test`: `./clients/hipblaslt-test --gtest_filter=*matmul_tensile_fp4*`
- Use `hipblaslt-bench`: `./clients/hipblaslt-bench --iters 0 --cold_iters 0 --transA T --transB N --a_type f4_r --b_type f4_r --c_type f32_r --d_type f32_r -m 256 -n 256 -k 256 --alpha 2.1 --beta 0.7 --scaleA 3 --scaleB 3 --scale_type f32_r --verify`
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…ion (#4702)
## Motivation
This PR updates the predicate to check the block size of MX data types. If block size is not considered, the wrong solution might be selected.
## Technical Details
- Add MX block size A and B into the predicate and serialization
## Test Plan
Manually tried:
`./clients/hipblaslt-bench --iters 0 --cold_iters 0 --transA T --transB N --a_type f8_r --b_type f8_r --c_type f32_r --d_type f32_r -m 256 -n 256 --alpha 2.1 --beta 0.7 --scaleA 3 --scaleB 3 --scale_type f32_r --verify`
No solution is found (currently there is no solution for MX f8).
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Add an MXFP4 data generator for Tensile.
## Technical Details
Integrate hipBLASLt's MXFP4 data generator into TensileLite. The goal is to decrease the amount of zero-valued data so that it better resembles production workloads, thereby improving real-time performance accuracy. After integration, zero frequency drops by 0.5% and stays in the range of ~12.5%–13%. Note: upgrades to C++20, as mxDataGenerator requires it.
## Test Plan
MXDataGen_test.cpp verifies the integration works and that the zero frequency is below a fixed threshold.
## Test Result
TBD
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AIHPBLAS-796
…tToLds (#4683)
## Motivation
Improve MXFP4 performance with DirectToLds.
## Technical Details
- Enable DirectToLds for MXSA/B
- Re-enable LdsPad for MXFP4 + DirectToLds
- Added test cases to mx32f4_tn.yaml
## Test Plan
Added test cases to a tensilelite yaml
## Test Result
Local test passed
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Builds didn't work if hipBLASLt wasn't already installed.
## Technical Details
Changed the include paths so MXDataGen didn't need hipBLASLT.hpp.
## Test Plan
Build locally without hipBLASLt already installed.
## Test Result
Build worked.
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Fix a verification failure with MXFP4 + non-DTL.
## Technical Details
Fixed an incorrect waitcnt for prefetch local read in the SubIter case. Also fixed the location of the second waitcnt for the TN case (use MIWaveTileA // 2 for the SubIter=2 case).
## Test Plan
Local test
## Test Result
Local test passed
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…#4959)
## Motivation
Write a test to catch post-generation overwrites of `MXSA`/`MXSB` buffers that desync the CPU reference from GPU data, causing intermittent MX FP4 validation failures.
## Technical Details
Added `MXScaleDeterminismTest/ScaleBufferIsDeterministic` to `MXDataGen_test.cpp`:
- calls `generateMXInput` twice with sentinel-initialized scale buffers (`0x00` and `0xFF`) and asserts equality; the differing sentinels prevent a silent no-write from passing
- includes non-32-aligned cases (M=204, M=213) that failed due to `initializeCPUInputs` overwriting `MXSA`/`MXSB` after `generateMXInput` populated them
## Test Plan
- `MXScaleDeterminismTest/ScaleBufferIsDeterministic` (4 cases)
## Test Result
- [x] All tests passing
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
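The sentinel technique described above can be sketched in a few lines. This is a hedged illustration of the testing pattern only: `generate_scales` is a hypothetical stand-in for `generateMXInput`'s scale-buffer population, not the real API.

```python
# Run the same generator twice into buffers pre-filled with different
# sentinel bytes (0x00 and 0xFF) and require identical results. If the
# generator silently skipped any byte, the surviving sentinels would
# differ and the equality check would fail.
def generate_scales(buf: bytearray, seed: int) -> None:
    # Stand-in deterministic generator: overwrites every byte of the buffer.
    state = seed & 0xFFFFFFFF
    for i in range(len(buf)):
        state = (1103515245 * state + 12345) & 0xFFFFFFFF
        buf[i] = (state >> 16) & 0xFF

def scales_are_deterministic(nbytes: int, seed: int) -> bool:
    a = bytearray([0x00] * nbytes)  # sentinel pattern 1
    b = bytearray([0xFF] * nbytes)  # sentinel pattern 2
    generate_scales(a, seed)
    generate_scales(b, seed)
    # Differing sentinels prevent a silent no-write from passing equality.
    return a == b

assert scales_are_deterministic(204, seed=42)  # non-32-aligned size analogue
```

Running the generator once with each sentinel is what makes a "forgot to write" bug visible: a buffer the generator never touched would still carry its initial fill.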
## Summary
- When `gfx950_mx_rebase` split `forceLrvwTile1` into per-tensor definitions (using `MacDataTypeA`/`MacDataTypeB` instead of `DataType`), the B tensor copy lost the `not UseF32XEmulation` guard present on `develop`
- This caused TF32 kernels to get `lrvwTileB=1` instead of `VectorWidthB=2`, halving the local read block width and doubling the number of LRB instructions (8 → 16), breaking CMS schedules
- One-line fix: add `and (not kernel["UseF32XEmulation"])` to the B tensor `forceLrvwTile1` definition
## Test plan
- [x] `xfp32.yaml` passes
- [x] `cms_tf32_nt.yaml` passes (extracted from `custom_mainloop_scheduling_tf32.yaml`)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
AIHPBLAS-1221 AIHPBLAS-1224
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…ment for DTL (#5117)
## Motivation
Enable MT256x256x256 for MXFP4. This needs StoreSwapAddr and a GRVWMXSA/B adjustment for DTL.
## Technical Details
- Enable StoreSwapAddr for MXFP4
- Added GRVWMXSA/B adjustment logic for DTL (64-bit to 32-bit)
## Test Plan
CI test
## Test Result
Check CI test
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…5129)
## Motivation
Grouped GEMM with `UseUserArgs: True` crashes with `hipErrorIllegalAddress` because the kernel's `userArgsInfo.totalSize` doesn't match `sizeof(DeviceUserArguments)`.
## Technical Details
`DeviceUserArguments` is a fixed 196-byte packed struct, but `userArgsInfo` field sizes in `Signature.py` were only set inside feature-gated blocks, leaving them at 0 when features were disabled. This made `totalSize` (the stride between per-problem entries) and `extArgOffset` (epilogue field offsets) smaller than the struct, causing the kernel to read from wrong offsets for multi-problem grouped GEMMs. Fix: move the size assignments outside the conditional blocks to always match the struct layout.
## Test Plan
Existing test `grouped_gemm_userargs.yaml` validates grouped GEMM with `UseUserArgs: True` across ScaleAlphaVec, Bias+Activation, and plain configurations.
## Test Result
- [x] Test passed
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AIHPBLAS-1220
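The stride-mismatch bug class described above can be sketched abstractly. The field names and sizes below are made up for illustration (the real `DeviceUserArguments` layout is simply stated to be 196 packed bytes); only the pattern, counting sizes inside feature gates versus unconditionally, is from the source.

```python
# If per-field sizes are accumulated only when a feature is enabled, the
# computed stride between per-problem argument entries is smaller than the
# fixed struct, so entry i is read at the wrong offset on the GPU side.
STRUCT_SIZE = 196  # fixed packed struct size on the host side (from the PR)

# Hypothetical field sizes chosen to sum to 196.
FIELD_SIZES = {"base": 160, "bias": 24, "activation": 12}

def user_args_total_size(bias_enabled: bool, buggy: bool) -> int:
    total = FIELD_SIZES["base"]
    if buggy:
        # Bug: epilogue field sizes counted only when the feature is on.
        if bias_enabled:
            total += FIELD_SIZES["bias"] + FIELD_SIZES["activation"]
    else:
        # Fix: always count every field so the stride matches the struct.
        total += FIELD_SIZES["bias"] + FIELD_SIZES["activation"]
    return total

def entry_offset(i: int, total_size: int) -> int:
    # Offset of problem i in a grouped-GEMM argument array.
    return i * total_size

assert user_args_total_size(bias_enabled=False, buggy=False) == STRUCT_SIZE
assert user_args_total_size(bias_enabled=False, buggy=True) != STRUCT_SIZE
```

With the buggy accounting, every entry past the first is shifted by the accumulated deficit, which is exactly the multi-problem corruption the PR reports.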
## Motivation
Comment out two problem sizes of the new MX f8 tests that currently fail.
## Test Plan
Manually tested other problem sizes to verify they pass.
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Improve MXFP4 instruction scheduling. Plus, reduce sgpr usage for
MXFP4+StreamK
## Technical Details
- Added if condition for MXFP4 in getMFMAIssueLatency
- Fixed numGRIncInst calculation for MXFP4
- Use MinGRIncPerMfma=3 for MXFP4
- Disable staggerU code for MXFP4 + StreamK
## Test Plan
CI test
## Test Result
Check CI test
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…k for activation inputs
## Motivation
Generate pre-shuffled scaling data that TensileLite can use for MX
datatypes
## Technical Details
Added optional `--enable-mx-preswizzle` flag to control MX FP4 scale
pre-swizzle.
When enabled, the flag allows scale tensors to be rearranged into gfx950
GPU kernel memory access patterns using parameters `{swizzleTileMN=32,
tileK=8, subTileK=4}` and `{tileK=8, swizzleTileMN=32}`.
Pre-swizzle requirements: flag enabled, solution available, `useScaleAB`
populated, and dimension alignment satisfied (rows%256==0, cols%32==0
for mxBlock=32).
## Test Plan
`MXPreSwizzleTest` validates scale permutation behavior and data buffer
invariance across multiple matrix dimensions and transpose modes.
## Test Result
Tests pass successfully
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AIHPBLAS-789
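The eligibility rule quoted above (flag enabled, solution available, `useScaleAB` populated, alignment satisfied) can be sketched as a single predicate. The function name and argument set are illustrative, not the TensileLite API; the numeric rules (`rows % 256 == 0`, `cols % 32 == 0` for `mxBlock=32`) are taken from the text.

```python
# Decide whether a scale tensor qualifies for MX pre-swizzle, assuming the
# requirements stated in the PR description.
def can_preswizzle(rows: int, cols: int, flag_enabled: bool,
                   solution_available: bool, use_scale_ab: bool,
                   mx_block: int = 32) -> bool:
    # All gating conditions must hold before checking geometry.
    if not (flag_enabled and solution_available and use_scale_ab):
        return False
    if mx_block != 32:
        # The alignment rule below is stated only for mxBlock=32.
        return False
    # Dimension alignment required by the gfx950 swizzle tiling.
    return rows % 256 == 0 and cols % 32 == 0

assert can_preswizzle(512, 64, True, True, True)
assert not can_preswizzle(513, 64, True, True, True)   # rows not 256-aligned
assert not can_preswizzle(512, 64, False, True, True)  # flag disabled
```

Keeping the gate as one pure function makes it easy to unit-test exactly the dimension cases `MXPreSwizzleTest` is said to cover.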
## Motivation
Fix a failure with kringshift.yaml on the gfx950_mx_rebase branch.
## Technical Details
Swapped the locations of graAddresses and graFinalOffsets to align with the develop branch. Before this fix, s[sgprSrdB+0] was referenced before initialization in the kringshift case.
## Test Plan
Run kringshift.yaml
## Test Result
kringshift.yaml passed.
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Optimize sgpr allocation for StoreSwapAddr.
## Technical Details
- Use one common sgpr (swapCommon) for all of A, B, MXSA, and MXSB
- Use s_add m0, swapCommon, LocalWriteAddr for m0 initialization
## Test Plan
Run the existing dtl test.
## Test Result
dtl.yaml passed.
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
MXFP4 + DtlPlusLdsBuf was broken and needed to be enabled.
## Technical Details
Added the missing code for MXFP4 + DtlPlusLdsBuf.
## Test Plan
Local test with MXFP4 + DtlPlusLdsBuf
## Test Result
Local test passed
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Fix scheduling issues with MXFP4 + SIA3.
## Technical Details
- Fix an incorrect local read count for MXSA/B in getLocalWriteMFMAStart
- Fix an incorrect global read count for MXSA/B in getNumLocalWritePerMfma
- Optimize PointerLRCode scheduling for MXFP4
## Test Plan
tensilelite common test
## Test Result
tensilelite common test passed
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This PR fixes failed gfx942 tests in CI.
## Technical Details
For fnuz/non-fnuz f8 types, the `ComputeInputType` might appear in a concatenated form, `ComputeInputTypeAComputeInputTypeB` (e.g., `BFloat8Float8_fnuz`). This leads to type mismatches when selecting solutions. This PR adds logic to correctly interpret concatenated types by checking the type at the corresponding position.
## Test Plan
Manually run these tests:
gfx942:
- `./clients/hipblaslt-test --gtest_filter=*pre_checkin_matmul_real_1b_fnuz_dst_1b_fnuz_smallsize_bf8_fnuz_rf8_fnuz_rbf8_fnuz_rbf8_fnuz_rf32_r_NT_3_128_128_1_3_128_2_3_3_1_SA_SB*`
gfx950:
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_dst_bf16*`
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_bf16*`
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_fp32*`
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
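The positional parsing idea above, splitting a concatenated string like `BFloat8Float8_fnuz` into its A-type and B-type parts, can be sketched as follows. The helper name and the two-entry type table are illustrative; the real fix lives in hipBLASLt's solution-selection code.

```python
# Split a possibly concatenated compute-input type string into (A, B).
# A single (non-concatenated) type applies to both inputs.
KNOWN_TYPES = ("BFloat8", "Float8")  # check longer names first

def split_compute_input_type(s: str):
    suffix = ""
    if s.endswith("_fnuz"):
        # The fnuz marker applies to the whole string in this sketch.
        s, suffix = s[:-len("_fnuz")], "_fnuz"
    for a in KNOWN_TYPES:
        if s.startswith(a) and s[len(a):] in KNOWN_TYPES:
            # Concatenated form: type at position 0 is A, remainder is B.
            return a + suffix, s[len(a):] + suffix
    return s + suffix, s + suffix

assert split_compute_input_type("BFloat8Float8_fnuz") == ("BFloat8_fnuz", "Float8_fnuz")
assert split_compute_input_type("Float8") == ("Float8", "Float8")
```

Trying the longer prefix first matters because `Float8` is a suffix of `BFloat8`; a naive shortest-match scan would mis-split the concatenated form.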
…5273)
## Motivation
- The error: [math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628](https://math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628)
- Fix the inconsistency between LocalRead.py and KernelWriterAssembly.py
- Change `setComputeInputType` to `setComputeInputTypeA` and `setComputeInputTypeB`, because `setComputeInputType` doesn't exist anymore.
## Technical Details
- The LocalRead side referenced the wrong names, non-set sgprs such as `sgprPackKForV2` and `sgprPackKForV3`; it should reference `sgprPackKForMV2`/`sgprPackKForMV3`, so this change makes it reference the right ones.
## Test Plan
- Use the attached file `setup_hipsparselt_build` to set up the packages and env variables the same as the CI: [setup_hipsparselt_build.sh](https://github.com/user-attachments/files/25870143/setup_hipsparselt_build.sh)
- `./install.sh -c`
## Test Result
- First build, with errors: [build_hipsparselt_1st.log](https://github.com/user-attachments/files/25870212/build_hipsparselt_1st.log)
- Final build, WITHOUT errors: [build_hipsparselt_final.log](https://github.com/user-attachments/files/25870348/build_hipsparselt_final.log)
## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: tomchengchitang <tom.tang@amd.com>
- Add MacDataTypeA to mock kernel since it is required for getting the MI data type now from the utility function.