perfci run on commit 1a55eba
Performance Report for gfx12
Comparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
- 0.15% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 1.24% | p=3.2348e-04
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
- 1.81% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
- 2.00% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
Resource Report for gfx12
✔️ No Resource Usage Changes ✔️
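For anyone post-processing these reports, the summary lines (for example `- 0.15% | p=2.5347e-02`) can be pulled out with a small script. The following is a minimal sketch, not part of perfci: the helper name `parse_diffs`, the regular expression, and the `alpha` parameter are assumptions made for illustration, and the report itself does not say whether a positive percentage is a speedup or a slowdown, so the sign is kept as-is.

```python
import re

# Hypothetical pattern (an assumption, not perfci's own parser): matches
# summary lines such as "+ 0.40% | p=5.5589e-03".
DIFF_LINE = re.compile(r"^([+-])\s*(\d+(?:\.\d+)?)%\s*\|\s*p=([0-9.eE+-]+)\s*$")


def parse_diffs(text, alpha=0.05):
    """Return (signed_percent, p_value) pairs whose p-value is below alpha."""
    results = []
    for line in text.splitlines():
        match = DIFF_LINE.match(line.strip())
        if match is None:
            continue  # skip the long problem/solution description lines
        sign, percent, p_value = match.groups()
        signed = float(percent) if sign == "+" else -float(percent)
        p = float(p_value)
        if p < alpha:
            results.append((signed, p))
    return results


# The four gfx12 summary lines from the report above.
sample = """- 0.15% | p=2.5347e-02
+ 1.24% | p=3.2348e-04
- 1.81% | p=5.6994e-05
- 2.00% | p=5.6994e-05"""

diffs = parse_diffs(sample)
for signed, p in diffs:
    print(f"{signed:+.2f}%  p={p:.4e}")
```

Passing a stricter threshold, such as `parse_diffs(sample, alpha=1e-3)`, keeps only the entries with the strongest evidence, which can help when many sub-1% differences are reported.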
Performance Report for gfx950
Comparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.40% | p=5.5589e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.31% | p=2.3575e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.35% | p=2.8427e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.36% | p=3.8816e-05
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.37% | p=1.7451e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, 
workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.64% | p=4.7085e-05
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, 
workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.52% | p=2.9185e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, 
workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.24% | p=1.3906e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 2], 'scaleShuffleTileB': [32, 8, 2], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 2], 'scaleShuffleTileB': [32, 8, 2], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', 
workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.37% | p=7.9623e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.50% | p=3.6504e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', 
workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.42% | p=3.6504e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [64, 4], 'scalePreTileB': [4, 64]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: False, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [64, 4], 'scalePreTileB': [4, 64]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=False, version='')
+ 0.67% | p=6.1069e-09
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [64, 4], 'scalePreTileB': [4, 64]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [2048, 8], padLDS_B: [2048, 8], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: False, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [64, 4], 'scalePreTileB': [4, 64]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[2048, 8], padLDS_B=[2048, 8], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=False, version='')
+ 0.42% | p=6.2761e-06
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.00% | p=1.5662e-09
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'SingleScale', 'scaleType_A': 'E8M0', 'scale_B': 'SingleScale', 'scaleType_B': 'E8M0', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 0.00999999978, scaleValue_B: 0.00999999978, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'SingleScale', 'scaleType_A': 'E8M0', 'scale_B': 'SingleScale', 'scaleType_B': 'E8M0', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=0.00999999978, scaleValue_B=0.00999999978, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.22% | p=6.8857e-05
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.41% | p=8.9649e-10
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=2)GEMMSolution(mac_m=128, 
mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.18% | p=1.7777e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.62% | p=1.9444e-64
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.40% | p=1.2004e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.23% | p=2.4527e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', 
workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.21% | p=1.0436e-37
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 70a312ff466) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', 
workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.18% | p=5.7414e-04
Results truncated; see the full report in the workspace.
Resource Report for gfx950
✔️ No Resource Usage Changes ✔️
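For anyone post-processing these reports, here is a minimal sketch of pulling the `± X.XX% | p=...` summary lines out of the text and re-filtering them by p-value. It assumes the summary lines keep the shape seen in this comment. `parse_diffs`, `significant`, and `Diff` are hypothetical helper names and are not part of perfci. The sketch keeps the sign exactly as printed and makes no assumption about whether `+` means faster or slower.

```python
import re
from typing import List, NamedTuple


class Diff(NamedTuple):
    percent: float  # signed percentage exactly as printed in the report
    p_value: float


# Matches summary lines such as "+ 0.42% | p=3.6504e-03".
# Anything after the p-value (e.g. a fused "Results truncated" note) is ignored.
_DIFF_LINE = re.compile(
    r"^([+-])\s*(\d+(?:\.\d+)?)%\s*\|\s*p=([0-9.]+(?:[eE][+-]?\d+)?)"
)


def parse_diffs(report: str) -> List[Diff]:
    """Collect (percent, p-value) pairs; kernel description lines are skipped."""
    diffs = []
    for line in report.splitlines():
        match = _DIFF_LINE.match(line.strip())
        if match:
            sign = -1.0 if match.group(1) == "-" else 1.0
            diffs.append(Diff(sign * float(match.group(2)), float(match.group(3))))
    return diffs


def significant(diffs: List[Diff], alpha: float = 0.05) -> List[Diff]:
    """Keep only entries whose p-value is below the chosen threshold."""
    return [d for d in diffs if d.p_value < alpha]


if __name__ == "__main__":
    sample = "\n".join([
        "+ 0.42% | p=3.6504e-03",
        "| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, ...)",
        "- 0.05% | p=5.6994e-05",
    ])
    for diff in significant(parse_diffs(sample), alpha=1e-3):
        print(diff)
```

A stricter `alpha` than the report's 0.05 can help separate the sub-0.5% entries with borderline p-values from the strongly supported ones.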
Performance Report for gfx942
Comparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.02% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.05% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.59% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.35% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 2.46% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.75% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.33% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.71% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.31% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.06% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.27% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.21% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.29% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.75% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.49% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.79% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.29% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.60% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.48% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.42% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.76% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 29a58fd2448) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, 
workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 6.91% | p=2.5347e-02
Results truncated, see full report in workspace
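For anyone post-processing this report, here is a minimal sketch of how the diff header lines could be parsed and filtered by the p < 0.05 threshold named in the report heading. The line format is inferred from the entries above, and `parse_diff_line` and `significant` are hypothetical helper names, not part of perfci. The sketch deliberately does not interpret whether `+` or `-` is the favorable direction, since the report does not say.

```python
import re

# Matches header lines such as "+ 0.29% | p=1.7451e-03".
# The format is inferred from the report text, not from the perfci source.
DIFF_RE = re.compile(r"^([+-])\s*(\d+(?:\.\d+)?)%\s*\|\s*p=([0-9.eE+-]+)")


def parse_diff_line(line):
    """Return (signed percent change, p-value), or None if the line is not a diff header."""
    match = DIFF_RE.match(line.strip())
    if match is None:
        return None
    sign, percent, p_value = match.groups()
    delta = float(percent) if sign == "+" else -float(percent)
    return delta, float(p_value)


def significant(lines, alpha=0.05):
    """Keep only the parsed diff headers whose p-value is below alpha."""
    parsed = (parse_diff_line(line) for line in lines)
    return [item for item in parsed if item is not None and item[1] < alpha]
```

Lines that carry the problem and solution parameters do not match the pattern and are skipped, so the raw report text can be passed in line by line.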
Resource Report for gfx942: ✔️ No Resource Usage Changes ✔️
Code Coverage Report for gfx942 (sections: Summary, Artifacts, Commit Hashes)
memmett
reviewed
Feb 26, 2026
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (76.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## develop #4904 +/- ##
========================================
Coverage 65.96% 65.96%
========================================
Files 1720 1720
Lines 267465 267465
Branches 37091 37091
========================================
Hits 176410 176410
Misses 75488 75488
Partials 15567 15567
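As a quick sanity check on the table above, the coverage percentage can be recomputed from the hit, miss, and partial counts. This is a sketch that assumes coverage is hits divided by total lines, with partials counted as not covered; `coverage_percent` is a hypothetical helper, not a Codecov API.

```python
def coverage_percent(hits, misses, partials):
    """Line coverage as a percentage, assuming partial lines count as not covered."""
    total = hits + misses + partials
    return 100.0 * hits / total


# Counts copied from the Coverage Diff table.
hits, misses, partials, lines = 176410, 75488, 15567, 267465

# The three categories should add up to the reported line count.
assert hits + misses + partials == lines

print(f"{coverage_percent(hits, misses, partials):.2f}%")
```

Under that assumption the result rounds to the 65.96% shown in the table for both sides of the diff.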
*This pull request uses carry forward flags.
Motivation
Early spring cleaning for RocRoller.
Technical Details
Test Plan
The only functional changes are to the filename ignore patterns used when running flake8, CodeQL, and codecov.
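To illustrate what a filename ignore pattern does, here is a small sketch using Python's `fnmatch` module. The patterns and the `is_ignored` helper are hypothetical and are not the patterns changed in this PR. flake8, CodeQL, and codecov each apply their own matching rules, so this only approximates the general idea of glob-style exclusion.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for illustration only; the PR's real patterns are not shown here.
IGNORE_PATTERNS = ["build/*", "*/third_party/*", "*_generated.py"]


def is_ignored(path, patterns=IGNORE_PATTERNS):
    """Return True if the path matches any glob-style ignore pattern."""
    # fnmatchcase is case-sensitive on every platform; note that "*" also matches "/".
    return any(fnmatchcase(path, pattern) for pattern in patterns)


for candidate in ["build/lib/x.py", "src/third_party/a.py", "src/main.py"]:
    print(candidate, is_ignored(candidate))
```

A check like this can help confirm that a new pattern excludes only the intended files before it is committed to a tool's configuration.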
Test Result
Submission Checklist