Conversation

@0cc4m (Collaborator) commented Jan 11, 2026

I tuned this on AMD Radeon 8060S, but a brief test also showed improvements on AMD RX 9060 XT.

```
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
```

| model | size | params | ngl | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | pp512 | 815.22 ± 16.88 | 644.84 ± 4.49 | 875.24 ± 17.99 | +35.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | pp512 @ d8192 | 321.66 ± 3.02 | 384.27 ± 0.29 | 463.72 ± 1.32 | +20.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | pp512 | 995.81 ± 32.32 | 529.50 ± 13.07 | 793.51 ± 9.67 | +49.9% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | pp512 @ d8192 | 352.08 ± 3.17 | 341.86 ± 1.96 | 435.55 ± 2.63 | +27.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | pp512 | 755.86 ± 21.11 | 422.37 ± 21.61 | 742.46 ± 15.49 | +75.8% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | pp512 @ d8192 | 317.44 ± 2.31 | 306.97 ± 0.36 | 419.24 ± 4.59 | +36.6% |

```
ggml_vulkan: 0 = AMD Radeon RX 9060 XT (RADV GFX1200) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
```

| model | size | params | ngl | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | pp512 | 648.41 ± 14.37 | 1437.13 ± 1.77 | 1902.23 ± 1.65 | +32.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | pp512 @ d8192 | 410.01 ± 6.22 | 757.59 ± 2.71 | 841.76 ± 3.02 | +11.1% |
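
To make the nature of the change concrete, here is a minimal sketch of device-dependent tile selection. All names and thresholds (`device_info`, `tile_config`, `select_mul_mat_config`, the 512 / 64 KiB cutoffs) are hypothetical illustrations rather than the actual backend code; the commits only establish that a large matmul tile configuration is enabled and tuned for AMD GPUs with coopmat support.

```cpp
// Hypothetical sketch: pick a larger matmul tile on AMD coopmat devices.
// Names and thresholds are illustrative, not the actual ggml code.
#include <cstdint>

struct device_info {
    bool     is_amd;            // vendor ID 0x1002
    bool     coopmat_support;   // VK_KHR_cooperative_matrix available
    uint32_t shared_memory;     // maxComputeSharedMemorySize in bytes
};

struct tile_config {
    uint32_t tile_m, tile_n, tile_k; // per-workgroup output tile and K step
    uint32_t workgroup_size;
};

static tile_config select_mul_mat_config(const device_info & dev,
                                         uint32_t m, uint32_t n) {
    // Large tile: big matmuls on AMD coopmat hardware, where the wide wave64
    // SIMDs benefit from more work per workgroup and shared memory is ample.
    if (dev.is_amd && dev.coopmat_support &&
        m >= 512 && n >= 512 && dev.shared_memory >= 64 * 1024) {
        return {256, 256, 32, 256};
    }
    // Default medium tile for everything else.
    return {128, 128, 16, 128};
}
```

The pp512 gains in the tables above are consistent with this kind of gate: prompt processing is a large matmul, which is exactly where a bigger tile pays off.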

@0cc4m requested a review from jeffbolznv January 11, 2026 08:27
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jan 11, 2026
@daniandtheweb (Contributor)

This brings some small but consistent improvements on RDNA3 as well.

| model | size | params | backend | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
|---|---|---|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 2213.37 ± 25.38 | 1964.06 ± 17.79 | 1992.17 ± 24.98 | +1.43% |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 2480.36 ± 2.56 | 2057.06 ± 6.28 | 2120.18 ± 2.38 | +3.06% |

@0cc4m (Collaborator, author) commented Jan 11, 2026

Thank you for testing it! I was hoping for more; I guess the architecture is too different from the 8060S. There's probably some more tuning for RDNA3/4 dGPUs that can be done, but I don't have the hardware for that.

@0cc4m merged commit 0e76501 into master Jan 11, 2026 (75 of 76 checks passed)
@0cc4m deleted the 0cc4m/vulkan-amd-coopmat-opt branch January 11, 2026 16:33
@netrunnereve (Collaborator)

You should try this for mul_mat_id as well!

@characharm (Contributor) commented Jan 11, 2026

RX 9070 XT:

| model | test | master t/s (±) | PR t/s (±) | diff t/s | diff % |
|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | pp512 | 4646.61 ± 74.94 | 3464.81 ± 61.21 | -1181.80 | -25.4% |
| gpt-oss 20B MXFP4 MoE | tg128 | 175.24 ± 0.50 | 178.45 ± 1.65 | +3.21 | +1.8% |
| gpt-oss 20B MXFP4 MoE | pp512 @ d8192 | 2077.60 ± 19.08 | 1688.42 ± 15.70 | -389.18 | -18.7% |
| gpt-oss 20B MXFP4 MoE | tg128 @ d8192 | 149.41 ± 0.73 | 149.97 ± 0.50 | +0.56 | +0.37% |

@0cc4m (Collaborator, author) commented Jan 11, 2026

> You should try this for mul_mat_id as well!

I tried, but didn't find a good parameter set yet. I'll keep trying.

@characharm Is that Windows or Linux?

@characharm (Contributor)

@0cc4m Windows. I rebooted between tests for accuracy. The numbers are stable.

@0cc4m (Collaborator, author) commented Jan 11, 2026

Thank you for testing. I wish the drivers would behave more similarly. I'll disable the change on Windows.

@0cc4m (Collaborator, author) commented Jan 11, 2026

Can you also test a dense model, though? Those should be more affected than MoE.

@characharm (Contributor)

| model | test | master t/s (±) | PR t/s (±) | diff t/s | diff % |
|---|---|---|---|---|---|
| llama 8B Q4_0 | pp512 @ d8192 | 1659.33 ± 20.44 | 588.10 ± 1.59 | -1071.23 | -64.56% |
| llama 8B Q4_0 | tg128 @ d8192 | 87.20 ± 0.48 | 87.20 ± 0.92 | 0.00 | 0.0% |

@0cc4m (Collaborator, author) commented Jan 11, 2026

@characharm Please check if #18763 restores your performance.

@characharm (Contributor) commented Jan 11, 2026

No, the performance is the same. I compared against the CI build, so the problem is not in my build.

To clarify: the driver exclusion fix didn't work. I thought it might be a local build issue, but I verified with b7707 and got the same results.

@jeffbolznv (Collaborator)

I think these tile sizes are also used for matmul id in some cases, so that could explain the effect on gpt-oss.

@0cc4m (Collaborator, author) commented Jan 11, 2026

@characharm Sorry, I missed disabling the large tile size. Try again, please.

> I think these tile sizes are also used for matmul id in some cases, so that could explain the effect on gpt-oss.

No, I didn't enable the large tile for mul_mat_id, so unless the check somewhere is wrong, it should not be used at all.
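
A sketch of the gating being described, reusing the hypothetical `device_info` from the earlier sketch; the function name and structure are assumptions for illustration, not the actual check in the backend.

```cpp
// Hypothetical sketch of the gating discussed above: the large tile is only
// selected on the plain mul_mat path, and never on Windows drivers.
static bool use_large_amd_tile(const device_info & dev, bool is_mul_mat_id) {
    if (is_mul_mat_id) {
        return false;   // no good parameter set found for mul_mat_id yet
    }
#if defined(_WIN32)
    return false;       // the Windows driver regressed with the large tile
#else
    return dev.is_amd && dev.coopmat_support;
#endif
}
```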

@characharm (Contributor)

@0cc4m Yes, dense and MoE now show the same numbers as before #18749.

@acbits commented Jan 12, 2026

Not sure whether this helps, but Vulkan performance on the RX 7600 has been going down. I don't normally use such a big model on this GPU, but it's interesting that it has degraded.

| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| qwen3 14B Q4_K - Medium | 8.53 GiB | 14.77 B | Vulkan | 99 | 8 | q8_0 | q8_0 | 1 | pp512 | 12.76 ± 0.00 |
| qwen3 14B Q4_K - Medium | 8.53 GiB | 14.77 B | Vulkan | 99 | 8 | q8_0 | q8_0 | 1 | tg128 | 1.35 ± 0.00 |

build: 7d77f07 (7108)

| model | size | params | backend | ngl | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen3 14B Q4_K - Medium | 8.53 GiB | 14.77 B | Vulkan | 99 | q8_0 | q8_0 | 1 | pp512 | 4.67 ± 0.00 |
| qwen3 14B Q4_K - Medium | 8.53 GiB | 14.77 B | Vulkan | 99 | q8_0 | q8_0 | 1 | tg128 | 1.35 ± 0.00 |

build: c9ced49 (7710)

@0cc4m (Collaborator, author) commented Jan 12, 2026

Can you test a model that actually fits into your GPU? That would likely give more usable data.

@acbits commented Jan 12, 2026

> Can you test a model that actually fits into your GPU? That would likely give more usable data.

Luckily, I had copied the results from an old build. Yes, even for smaller models there has been degradation. I'm not sure whether the kernel upgrade played a role.

Kernel: 6.6 (don't remember the exact patch)

| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| qwen3 8B Q5_K - Medium | 5.44 GiB | 8.19 B | Vulkan | 99 | 8 | q8_0 | q8_0 | 1 | pp512 | 498.47 ± 0.00 |
| qwen3 8B Q5_K - Medium | 5.44 GiB | 8.19 B | Vulkan | 99 | 8 | q8_0 | q8_0 | 1 | tg128 | 34.83 ± 0.00 |

build: dd5e8ca (6916)

Kernel: 6.17.9-200

| model | size | params | backend | ngl | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen3 8B Q5_K - Medium | 5.44 GiB | 8.19 B | Vulkan | 99 | q8_0 | q8_0 | 1 | pp512 | 240.94 ± 0.00 |
| qwen3 8B Q5_K - Medium | 5.44 GiB | 8.19 B | Vulkan | 99 | q8_0 | q8_0 | 1 | tg128 | 12.13 ± 0.00 |

build: c9ced49 (7710)

@0cc4m (Collaborator, author) commented Jan 13, 2026

Can you add more information about your setup? What OS, what driver, what does your device info string say, etc?

gary149 pushed a commit to gary149/llama-agent that referenced this pull request Jan 13, 2026
…gml-org#18749)

* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
@acbits commented Jan 13, 2026

> Can you add more information about your setup? What OS, what driver, what does your device info string say, etc?

OS: Fedora 42
Kernel: 6.17.9-200
Mesa: 25.1.9
RX7600-vulkaninfo.json

@0cc4m (Collaborator, author) commented Jan 13, 2026

My guess would be that your driver is too old; for good Mesa coopmat performance you usually want 25.3 or higher. But I didn't want to cause an issue for older versions.
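
For background on why the driver version matters here: RADV reports its identity via `VK_KHR_driver_properties` (core in Vulkan 1.2) and packs the Mesa release into `driverVersion`, so a Mesa-version gate could look like the sketch below. This is an illustration of the mechanism, not ggml's actual code.

```cpp
// Sketch: check for RADV and a minimum Mesa version via driver properties.
#include <vulkan/vulkan.h>

static bool radv_mesa_at_least(VkPhysicalDevice dev, uint32_t major, uint32_t minor) {
    VkPhysicalDeviceDriverProperties drv = {};
    drv.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DRIVER_PROPERTIES;

    VkPhysicalDeviceProperties2 props = {};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props.pNext = &drv;
    vkGetPhysicalDeviceProperties2(dev, &props);

    if (drv.driverID != VK_DRIVER_ID_MESA_RADV) {
        return false;
    }
    // RADV encodes the Mesa release (e.g. 25.1.9) in driverVersion.
    const uint32_t v = props.properties.driverVersion;
    return VK_VERSION_MAJOR(v) > major ||
          (VK_VERSION_MAJOR(v) == major && VK_VERSION_MINOR(v) >= minor);
}

// e.g. radv_mesa_at_least(dev, 25, 3) for the 25.3 threshold mentioned above
```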

@acbits commented Jan 13, 2026

> My guess would be that your driver is too old; for good Mesa coopmat performance you usually want 25.3 or higher. But I didn't want to cause an issue for older versions.

25.1.9 is the latest. No updates are available for Fedora 42.

@Nindaleth (Contributor)

25.1.9, despite being the newest for Fedora 42, is not good enough. If upgrading to Fedora 43 is not an option for you at the moment, you could try a newer Mesa build from the che/mesa COPR repo.

For example, a merge request providing a significant PP speed improvement was merged into the Mesa repo in August and has been available since release 25.2.x or 25.3.x (not sure which).

dillon-blake pushed a commit to Boxed-Logic/llama.cpp that referenced this pull request Jan 15, 2026
…gml-org#18749)

* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l