vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) by jeffbolznv · Pull Request #18678 · ggml-org/llama.cpp

jeffbolznv · 2026-01-07T20:56:29Z

This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work as long as the number of blocks in the A matrix is less than 2^32 (for mul_mat_vec or mul_mm_cm2); for mul_mm, I think the limit is roughly 2^32*LOAD_VEC_A elements.
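To make the sizes concrete, here is a small C++ sketch of the arithmetic for the Q8_0 8192 x 5120 x 128 matrix. It assumes the usual ggml Q8_0 layout (32 weights plus a 2-byte fp16 scale per block, 34 bytes in total). The names `MatrixSize`, `q8_0_size` and `fits_in_u32` are made up for illustration and do not come from the PR.

```cpp
#include <cstdint>

// Assumed Q8_0 layout: 32 int8 weights plus one fp16 scale per block.
constexpr uint64_t QUANT_K     = 32;  // elements per block
constexpr uint64_t BLOCK_BYTES = 34;  // 2-byte scale + 32 one-byte quants

// Hypothetical helper type, for illustration only.
struct MatrixSize {
    uint64_t elements;
    uint64_t blocks;
    uint64_t bytes;
};

// Element, block and byte counts of a Q8_0 tensor with the given dimensions.
constexpr MatrixSize q8_0_size(uint64_t ne0, uint64_t ne1, uint64_t ne2) {
    return { ne0 * ne1 * ne2,
             ne0 * ne1 * ne2 / QUANT_K,
             ne0 * ne1 * ne2 / QUANT_K * BLOCK_BYTES };
}

// True when the value can be held in a 32-bit unsigned index.
constexpr bool fits_in_u32(uint64_t v) { return v <= UINT32_MAX; }

constexpr MatrixSize A = q8_0_size(8192, 5120, 128);

static_assert(!fits_in_u32(A.elements), "element index needs more than 32 bits");
static_assert( fits_in_u32(A.blocks),   "block index still fits in 32 bits");
static_assert(!fits_in_u32(A.bytes),    "byte size exceeds any 32-bit buffer range");
```

The element count (5,368,709,120) is above 2^32, while the block count (167,772,160) is well below it, which is why indexing in units of blocks keeps working. The byte size (5,704,253,440) cannot fit in `maxStorageBufferRange`, which Vulkan defines as a 32-bit limit.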

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle variants. So far this change just adds a single use case for this, compiling with the e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.
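The three changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in the PR: `PipelineSketch`, `select_variant` and the two `block_index_*` helpers are hypothetical names, and the real `vk_pipeline_struct` has many more fields.

```cpp
#include <cstdint>
#include <memory>

// Block index computed by multiplying first: the 32-bit product can wrap
// modulo 2^32 before the division happens.
constexpr uint32_t block_index_late_divide(uint32_t batch, uint32_t stride_elems, uint32_t quant_k) {
    return (batch * stride_elems) / quant_k;
}

// Block index with the stride converted to blocks first, so the product
// stays within 32 bits as long as the total block count does.
constexpr uint32_t block_index_early_divide(uint32_t batch, uint32_t stride_elems, uint32_t quant_k) {
    return batch * (stride_elems / quant_k);
}

// Hypothetical stand-in for a compiled pipeline with a chain of variants.
struct PipelineSketch {
    bool is_64b_indexing = false;          // assumed per-variant property
    std::shared_ptr<PipelineSketch> next;  // next variant in the linked list
};

// Walk the variant list and return the pipeline whose indexing mode matches
// what the A matrix needs; returns nullptr if no such variant exists.
inline PipelineSketch* select_variant(PipelineSketch* base,
                                      uint64_t a_size_bytes,
                                      uint64_t max_storage_buffer_range) {
    const bool need_64b = a_size_bytes > max_storage_buffer_range;
    for (PipelineSketch* p = base; p != nullptr; p = p->next.get()) {
        if (p->is_64b_indexing == need_64b) {
            return p;
        }
    }
    return nullptr;
}
```

For the last expert of the matrix in the description (batch 127, stride 8192 x 5120 elements, QUANT_K of 32), the early-divide form yields block 166,461,440, whereas the late-divide form wraps and lands on a different block. Selecting the variant per dispatch keeps the cheaper 32-bit pipeline for every matrix that fits within `maxStorageBufferRange`.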

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort to avoid enabling it unconditionally.


0cc4m

LGTM


jeffbolznv requested review from 0cc4m and ggerganov as code owners January 7, 2026 20:56

github-actions bot added the testing, Vulkan, and ggml labels Jan 7, 2026

0cc4m approved these changes Jan 12, 2026


0cc4m merged commit 2bbe4c2 into ggml-org:master Jan 12, 2026
75 checks passed

ThiloteE mentioned this pull request Feb 11, 2026

Misc. bug: Vulcan premature out of memory exception on AMD Instinct MI60 #11598

Closed

0cc4m merged 1 commit into ggml-org:master from jeffbolznv:64b_indexing
