fix: refactor kvcache structure and support kernel_block_size #724
Open
SJTUGavinLiu wants to merge 6 commits into main from
Conversation
Force-pushed 2c961cf to 787e60d
Force-pushed 39a24cd to 61d985d
xinfei-shi reviewed Mar 9, 2026
rtp_llm/cpp/cache/KVCacheResource.cc
Outdated
const size_t old_size = block_indices.size();
block_indices.resize(new_size, value);
if (!is_full_) {
    kernel_block_indices_.resize(new_size, value);
Collaborator
If blocks_per_kv_block_ is simply fixed to 1 for the non-full layers, wouldn't that remove the need for branch handling everywhere?
Author
Yes, that works. I'll remove the is_full flag then.
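The reviewer's simplification could look roughly like this. `BlockIdsSketch` is a hypothetical stand-in for the real class around the snippet above, with only `resize()` shown:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the suggestion above: fix blocks_per_kv_block_ to 1
// for non-full layers, so resize() needs no is_full_ branch -- both index
// arrays are always resized, and for non-full layers they stay identical.
struct BlockIdsSketch {
    std::vector<int> block_indices;          // physical (logical) block ids
    std::vector<int> kernel_block_indices_;  // kernel-level block ids
    std::size_t blocks_per_kv_block_ = 1;    // 1 => no physical/kernel split

    void resize(std::size_t new_size, int value) {
        block_indices.resize(new_size, value);
        // Each physical block corresponds to blocks_per_kv_block_ kernel
        // blocks, so the kernel index array scales by that factor.
        kernel_block_indices_.resize(new_size * blocks_per_kv_block_, value);
    }
};
```

With `blocks_per_kv_block_ == 1` the two arrays move in lockstep, which is exactly the non-full case; full layers just carry a larger factor.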
rtp_llm/cpp/models/PyWrappedModel.h
Outdated
const auto type = (static_cast<size_t>(gid) < layout.group_types.size()) ?
                      layout.group_types[static_cast<size_t>(gid)] :
                      rtp_llm::CacheGroupType::FULL;
kv_cache.layer_attn_types.push_back(type);
Collaborator
Could the kv cache manager's get-layout interface provide the layer_attn_types field directly?
Author
Sure. Like group_types, I'll put it in CacheConfig and add this field to CacheLayout as well.
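The conversion under discussion can be sketched standalone like this. The enum and the free function `layerAttnTypes` are simplified stand-ins (the real type is `rtp_llm::CacheGroupType`, and the real group ids would come from the cache manager's layout):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of what the reviewed snippet computes: map each layer's group id
// to an attention type, defaulting to FULL when the id is out of range.
// CacheGroupType and both parameters are simplified stand-ins.
enum class CacheGroupType { FULL, SLIDING };

std::vector<CacheGroupType> layerAttnTypes(
    const std::vector<int>& layer_group_ids,
    const std::vector<CacheGroupType>& group_types) {
    std::vector<CacheGroupType> out;
    out.reserve(layer_group_ids.size());
    for (int gid : layer_group_ids) {
        out.push_back(static_cast<std::size_t>(gid) < group_types.size()
                          ? group_types[static_cast<std::size_t>(gid)]
                          : CacheGroupType::FULL);
    }
    return out;
}
```

Hoisting this into the layout (as agreed above) means callers receive the per-layer list instead of recomputing it at each use site.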
f"kv_cache_base={self.kv_cache.kv_cache_base.shape if self.kv_cache.kv_cache_base is not None else None}, "
f"kv_scale_base={self.kv_cache.kv_scale_base.shape if self.kv_cache.kv_scale_base is not None else None}, "
f"num_kv_layers={num_layers}, "
f"layer0_kv_cache_shape={layer0_shape}, "
Collaborator
This is a bit odd: it prints layer 0's shape specifically. Maybe just drop it, since different layers' shapes can differ.
Author
Those were added during testing; I'll check and remove them all.
Force-pushed c1e2ebc to 8f99312
Force-pushed 8f99312 to 056d1ef
Force-pushed 29bafc3 to 5c45cc0
Force-pushed 5c45cc0 to 5460176
Background & Motivation
This PR systematically refactors the KVCache data structure and related attention operator interfaces to address two core issues:

- The `KVCache` struct stored all layers' KV caches as a monolithic tensor. In MHA/MLA hybrid layouts, each layer may have a different shape and cannot be directly indexed by layer id, forcing callers to handle this with ad-hoc logic.

Key Changes

1. KVCache Data Structure Refactoring

- Added a `LayerKVCache` struct as a per-layer KV cache view, holding `kv_cache_base`, `kv_scale_base`, `seq_size_per_block`, and `layer_id` for a single layer.
- Removed the monolithic `kv_cache_base` / `kv_scale_base` tensor fields from `KVCache`, replacing them with per-layer `kv_cache_base_by_layer` / `kv_scale_base_by_layer` vectors. Added metadata fields: `num_kv_heads`, `head_dim`, `use_mla`, `kv_lora_rank`, `rope_head_dim`.
- `getLayerCache(idx)` now returns a `LayerKVCache`, automatically reshaping raw 2D buffers in hybrid cache mode: MHA layers → `[block_num, 2, kv_heads, seq_size, head_dim]`, MLA layers → `[block_num, seq_size, lora_rank + rope_dim]`.

2. Introduce `kernel_block_size` to Decouple Logical and Kernel Blocks

- Added `kernel_seq_size_per_block` to `CacheConfig`, configurable via the `--kernel_seq_size_per_block` launch argument or the `KERNEL_SEQ_SIZE_PER_BLOCK` environment variable. When it is smaller than `seq_size_per_block`, each physical KV block is split into multiple kernel blocks at the operator level.
- Extended `BlockIds` with a dual-index mechanism: `block_indices` (physical blocks) and `kernel_block_indices_` (kernel blocks) are always kept in sync. A `kernelBlocks()` accessor is exposed, and all mutation operations (`add()`, `remove()`, `swap()`, `setAt()`, `resize()`, `popBack()`) automatically update both index arrays, replacing direct raw vector manipulation.
- The `KVCacheResource` and `BatchKVCacheResource` interfaces are extended to propagate `blocks_per_kv_block` and `group_types`.
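A minimal standalone sketch of the dual-index mechanism described above (only `add()` and `popBack()` are shown; the names mirror the PR, but the class body and the physical-to-kernel id mapping are assumptions for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the dual-index idea: every physical KV block covering
// seq_size_per_block tokens is split into blocks_per_kv_block kernel
// blocks of kernel_seq_size_per_block tokens each, and the two index
// arrays are kept in sync on every mutation.
class BlockIds {
public:
    explicit BlockIds(std::size_t blocks_per_kv_block)
        : blocks_per_kv_block_(blocks_per_kv_block) {}

    void add(int physical_block) {
        block_indices_.push_back(physical_block);
        // Assumed derivation: physical block p covers kernel blocks
        // [p * n, p * n + n), where n = blocks_per_kv_block_.
        for (std::size_t k = 0; k < blocks_per_kv_block_; ++k) {
            kernel_block_indices_.push_back(
                physical_block * static_cast<int>(blocks_per_kv_block_) +
                static_cast<int>(k));
        }
    }

    void popBack() {
        // Dropping one physical block drops its whole run of kernel blocks.
        block_indices_.pop_back();
        kernel_block_indices_.resize(kernel_block_indices_.size() -
                                     blocks_per_kv_block_);
    }

    const std::vector<int>& blocks() const { return block_indices_; }
    const std::vector<int>& kernelBlocks() const { return kernel_block_indices_; }

private:
    std::size_t blocks_per_kv_block_;
    std::vector<int> block_indices_;
    std::vector<int> kernel_block_indices_;
};
```

For example, with `seq_size_per_block = 64` and `kernel_seq_size_per_block = 16`, `blocks_per_kv_block` is 4, so physical block 5 would expand to kernel blocks 20 through 23 under this mapping.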