fix: refactor kvcache structure and support kernel_block_size#724

Open
SJTUGavinLiu wants to merge 6 commits into main from develop/chanyin/hybrid_kvcache_fix

Conversation

Collaborator

@SJTUGavinLiu SJTUGavinLiu commented Feb 27, 2026

Background & Motivation

This PR systematically refactors the KVCache data structure and related attention operator interfaces to address two core issues:

  1. Per-layer view inconsistency in hybrid cache scenarios: The original KVCache struct stored all layers' KV caches as a monolithic tensor. In MHA/MLA hybrid layouts, each layer may have a different shape and cannot be directly indexed by layer id, forcing callers to handle this with ad-hoc logic.
  2. Coupling between KV management granularity and attention kernel granularity: Attention operators (FlashInfer, TRTLLM, etc.) have preferences on block size, but the KV manager's physical block size could not be adjusted independently, limiting optimization opportunities on the kernel side.

Key Changes

1. KVCache Data Structure Refactoring

  • Introduced a standalone LayerKVCache struct as a per-layer KV cache view, holding kv_cache_base, kv_scale_base, seq_size_per_block, and layer_id for a single layer.
  • Removed the legacy monolithic kv_cache_base / kv_scale_base tensor fields from KVCache, replacing them with per-layer kv_cache_base_by_layer / kv_scale_base_by_layer vectors. Added metadata fields: num_kv_heads, head_dim, use_mla, kv_lora_rank, rope_head_dim.
  • getLayerCache(idx) now returns a LayerKVCache, automatically reshaping raw 2D buffers in hybrid cache mode: MHA layers → [block_num, 2, kv_heads, seq_size, head_dim], MLA layers → [block_num, seq_size, lora_rank + rope_dim].
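
The per-layer reshape rule above can be sketched as follows. This is a minimal, self-contained illustration of the shape logic only; the enum values, struct layout, and function name mirror the PR description but the exact signatures in the codebase are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real rtp_llm types; only the shape
// arithmetic from the PR description is modeled here.
enum class CacheGroupType { FULL, MLA };

// Computes the per-layer view shape that getLayerCache(idx) would apply
// to the raw 2D buffer in hybrid cache mode:
//   MHA layers -> [block_num, 2, kv_heads, seq_size, head_dim]
//   MLA layers -> [block_num, seq_size, lora_rank + rope_dim]
std::vector<int64_t> layerViewShape(CacheGroupType type,
                                    int64_t      block_num,
                                    int64_t      num_kv_heads,
                                    int64_t      seq_size_per_block,
                                    int64_t      head_dim,
                                    int64_t      kv_lora_rank,
                                    int64_t      rope_head_dim) {
    if (type == CacheGroupType::MLA) {
        return {block_num, seq_size_per_block, kv_lora_rank + rope_head_dim};
    }
    // FULL (MHA) layout: separate K and V planes, hence the literal 2.
    return {block_num, 2, num_kv_heads, seq_size_per_block, head_dim};
}
```

With `kv_lora_rank = 512` and `rope_head_dim = 64`, an MLA layer's last dimension is 576, while an MHA layer with the same buffer gets the five-dimensional K/V view instead.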

2. Introduce kernel_block_size to Decouple Logical and Kernel Blocks

  • Added kernel_seq_size_per_block to CacheConfig, configurable via --kernel_seq_size_per_block launch argument or KERNEL_SEQ_SIZE_PER_BLOCK environment variable. When smaller than seq_size_per_block, each physical KV block is split into multiple kernel blocks at the operator level.
  • Refactored BlockIds with a dual-index mechanism: block_indices (physical blocks) and kernel_block_indices_ (kernel blocks) are always kept in sync. A kernelBlocks() accessor is exposed, and all mutation operations (add(), remove(), swap(), setAt(), resize(), popBack()) automatically update both index arrays, replacing direct raw vector manipulation.
  • KVCacheResource and BatchKVCacheResource interfaces extended to propagate blocks_per_kv_block and group_types.
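
The dual-index mechanism can be illustrated with a minimal sketch. Only two mutators are shown, and the index math (kernel id = physical id × ratio + offset) is an assumption about the layout; the real BlockIds keeps both arrays in sync inside every mutation operation listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified sketch of the dual-index BlockIds described above.
// blocks_per_kv_block = seq_size_per_block / kernel_seq_size_per_block.
class BlockIds {
public:
    explicit BlockIds(size_t blocks_per_kv_block)
        : blocks_per_kv_block_(blocks_per_kv_block) {}

    // Adding one physical block appends the corresponding run of
    // kernel blocks, keeping both index arrays in sync.
    void add(int physical_block) {
        block_indices_.push_back(physical_block);
        const int ratio = static_cast<int>(blocks_per_kv_block_);
        for (int i = 0; i < ratio; ++i) {
            kernel_block_indices_.push_back(physical_block * ratio + i);
        }
    }

    // Removing one physical block drops its whole run of kernel blocks.
    void popBack() {
        block_indices_.pop_back();
        kernel_block_indices_.resize(kernel_block_indices_.size() - blocks_per_kv_block_);
    }

    const std::vector<int>& blocks() const { return block_indices_; }
    const std::vector<int>& kernelBlocks() const { return kernel_block_indices_; }

private:
    size_t           blocks_per_kv_block_;
    std::vector<int> block_indices_;         // physical KV blocks
    std::vector<int> kernel_block_indices_;  // kernel-granularity blocks
};
```

For example, with `seq_size_per_block = 64` and `kernel_seq_size_per_block = 16` (ratio 4), adding physical block 3 yields kernel blocks 12 through 15, which is what `kernelBlocks()` would hand to the attention operator.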

@SJTUGavinLiu SJTUGavinLiu force-pushed the develop/chanyin/hybrid_kvcache_fix branch 8 times, most recently from 2c961cf to 787e60d Compare March 4, 2026 06:44
@SJTUGavinLiu SJTUGavinLiu requested a review from xinfei-shi March 6, 2026 08:42
@SJTUGavinLiu SJTUGavinLiu changed the title from "fix: refactor KVCache and add LayerKVCache" to "fix: refactor kvcache structure and support kernel_block_size" Mar 6, 2026
@SJTUGavinLiu SJTUGavinLiu force-pushed the develop/chanyin/hybrid_kvcache_fix branch 4 times, most recently from 39a24cd to 61d985d Compare March 8, 2026 09:41
const size_t old_size = block_indices.size();
block_indices.resize(new_size, value);
if (!is_full_) {
    kernel_block_indices_.resize(new_size, value);
}
Collaborator

For non-FULL layers, if blocks_per_kv_block_ is simply fixed to 1, wouldn't that remove the need for branching in every code path?

Collaborator Author

Yes, that works. I'll remove the is_full flag then.

const auto type = (static_cast<size_t>(gid) < layout.group_types.size())
                      ? layout.group_types[static_cast<size_t>(gid)]
                      : rtp_llm::CacheGroupType::FULL;
kv_cache.layer_attn_types.push_back(type);
Collaborator

Could the kv cache manager's get layout interface provide the layer_attn_types field directly?

Collaborator Author

SJTUGavinLiu commented Mar 9, 2026

I'll handle it in CacheConfig the same way as group_types, and add this field to CacheLayout as well.

f"kv_cache_base={self.kv_cache.kv_cache_base.shape if self.kv_cache.kv_cache_base is not None else None}, "
f"kv_scale_base={self.kv_cache.kv_scale_base.shape if self.kv_cache.kv_scale_base is not None else None}, "
f"num_kv_layers={num_layers}, "
f"layer0_kv_cache_shape={layer0_shape}, "
Collaborator

This looks a bit odd: it specifically prints layer 0's shape. Maybe just drop it, since different layers can have different shapes anyway.

Collaborator Author

That was added during testing; I'll go through and remove all of these.

@SJTUGavinLiu SJTUGavinLiu force-pushed the develop/chanyin/hybrid_kvcache_fix branch 8 times, most recently from c1e2ebc to 8f99312 Compare March 16, 2026 09:00
@SJTUGavinLiu SJTUGavinLiu force-pushed the develop/chanyin/hybrid_kvcache_fix branch from 8f99312 to 056d1ef Compare March 16, 2026 14:38
@SJTUGavinLiu SJTUGavinLiu force-pushed the develop/chanyin/hybrid_kvcache_fix branch 3 times, most recently from 29bafc3 to 5c45cc0 Compare March 17, 2026 07:08
@SJTUGavinLiu SJTUGavinLiu force-pushed the develop/chanyin/hybrid_kvcache_fix branch from 5c45cc0 to 5460176 Compare March 17, 2026 08:29