Disk (L3) file eviction lacks metadata synchronization and uses a hardcoded FIFO strategy
Background
mooncake-store supports offloading KV Cache to local disk (L3) as a persistence layer. When disk quota is exceeded, StorageBackend evicts files to reclaim space. We identified two issues in the current eviction mechanism that affect read-path correctness and cache efficiency.
Problem
1. Evicted files leave stale metadata on Master
When StorageBackend::EvictFile() removes a file from disk, it only handles the local side — the file is deleted and the tracking queue is updated, but Master is never notified:
```
EvictFile()
  → fs::remove(path)
  → RemoveFileFromWriteQueue(path)
  → ReleaseSpace(file_size)
  // ← Master still holds a DISK replica record for this key
```
This creates a metadata inconsistency. Master believes the DISK replica is still valid, so when a Get request arrives for that key, it may route the read to the evicted disk path. The client then hits FILE_OPEN_FAIL, wastes an RPC round-trip and a disk I/O attempt, and has to fall back to re-prefill. In the worst case — where the memory replica has already been evicted by Master's own memory eviction — the object becomes a "ghost entry": metadata exists but no valid replica backs it.
2. Hardcoded FIFO eviction strategy
The eviction logic is hardcoded as FIFO via SelectFileToEvictByFIFO(), using a plain std::list<FileRecord> ordered by insertion time. Meanwhile, LoadObject does not record any access information, so the eviction decision is purely based on write order with no awareness of actual usage.
For KV Cache workloads this is suboptimal. In typical LLM serving scenarios, prefill-generated KV blocks that are reused across multiple decode steps (e.g., system prompt caches, shared prefix caches) should be retained longer, while one-shot KV blocks from completed requests should be evicted first. FIFO treats them equally and may evict hot data prematurely, leading to unnecessary re-computation.
Proposed Approach
1. Metadata sync via a new RemoveDiskReplica API
We propose adding a callback mechanism in StorageBackend: when EvictFile() successfully removes a file, it invokes a registered callback that notifies Master via a new RemoveDiskReplica(key) RPC.
We believe this should be a dedicated API rather than reusing the existing Remove, because the semantics are fundamentally different:
| | Remove(key) | RemoveDiskReplica(key) |
|---|---|---|
| Scope | Deletes all replicas (memory + disk) and the metadata entry | Only erases DISK-type replicas |
| Memory replicas | Removed | Preserved |
| Metadata cleanup | Always deleted | Only if no valid replicas remain |
| Trigger | User-facing operation | Internal, triggered by storage-layer capacity management |
The core principle: disk eviction is a local capacity decision and should not invalidate memory replicas that are still actively serving reads. Only when removing the disk replica leaves no valid replicas at all should the metadata entry be cleaned up.
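To make the intended semantics concrete, here is a minimal sketch of the Master-side logic, using a deliberately simplified metadata model (the `ObjectMeta`/`ReplicaType` structures below are illustrative, not the actual MasterService types):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative metadata model; the real MasterService structures differ.
enum class ReplicaType { MEMORY, DISK };

struct ObjectMeta {
    std::vector<ReplicaType> replicas;
};

class MasterService {
  public:
    std::unordered_map<std::string, ObjectMeta> metadata_;

    // Erase only DISK-type replicas for the key. The metadata entry is
    // dropped only when removing them leaves no valid replica at all,
    // so memory replicas keep serving reads untouched.
    void RemoveDiskReplica(const std::string& key) {
        auto it = metadata_.find(key);
        if (it == metadata_.end()) return;
        auto& reps = it->second.replicas;
        reps.erase(std::remove(reps.begin(), reps.end(), ReplicaType::DISK),
                   reps.end());
        if (reps.empty()) metadata_.erase(it);
    }
};
```

This contrasts with `Remove(key)`, which would unconditionally delete every replica and the entry itself.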
To support this, FileRecord needs to carry the original object key (not just the file path), so the eviction callback knows which key to report to Master. This means StoreObject would accept a key parameter and pass it through to the tracking queue.
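The extended record could look roughly like this (field names are illustrative; only the addition of the key field is the point):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch: FileRecord carries the original object key alongside the file
// path, so the eviction callback knows which key to report to Master.
struct FileRecord {
    std::string path;  // on-disk location of the offloaded object
    uint64_t size = 0; // bytes occupied, used for quota accounting
    std::string key;   // new: object key, passed in via StoreObject
};
```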
The proposed notification flow:
```
StorageBackend::EvictFile()
  → fs::remove(path)
  → eviction_callback_(key, path, size)           // new: trigger callback
  → Client → MasterClient::RemoveDiskReplica(key) // new: RPC to Master
  → MasterService::RemoveDiskReplica(key)         // new: erase DISK replicas
```
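The callback wiring itself can be sketched with a `std::function` hook. Names such as `SetEvictionCallback` and the callback signature below are assumptions for illustration, not the actual mooncake-store API:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback signature: object key, file path, reclaimed bytes.
using EvictionCallback =
    std::function<void(const std::string& key, const std::string& path,
                       uint64_t size)>;

class StorageBackend {
  public:
    // Registered once at Init time by the client layer, which forwards the
    // notification to Master via RemoveDiskReplica(key).
    void SetEvictionCallback(EvictionCallback cb) {
        eviction_callback_ = std::move(cb);
    }

    // Local removal elided; after the file is gone, fire the hook so the
    // key's DISK replica record can be erased on Master.
    void EvictFile(const std::string& key, const std::string& path,
                   uint64_t size) {
        // fs::remove(path); RemoveFileFromWriteQueue(path); ReleaseSpace(size);
        if (eviction_callback_) eviction_callback_(key, path, size);
    }

  private:
    EvictionCallback eviction_callback_;
};
```

Keeping the hook optional means the backend still works standalone (e.g., in unit tests) when no client is attached.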
2. Pluggable eviction strategy
We propose abstracting the eviction logic behind a DiskEvictionStrategy interface, replacing the hardcoded FIFO queue:
```cpp
class DiskEvictionStrategy {
  public:
    virtual void AddFile(const std::string& path, uint64_t size,
                         const std::string& key) = 0;
    virtual void RemoveFile(const std::string& path) = 0;
    virtual void RecordAccess(const std::string& path) = 0;
    virtual FileRecord SelectFileToEvict() = 0;
    virtual ~DiskEvictionStrategy() = default;
};
```

This allows different eviction policies to be swapped in via configuration (e.g., eviction_policy = "LRU" in FilePerKeyConfig):
- FIFO: Current behavior, kept for backward compatibility.
- LRU: Evicts the least recently accessed file.
LoadObject would call RecordAccess(path) on each read so the strategy can track access recency. This is a better fit for KV Cache workloads with temporal locality.
The strategy is selected at StorageBackend::Init time via a factory, and all eviction operations (AddFile, RemoveFile, SelectFileToEvict) are delegated to the strategy instance. The rest of StorageBackend remains unchanged.
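As one concrete instance, an LRU policy behind the interface could be sketched as follows. This is an illustration under assumed names (the base class and `FileRecord` are restated here to keep the sketch self-contained), not the proposed patch:

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

struct FileRecord {
    std::string path;
    uint64_t size = 0;
    std::string key;
};

class DiskEvictionStrategy {
  public:
    virtual void AddFile(const std::string& path, uint64_t size,
                         const std::string& key) = 0;
    virtual void RemoveFile(const std::string& path) = 0;
    virtual void RecordAccess(const std::string& path) = 0;
    virtual FileRecord SelectFileToEvict() = 0;
    virtual ~DiskEvictionStrategy() = default;
};

// LRU sketch: most recently accessed records sit at the list front,
// eviction candidates come from the back. A path→iterator index keeps
// every operation O(1).
class LRUDiskEvictionStrategy : public DiskEvictionStrategy {
  public:
    void AddFile(const std::string& path, uint64_t size,
                 const std::string& key) override {
        lru_.push_front({path, size, key});
        index_[path] = lru_.begin();
    }
    void RemoveFile(const std::string& path) override {
        auto it = index_.find(path);
        if (it == index_.end()) return;
        lru_.erase(it->second);
        index_.erase(it);
    }
    // Called from LoadObject on each read: promote to most-recent.
    void RecordAccess(const std::string& path) override {
        auto it = index_.find(path);
        if (it == index_.end()) return;
        lru_.splice(lru_.begin(), lru_, it->second);
    }
    // Least recently used record is at the back of the list.
    FileRecord SelectFileToEvict() override {
        return lru_.empty() ? FileRecord{} : lru_.back();
    }

  private:
    std::list<FileRecord> lru_;
    std::unordered_map<std::string, std::list<FileRecord>::iterator> index_;
};
```

The existing FIFO behavior falls out of the same interface by simply never reordering on RecordAccess.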
Open Questions
- Sync vs. async notification: Our current implementation calls RemoveDiskReplica synchronously in the eviction path. This is simple and guarantees immediate consistency, but may block the write path if Master is slow or unreachable. An alternative is an async notification queue with retry, achieving eventual consistency at the cost of a short window where stale metadata may still be served. Which approach does the community prefer?
- Default policy: Should we keep FIFO as the default for backward compatibility and let users opt into LRU, or switch the default to LRU?
- Scope: We have a working implementation for both improvements. Happy to split into separate PRs (metadata sync first, then strategy abstraction) if the community prefers incremental review.
We'd appreciate the community's feedback on the problem analysis and proposed direction. We can submit a PR once we align.