fix: prevent store cache corruption on retry by snapshotting before flush (#2972)
joonas wants to merge 5 commits into defenseunicorns:main from …
Conversation
…lush
The retry path in `sendUpdatesAndFlushCache` used `Number(index)` to
look up operations by position in a parallel values array. Cache keys
are non-numeric strings produced by `fillStoreCache` (e.g.
`"add:/data/cap-v2-key:value"`), so `Number(key)` always evaluated to
`NaN`, and `payload[NaN]` produced `undefined`. Every retryable
failure silently replaced all pending operations with `undefined`,
permanently losing them.
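A minimal sketch of the failure mode described above (the key and payload shapes are illustrative, modeled on the `fillStoreCache` format):

```typescript
// Cache keys produced by fillStoreCache are strings like
// "add:/data/cap-v2-key:value"; Number(key) on such a key is NaN,
// and payload[NaN] is undefined, so the lookup loses every op.
const cache: Record<string, unknown> = {
  "add:/data/cap-v2-key:value": { op: "add", path: "/data/cap-v2-key" },
};
const payload = Object.values(cache);

for (const key of Object.keys(cache)) {
  const index = Number(key); // NaN for any non-numeric key
  cache[key] = payload[index]; // payload[NaN] === undefined
}

console.log(cache["add:/data/cap-v2-key:value"]); // undefined: the op is gone
```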
The fix snapshots the cache (`{ ...cache }`) before the patch attempt
and restores it with `Object.assign(cache, snapshot)` on retryable
failure, avoiding the key-to-index cross-reference entirely.
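A condensed sketch of that pattern (function and parameter names are illustrative; the actual signature in the PR may differ):

```typescript
type Cache = Record<string, unknown>;

// Illustrative sketch: snapshot before the patch attempt, restore by key
// on failure. "patch" stands in for the K8s Patch call.
async function sendUpdatesAndFlushCache(
  cache: Cache,
  patch: (ops: unknown[]) => Promise<void>,
): Promise<Cache> {
  const snapshot = { ...cache }; // shallow copy taken before the attempt
  try {
    await patch(Object.values(snapshot));
    for (const key of Object.keys(snapshot)) delete cache[key]; // flushed
  } catch {
    // retryable failure: restore by key, no index cross-reference
    Object.assign(cache, snapshot);
  }
  return cache;
}
```

Note that a later commit in this PR replaces the `Object.assign` restore with a loop that skips keys already present, to preserve concurrent writes.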
The existing retry test was tautological — it compared `result` to
`cache`, but both were the same object reference, so the assertion
always passed regardless of corruption. The updated test snapshots the
expected state into an independent copy and uses realistic keys from
`fillStoreCache` to exercise the actual code path.
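The tautology can be shown in a few lines (keys modeled on `fillStoreCache` output; the comparison style is illustrative, not the PR's actual test code):

```typescript
// Old test shape: result and cache are the same object, so comparing
// them passes even after the values are corrupted.
const cache: Record<string, unknown> = { "add:/data/cap-v2-key:value": "op" };
const result = cache; // same reference, as in the old test

cache["add:/data/cap-v2-key:value"] = undefined; // simulate the corruption
console.assert(JSON.stringify(result) === JSON.stringify(cache)); // still "passes"

// Fixed test shape: snapshot the expected state into an independent copy
// before the call, then compare against that.
const expected: Record<string, unknown> = { "add:/data/cap-v2-key:value": "op" };
console.assert(JSON.stringify(cache) !== JSON.stringify(expected)); // corruption caught
```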
Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>
…e` retry

The previous commit used `Object.assign(cache, snapshot)` for the retry restore, but that unconditionally overwrote every key in the cache, including entries added by `sender()` during the `await` on the Patch call. If a concurrent write used a colliding key (possible for remove operations targeting the same path), the newer value was silently clobbered by the stale snapshot.

Replace `Object.assign` with a loop that only restores entries missing from the cache, skipping keys that already exist. This preserves concurrent writes while still re-queuing the original operations for retry. The same fix is applied to the 422 path, which previously used `Object.keys(cache)` to clear entries, also dropping concurrent writes. Both paths now iterate `Object.keys(snapshot)` consistently.

Add a test that simulates a concurrent `sender()` write during the Patch `await` by injecting a new cache entry inside the mock's `Patch` implementation before rejecting. The test verifies that both the concurrent write and the original snapshotted entry survive.

Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>
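A sketch of the restore-only-missing-keys loop this commit describes (the helper name `restoreMissing` comes from the review summary; its exact signature is an assumption):

```typescript
type Cache = Record<string, unknown>;

// Restore snapshot entries only where the cache has no entry for the key,
// so writes that landed during the await are not clobbered.
function restoreMissing(cache: Cache, snapshot: Cache): void {
  for (const key of Object.keys(snapshot)) {
    if (!(key in cache)) {
      cache[key] = snapshot[key]; // re-queue the original op for retry
    }
    // keys already present were written concurrently and take precedence
  }
}

// Colliding key: the concurrent value wins; the missing op is re-queued.
const cache: Cache = { "remove:/data/shared-path": "newer-op" };
restoreMissing(cache, {
  "remove:/data/shared-path": "stale-op",
  "add:/data/other:v": "original-op",
});
```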
…s path

The existing test suite assumed a static cache during the call. In production, `sender()` adds entries concurrently while the K8s Patch is in flight. The failure-path concurrent-write test was added in the previous commit, but there was no equivalent for the success path.

Add a test that hooks the mock Patch to inject a new cache entry via `fillStoreCache` during the `await`, then verifies that the post-success cleanup only removes the snapshotted keys, leaving the concurrently added entry intact. This directly exercises the `Object.keys(snapshot)` cleanup on line 20, proving it does not over-delete entries that arrived after the snapshot was taken.

Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>
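The success-path scenario can be sketched like this (the flush shape and the mock are illustrative, not the PR's actual test code):

```typescript
type Cache = Record<string, unknown>;

// Minimal flush: snapshot, await the (mocked) Patch, then delete only
// the snapshotted keys on success.
async function flush(cache: Cache, patch: () => Promise<void>): Promise<void> {
  const snapshot = { ...cache };
  await patch(); // a concurrent sender() write may land during this await
  for (const key of Object.keys(snapshot)) delete cache[key];
}

(async () => {
  const cache: Cache = { "add:/data/a:1": "op-a" };
  await flush(cache, async () => {
    cache["add:/data/b:2"] = "op-b"; // simulated concurrent write
  });
  console.assert(!("add:/data/a:1" in cache)); // snapshotted key flushed
  console.assert(cache["add:/data/b:2"] === "op-b"); // concurrent write intact
})();
```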
cmwylie19
left a comment
yep, thanks. Nice catch
Greptile Summary

This PR fixes a silent data-loss bug in `sendUpdatesAndFlushCache`.

Confidence Score: 5/5

Safe to merge: fixes a real silent-data-loss bug with correct snapshot semantics and thorough tests. All findings are P2 or lower. The core fix is well-reasoned: snapshotting before flush avoids the NaN index bug, and `restoreMissing` correctly preserves concurrent writes. The new tests cover the key edge cases (concurrent write on success, concurrent write on retry, non-tautological repopulation check). No regressions or security concerns identified. No files require special attention.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Sender as sender()
    participant Flush as sendUpdatesAndFlushCache
    participant K8s as K8s Patch API
    Sender->>Flush: call with cache {key1: op1}
    Note over Flush: snapshot = { ...cache }<br/>payload = Object.values(snapshot)
    Flush->>K8s: Patch(payload)
    Note over Flush,K8s: in-flight — Sender may write<br/>new keys to cache
    alt Patch succeeds
        K8s-->>Flush: resolved
        Note over Flush: delete snapshot keys from cache<br/>(concurrent writes survive)
    else 422 Unprocessable Entity
        K8s-->>Flush: rejected {status: 422}
        Note over Flush: delete snapshot keys from cache<br/>(discard bad ops permanently)
    else retryable error (e.g. 500)
        K8s-->>Flush: rejected {status: 500}
        Note over Flush: restoreMissing(cache, snapshot)<br/>adds snapshot keys only if absent<br/>(concurrent writes for same key survive)
    end
    Flush-->>Sender: return cache
```
Reviews (1): Last reviewed commit: "Merge branch 'main' into fix/store-cache..."
Description
End to End Test:
(See Pepr Excellent Examples)
Type of change
Checklist before merging