fix: prevent store cache corruption on retry by snapshotting before flush by joonas · Pull Request #2972 · defenseunicorns/pepr

joonas · 2026-02-28T22:13:11Z

Description

The retry path in sendUpdatesAndFlushCache used Number(index) to
look up operations by position in a parallel values array. Cache keys
are non-numeric strings produced by fillStoreCache (e.g.
"add:/data/cap-v2-key:value"), so Number(key) always evaluated to
NaN, and payload[NaN] produced undefined. Every retryable
failure silently replaced all pending operations with undefined,
permanently losing them.

The fix snapshots the cache ({ ...cache }) before the patch attempt
and restores it with Object.assign(cache, snapshot) on retryable
failure, avoiding the key-to-index cross-reference entirely.

The existing retry test was tautological — it compared result to
cache, but both were the same object reference, so the assertion
always passed regardless of corruption. The updated test snapshots the
expected state into an independent copy and uses realistic keys from
fillStoreCache to exercise the actual code path.

End to End Test:
(See Pepr Excellent Examples)

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Other (security config, docs update, etc)

Checklist before merging

Unit, Integration, docs, adr added or updated as needed
Contributor Guide Steps followed

…lush The retry path in `sendUpdatesAndFlushCache` used `Number(index)` to look up operations by position in a parallel values array. Cache keys are non-numeric strings produced by `fillStoreCache` (e.g. `"add:/data/cap-v2-key:value"`), so `Number(key)` always evaluated to `NaN`, and `payload[NaN]` produced `undefined`. Every retryable failure silently replaced all pending operations with `undefined`, permanently losing them. The fix snapshots the cache (`{ ...cache }`) before the patch attempt and restores it with `Object.assign(cache, snapshot)` on retryable failure, avoiding the key-to-index cross-reference entirely. The existing retry test was tautological — it compared `result` to `cache`, but both were the same object reference, so the assertion always passed regardless of corruption. The updated test snapshots the expected state into an independent copy and uses realistic keys from `fillStoreCache` to exercise the actual code path. Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>

…e` retry The previous commit replaced `Object.assign(cache, snapshot)` for the retry restore, but that unconditionally overwrote every key in the cache — including entries added by `sender()` during the `await` on the Patch call. If a concurrent write used a colliding key (possible for remove operations targeting the same path), the newer value was silently clobbered by the stale snapshot. Replace `Object.assign` with a loop that only restores entries missing from the cache, skipping keys that already exist. This preserves concurrent writes while still re-queuing the original operations for retry. The same fix is applied to the 422 path, which previously used `Object.keys(cache)` to clear entries — also dropping concurrent writes. Both paths now iterate `Object.keys(snapshot)` consistently. Add a test that simulates a concurrent `sender()` write during the Patch `await` by injecting a new cache entry inside the mock's `Patch` implementation before rejecting. The test verifies that both the concurrent write and the original snapshotted entry survive. Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>

…s path The existing test suite assumed a static cache during the call. In production, `sender()` adds entries concurrently while the K8s Patch is in flight. The failure-path concurrent-write test was added in the previous commit, but there was no equivalent for the success path. Add a test that hooks the mock Patch to inject a new cache entry via `fillStoreCache` during the `await`, then verifies that the post-success cleanup only removes the snapshotted keys — leaving the concurrently added entry intact. This directly exercises the `Object.keys(snapshot)` cleanup on line 20, proving it does not over-delete entries that arrived after the snapshot was taken. Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>

cmwylie19

yep, thanks. Nice catch

samayer12 · 2026-04-22T14:44:59Z

@greptileai

greptile-apps · 2026-04-22T14:47:06Z

Greptile Summary

This PR fixes a silent data-loss bug in sendUpdatesAndFlushCache where the retry path used Number(index) on non-numeric string keys (e.g. "add:/data/cap-v2-key:value"), always yielding NaN and clobbering every pending operation with undefined. The fix snapshots the cache before the patch attempt and uses a restoreMissing helper on retryable failure, correctly preserving any concurrent writes that arrived while the patch was in flight. Test coverage is substantially improved, including a new test that catches the formerly-tautological assertion.

Confidence Score: 5/5

Safe to merge — fixes a real silent-data-loss bug with correct snapshot semantics and thorough tests.

All findings are P2 or lower. The core fix is well-reasoned: snapshotting before flush avoids the NaN index bug, and restoreMissing correctly preserves concurrent writes. The new tests cover the key edge cases (concurrent write on success, concurrent write on retry, non-tautological repopulation check). No regressions or security concerns identified.

No files require special attention.

Important Files Changed

Filename	Overview
src/lib/controller/storeCache.ts	Replaces broken index-based retry with snapshot/restoreMissing approach; logic is correct and handles concurrent writes on both success and retryable-failure paths.
src/lib/controller/storeCache.test.ts	Adds three new scenario tests (concurrent write on success, concurrent write on retry, realistic-key repopulation) and fixes the previously tautological same-reference assertion.

Sequence Diagram

sequenceDiagram
    participant Sender as sender()
    participant Flush as sendUpdatesAndFlushCache
    participant K8s as K8s Patch API

    Sender->>Flush: call with cache {key1: op1}
    Note over Flush: snapshot = { ...cache }<br/>payload = Object.values(snapshot)
    Flush->>K8s: Patch(payload)
    Note over Flush,K8s: in-flight — Sender may write<br/>new keys to cache

    alt Patch succeeds
        K8s-->>Flush: resolved
        Note over Flush: delete snapshot keys from cache<br/>(concurrent writes survive)
    else 422 Unprocessable Entity
        K8s-->>Flush: rejected {status: 422}
        Note over Flush: delete snapshot keys from cache<br/>(discard bad ops permanently)
    else retryable error (e.g. 500)
        K8s-->>Flush: rejected {status: 500}
        Note over Flush: restoreMissing(cache, snapshot)<br/>adds snapshot keys only if absent<br/>(concurrent writes for same key survive)
    end

    Flush-->>Sender: return cache

_{Reviews (1): Last reviewed commit: "Merge branch 'main' into fix/store-cache..." | Re-trigger Greptile}

joonas added 3 commits February 28, 2026 15:59

joonas requested a review from a team as a code owner February 28, 2026 22:13

github-project-automation Bot added this to Pepr Project Board Feb 28, 2026

cmwylie19 approved these changes Feb 28, 2026

View reviewed changes

github-project-automation Bot moved this to 👀 In review in Pepr Project Board Feb 28, 2026

AmberFryar and others added 2 commits March 30, 2026 17:09

Merge branch 'main' into fix/store-cache-retry-nan-corruption

8f1843f

Merge branch 'main' into fix/store-cache-retry-nan-corruption

8a261be

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent store cache corruption on retry by snapshotting before flush#2972

fix: prevent store cache corruption on retry by snapshotting before flush#2972
joonas wants to merge 5 commits intodefenseunicorns:mainfrom
joonas:fix/store-cache-retry-nan-corruption

joonas commented Feb 28, 2026

Uh oh!

cmwylie19 left a comment

Uh oh!

samayer12 commented Apr 22, 2026

Uh oh!

greptile-apps Bot commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

joonas commented Feb 28, 2026

Description

Type of change

Checklist before merging

Uh oh!

cmwylie19 left a comment

Choose a reason for hiding this comment

Uh oh!

samayer12 commented Apr 22, 2026

Uh oh!

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants