Skip to content

fix: prevent store cache corruption on retry by snapshotting before flush#2972

Open
joonas wants to merge 5 commits intodefenseunicorns:mainfrom
joonas:fix/store-cache-retry-nan-corruption
Open

fix: prevent store cache corruption on retry by snapshotting before flush#2972
joonas wants to merge 5 commits intodefenseunicorns:mainfrom
joonas:fix/store-cache-retry-nan-corruption

Conversation

@joonas
Copy link
Copy Markdown
Member

@joonas joonas commented Feb 28, 2026

Description

The retry path in sendUpdatesAndFlushCache used Number(index) to
look up operations by position in a parallel values array. Cache keys
are non-numeric strings produced by fillStoreCache (e.g.
"add:/data/cap-v2-key:value"), so Number(key) always evaluated to
NaN, and payload[NaN] produced undefined. Every retryable
failure silently replaced all pending operations with undefined,
permanently losing them.

The fix snapshots the cache ({ ...cache }) before the patch attempt
and restores it with Object.assign(cache, snapshot) on retryable
failure, avoiding the key-to-index cross-reference entirely.

The existing retry test was tautological — it compared result to
cache, but both were the same object reference, so the assertion
always passed regardless of corruption. The updated test snapshots the
expected state into an independent copy and uses realistic keys from
fillStoreCache to exercise the actual code path.

End to End Test:
(See Pepr Excellent Examples)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Other (security config, docs update, etc)

Checklist before merging

…lush

The retry path in `sendUpdatesAndFlushCache` used `Number(index)` to
look up operations by position in a parallel values array. Cache keys
are non-numeric strings produced by `fillStoreCache` (e.g.
`"add:/data/cap-v2-key:value"`), so `Number(key)` always evaluated to
`NaN`, and `payload[NaN]` produced `undefined`. Every retryable
failure silently replaced all pending operations with `undefined`,
permanently losing them.

The fix snapshots the cache (`{ ...cache }`) before the patch attempt
and restores it with `Object.assign(cache, snapshot)` on retryable
failure, avoiding the key-to-index cross-reference entirely.

The existing retry test was tautological — it compared `result` to
`cache`, but both were the same object reference, so the assertion
always passed regardless of corruption. The updated test snapshots the
expected state into an independent copy and uses realistic keys from
`fillStoreCache` to exercise the actual code path.

Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>
…e` retry

The previous commit replaced `Object.assign(cache, snapshot)` for the
retry restore, but that unconditionally overwrote every key in the
cache — including entries added by `sender()` during the `await` on
the Patch call. If a concurrent write used a colliding key (possible
for remove operations targeting the same path), the newer value was
silently clobbered by the stale snapshot.

Replace `Object.assign` with a loop that only restores entries missing
from the cache, skipping keys that already exist. This preserves
concurrent writes while still re-queuing the original operations for
retry.

The same fix is applied to the 422 path, which previously used
`Object.keys(cache)` to clear entries — also dropping concurrent
writes. Both paths now iterate `Object.keys(snapshot)` consistently.

Add a test that simulates a concurrent `sender()` write during the
Patch `await` by injecting a new cache entry inside the mock's
`Patch` implementation before rejecting. The test verifies that both
the concurrent write and the original snapshotted entry survive.

Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>
…s path

The existing test suite assumed a static cache during the call. In
production, `sender()` adds entries concurrently while the K8s Patch
is in flight. The failure-path concurrent-write test was added in the
previous commit, but there was no equivalent for the success path.

Add a test that hooks the mock Patch to inject a new cache entry via
`fillStoreCache` during the `await`, then verifies that the
post-success cleanup only removes the snapshotted keys — leaving the
concurrently added entry intact. This directly exercises the
`Object.keys(snapshot)` cleanup on line 20, proving it does not
over-delete entries that arrived after the snapshot was taken.

Signed-off-by: Joonas Bergius <joonas@defenseunicorns.com>
Copy link
Copy Markdown
Contributor

@cmwylie19 cmwylie19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, thanks. Nice catch

@github-project-automation github-project-automation Bot moved this to 👀 In review in Pepr Project Board Feb 28, 2026
@samayer12
Copy link
Copy Markdown
Contributor

@greptileai

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR fixes a silent data-loss bug in sendUpdatesAndFlushCache where the retry path used Number(index) on non-numeric string keys (e.g. "add:/data/cap-v2-key:value"), always yielding NaN and clobbering every pending operation with undefined. The fix snapshots the cache before the patch attempt and uses a restoreMissing helper on retryable failure, correctly preserving any concurrent writes that arrived while the patch was in flight. Test coverage is substantially improved, including a new test that catches the formerly-tautological assertion.

Confidence Score: 5/5

Safe to merge — fixes a real silent-data-loss bug with correct snapshot semantics and thorough tests.

All findings are P2 or lower. The core fix is well-reasoned: snapshotting before flush avoids the NaN index bug, and restoreMissing correctly preserves concurrent writes. The new tests cover the key edge cases (concurrent write on success, concurrent write on retry, non-tautological repopulation check). No regressions or security concerns identified.

No files require special attention.

Important Files Changed

Filename Overview
src/lib/controller/storeCache.ts Replaces broken index-based retry with snapshot/restoreMissing approach; logic is correct and handles concurrent writes on both success and retryable-failure paths.
src/lib/controller/storeCache.test.ts Adds three new scenario tests (concurrent write on success, concurrent write on retry, realistic-key repopulation) and fixes the previously tautological same-reference assertion.

Sequence Diagram

sequenceDiagram
    participant Sender as sender()
    participant Flush as sendUpdatesAndFlushCache
    participant K8s as K8s Patch API

    Sender->>Flush: call with cache {key1: op1}
    Note over Flush: snapshot = { ...cache }<br/>payload = Object.values(snapshot)
    Flush->>K8s: Patch(payload)
    Note over Flush,K8s: in-flight — Sender may write<br/>new keys to cache

    alt Patch succeeds
        K8s-->>Flush: resolved
        Note over Flush: delete snapshot keys from cache<br/>(concurrent writes survive)
    else 422 Unprocessable Entity
        K8s-->>Flush: rejected {status: 422}
        Note over Flush: delete snapshot keys from cache<br/>(discard bad ops permanently)
    else retryable error (e.g. 500)
        K8s-->>Flush: rejected {status: 500}
        Note over Flush: restoreMissing(cache, snapshot)<br/>adds snapshot keys only if absent<br/>(concurrent writes for same key survive)
    end

    Flush-->>Sender: return cache
Loading

Reviews (1): Last reviewed commit: "Merge branch 'main' into fix/store-cache..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 👀 In review

Development

Successfully merging this pull request may close these issues.

4 participants