feat: add resume support for R2 cache population #1175
beobungbu wants to merge 1 commit into opennextjs:main
When populating a large R2 cache (100K+ entries), the process can fail
due to transient 502/503 errors from the worker proxy. Currently, a
failure requires restarting from entry 0, wasting hours of upload time.
This adds a lightweight resume mechanism:
- Progress directory: /tmp/opennext-cache-{worker}-{buildId}/
  - total.txt: entry count (detects stale state from different builds)
  - failed-at.txt: index of the entry that exhausted retries
- On fatal error (retry exhaustion), the failed entry index is saved.
- On restart, entries before that index are skipped.
- On success, the progress directory is deleted.
- Stale directories from previous builds are auto-cleaned.
Zero overhead during normal upload (no file writes per entry).
Only writes to disk on crash (1 write) and on resume check (1 read).
Fixes opennextjs#1173
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
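
The state handling described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`progressDir`, `readResumeIndex`, `recordFailure`, `clearProgress`) are hypothetical, and `os.tmpdir()` is assumed to stand in for `/tmp`. Cleanup of directories left behind by other build IDs is omitted.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: builds the progress directory path from the pattern
// in the description. `root` defaults to the OS temp dir (assumed to be /tmp).
function progressDir(worker: string, buildId: string, root: string = os.tmpdir()): string {
  return path.join(root, `opennext-cache-${worker}-${buildId}`);
}

// Returns the index to resume from, or 0 when there is no usable state.
// State counts as stale when total.txt does not match the current entry count.
function readResumeIndex(dir: string, total: number): number {
  try {
    const savedTotal = Number.parseInt(fs.readFileSync(path.join(dir, "total.txt"), "utf8"), 10);
    const failedAt = Number.parseInt(fs.readFileSync(path.join(dir, "failed-at.txt"), "utf8"), 10);
    const valid =
      savedTotal === total && Number.isInteger(failedAt) && failedAt >= 0 && failedAt < total;
    if (valid) return failedAt;
  } catch {
    // Missing or unreadable files: fall through and start from scratch.
  }
  // Stale or incomplete state: remove it so the next run starts clean.
  fs.rmSync(dir, { recursive: true, force: true });
  return 0;
}

// Called only when an entry exhausts its retries.
function recordFailure(dir: string, total: number, failedIndex: number): void {
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(path.join(dir, "total.txt"), String(total));
  fs.writeFileSync(path.join(dir, "failed-at.txt"), String(failedIndex));
}

// Called after a fully successful upload.
function clearProgress(dir: string): void {
  fs.rmSync(dir, { recursive: true, force: true });
}
```

In this sketch nothing is written during a normal upload; the two small files are only created in `recordFailure`, and `readResumeIndex` is the single read at startup.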
@beobungbu how often does the upload crash? I'm not against adding resume support, but it will not solve uploading on CI, which is the most frequent use case for deployment. Maybe what we can first do is to
What do you think about that plan? Could you please confirm the numbers you usually see: upload time for how many assets? 60+ min for 100k files sounds slow to me. Thanks
Summary
When populating a large R2 incremental cache (100K+ entries), the local worker proxy encounters intermittent 502 Bad Gateway errors. After 5 retry attempts on a single entry, the entire process crashes — requiring a full restart from entry 0. For a 180K-entry cache, this means losing 60+ minutes of upload progress.
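
To make the failure mode concrete, here is a hedged sketch of an upload loop that gives up on an entry after a fixed number of attempts and reports that entry's index, which is the one value a resume needs. `uploadAll`, `RetryExhaustedError`, and the `upload` callback are hypothetical names for illustration, not the actual `populateCache` internals, and backoff between attempts is left out.

```typescript
// Hypothetical error type carrying the index of the entry that exhausted its retries.
class RetryExhaustedError extends Error {
  readonly index: number;

  constructor(index: number, cause: unknown) {
    super(`entry ${index} failed after all attempts: ${String(cause)}`);
    this.name = "RetryExhaustedError";
    this.index = index;
  }
}

// Uploads entries in order, starting at `startIndex` (0 for a fresh run).
// Each entry is attempted up to `maxAttempts` times; if every attempt fails,
// the loop stops and throws an error that records where it stopped.
async function uploadAll<T>(
  entries: readonly T[],
  upload: (entry: T) => Promise<void>,
  startIndex = 0,
  maxAttempts = 5,
): Promise<void> {
  for (let i = startIndex; i < entries.length; i++) {
    let lastError: unknown;
    let done = false;
    for (let attempt = 1; attempt <= maxAttempts && !done; attempt++) {
      try {
        await upload(entries[i]);
        done = true;
      } catch (err) {
        lastError = err; // e.g. a transient 502 from the proxy
      }
    }
    if (!done) throw new RetryExhaustedError(i, lastError);
  }
}
```

A caller could catch the error, persist `index`, and pass it back as `startIndex` on the next run so that earlier entries are skipped.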
This PR adds a lightweight resume mechanism so that populateCache can pick up from where it left off after a crash.
How it works
Progress directory:
/tmp/opennext-cache-{workerName}-{buildId}/
total.txt
failed-at.txt
Resume flow:
Stale detection:
total.txt must match current build → content changes detected
Design decisions
/tmp
Testing
Tested with a 180K-entry Next.js app (13 locales × ~14K administrative entities):
failed-at.txt written
Related