The importer was crash-looping because OpenAlex returned mesh entries with null primary key columns. After dedup filtering removed all rows, an empty array was passed to pgp.helpers.insert(), which throws.

- Add empty-array guards after dedup in updateWorksMesh, updateWorksConcepts, updateWorksTopics
- Make the cron error handler resilient (log + retry instead of crash)
- Add an npm test gate to the Dockerfile so broken code can't deploy
- Add realistic Work fixtures covering null PKs, empty arrays, missing locations, and other real API edge cases
- Add transformer tests (13) and saveData integration tests (8)
- Add dedup edge-case tests for the all-null-PK scenario (3)

Made-with: Cursor
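The guard logic can be sketched as follows. This is an illustrative stand-in, not the importer's actual code: the row shape and helper names are hypothetical, and the string return stands in for pg-promise's `pgp.helpers.insert(data, columnSet)`, which does throw when handed an empty data array.

```typescript
// Illustrative sketch: row shape and helper names are hypothetical;
// pg-promise's helpers.insert() throws when given an empty array.
type MeshRow = { workId: string | null; descriptorUi: string | null };

// Drop rows whose primary-key columns are null, then dedup on the composite PK.
function dedupRows(rows: MeshRow[]): MeshRow[] {
  const seen = new Set<string>();
  return rows.filter((r) => {
    if (r.workId === null || r.descriptorUi === null) return false;
    const key = `${r.workId}|${r.descriptorUi}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Guard: return null (skip the insert) instead of handing an empty
// array to the insert builder, which would throw.
function buildInsert(rows: MeshRow[]): string | null {
  const deduped = dedupRows(rows);
  if (deduped.length === 0) return null;
  // Stand-in for pgp.helpers.insert(deduped, columnSet) in the real importer.
  return `INSERT (${deduped.length} rows)`;
}
```

With all-null-PK input (the crash-loop trigger), `buildInsert` now returns null instead of reaching the throwing call.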
…Data tests

Remove dead insertCalls array and unnecessary mock wrapper per CR feedback. The tests use the real pgp.helpers for SQL generation with a mock transaction.

Made-with: Cursor
- Daily digest cron (default 9:00 UTC) reports sync position, days imported, duration, failed batches, and days-behind status
- Error notifications rate-limited to 1/hour to avoid crash-loop spam
- "Caught up" notification fires only on state transition (importing → idle), not on every cron tick
- Digest queries the batch table directly so stats survive pod restarts
- Gracefully no-ops when TELEGRAM_BOT_TOKEN / TELEGRAM_CHAT_ID are not set (opt-in via SOPS secret)
- 19 new tests covering send, rate limiting, dedup, error fallback

Env vars to configure:
TELEGRAM_BOT_TOKEN — Bot API token from @BotFather
TELEGRAM_CHAT_ID — Chat/group ID (e.g. -1002207868111)
TELEGRAM_THREAD_ID — Optional topic thread ID
DIGEST_SCHEDULE — Cron expression (default: '0 9 * * *')
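The 1-per-hour cooldown on error notifications can be sketched like this (a minimal sketch with illustrative names; the real implementation presumably lives alongside the Telegram sender):

```typescript
// Sketch of the 1-per-hour error-notification cooldown; names are
// illustrative, not the importer's actual identifiers.
const ONE_HOUR_MS = 60 * 60 * 1000;
let lastErrorNotifiedAt = -Infinity;

// Returns true (and arms the cooldown) only if the previous notification
// was more than an hour ago; crash-loop errors inside the window are dropped.
function shouldNotifyError(now: number = Date.now()): boolean {
  if (now - lastErrorNotifiedAt < ONE_HOUR_MS) return false;
  lastErrorNotifiedAt = now;
  return true;
}
```

A crash loop firing every few seconds then produces at most one Telegram message per hour instead of a flood.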
Uses long-polling (getUpdates) — outbound HTTPS only, no ingress needed. Bot responds to /status with sync position, last 24h stats, pod uptime. Shared buildDigestMessage between daily digest and /status command. Polling stops cleanly on SIGTERM/SIGINT via AbortController.
- Add explicit radix to parseInt(threadId, 10) - Remove duplicate JSDoc on buildDigestMessage - Log error in /status command catch block - Fix cooldown test: use Date.now spy instead of resetModules
…k-crash fix(openalex-importer): guard against empty arrays after dedup filtering
Builds Docker image on push to main/develop, pushes to ECR with sha-timestamp tags for Flux image automation.
…ings

vitest 4.x uses rolldown, which requires Node ^20.19.0 || >=22.12.0. The previous 20.18.1 caused npm ci to skip the platform-specific optional dependency, failing the test step in Docker builds.
…l|null, not undefined
…ehind Startup message now includes full status digest (sync date, days behind, last successful import timestamp, 24h stats, pod uptime). /status command shows the same enriched data.
… queries

- Total works via pg_class.reltuples (instant, no table scan)
- 24h works via works_batch + batch join (small table, indexed)
- Avoids COUNT(*) on openalex.works, which would full-scan millions of rows
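The approximate-count query might look like the following (a sketch under assumptions: the table name comes from the bullets above, and reltuples is the planner's row estimate, kept current by autovacuum/ANALYZE rather than a scan):

```typescript
// Assumed query shape; pg_class.reltuples is the planner's row estimate
// for a relation, refreshed by ANALYZE/autovacuum, so no table scan runs.
const totalWorksEstimateSql = `
  SELECT reltuples::bigint AS total_works_estimate
  FROM pg_class
  WHERE oid = 'openalex.works'::regclass;
`;
```

The `::bigint` cast rounds the float estimate to an integer suitable for display with a leading `~`.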
- Records section now shows 24h / 30d / total breakdown
- Last import shows both timestamp and relative time (e.g. "3 hours ago")
- Total works prefixed with ~ to indicate approximate count
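The relative-time rendering can be sketched as below. This is a minimal stand-in: the bot's exact wording and bucket boundaries are assumptions, and pluralization is ignored.

```typescript
// Sketch of relative-time formatting (e.g. "3 hours ago").
// Bucket boundaries are assumed; pluralization is deliberately skipped.
function relativeTime(from: Date, now: Date): string {
  const diffMs = now.getTime() - from.getTime();
  const hours = Math.floor(diffMs / 3_600_000);
  if (hours < 1) return `${Math.floor(diffMs / 60_000)} minutes ago`;
  if (hours < 24) return `${hours} hours ago`;
  return `${Math.floor(hours / 24)} days ago`;
}
```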
…eptions

- Unified shutdown handler sends reason to Telegram before exiting
- Uncaught exceptions now send error + shutdown notification
- SIGTERM shows "K8s rollout or scale-down" so deploys are visible
- All Telegram sends use .catch() to never block shutdown
- SIGTERM/SIGINT exit with 0 (clean shutdown, not error)
- Add 5s timeout on all shutdown Telegram sends and pool.end() so a hung network call doesn't stall until SIGKILL
- Add unhandledRejection handler (Node 15+ exits on these)
- Remove dead beforeExit handler (never fires with cron jobs)
- Make SIGTERM/SIGINT handlers synchronous, fire-and-forget shutdown via void (avoids fragile async signal handlers)
- Point PUBLIC_IPFS_RESOLVER to test IPFS node instead of pub.desci.com
- Test now adds content to test IPFS first, then uses that CID
- Eliminates dependency on external IPFS gateway availability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a /pipelines Telegram bot command that queries the Prefect API and export_metadata to show real-time health of all downstream pipelines (ES, novelty, Qdrant). Shows overall verdict, per-pipeline status (healthy/lagging/stalled/failing), batch progress with percentages, and stall warnings. Gracefully degrades if Prefect is unreachable. Made-with: Cursor
…fect logging Made-with: Cursor
feat(openalex-importer): add /pipelines command for downstream health
…id work within 24h Made-with: Cursor
fix(openalex-importer): simplify pipeline health — healthy if ran + d…
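The simplified health rule named in the commit subject above ("healthy if ran + did work within 24h") can be sketched as a pure predicate. Field names here are illustrative, not the importer's actual types.

```typescript
// Sketch of the simplified pipeline-health rule: a pipeline is healthy
// if it ran within the last 24h AND processed at least one work.
// Parameter names are illustrative assumptions.
const DAY_MS = 24 * 60 * 60 * 1000;

function isPipelineHealthy(
  lastRunAt: Date | null,
  worksProcessed24h: number,
  now: Date,
): boolean {
  if (lastRunAt === null) return false; // never ran
  const ranWithin24h = now.getTime() - lastRunAt.getTime() < DAY_MS;
  return ranWithin24h && worksProcessed24h > 0;
}
```

A pure predicate like this keeps the /pipelines verdict testable without mocking the Prefect API.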
…ommits

The entire day's import was wrapped in one PG transaction. For days with millions of updated works (e.g. Jan 13), this runs for 10+ hours, and if the pod crashes, all progress rolls back and the import restarts from scratch.

- Each 1000-work chunk now saves in its own transaction
- Batch record created/finalized independently outside the stream
- getNextDayToImport only considers finalized batches (finished_at IS NOT NULL)
- Cleanup of orphaned batch records on crash recovery
- Startup message distinguishes crash recovery (🔄) from clean start (🟢)
- Daily digest prefixed with 📅 Daily Update
- /stopupdate and /startupdate commands to pause/resume daily digest

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
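The chunking step can be sketched as a plain splitter (chunk size taken from the commit above; in the real importer each chunk would then be committed in its own transaction, so a crash loses at most one chunk of progress):

```typescript
// Split a day's works into fixed-size chunks. In the importer each chunk
// is saved in its own transaction, bounding the work lost on a crash.
function chunk<T>(items: T[], size = 1000): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```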
…ilure

- Log batchId and queryInfo when the stream pipeline fails, making it easier to trace which batch was left unfinished
- Add comment documenting intentional per-day scope of cleanupUnfinishedBatches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eanup safety, batch index

- Add TELEGRAM_ADMIN_IDS auth check on /stopupdate and /startupdate commands
- Use FOR UPDATE SKIP LOCKED + 5-min staleness guard in cleanupUnfinishedBatches to prevent deleting live batches during rolling restarts
- Add composite index on batch(query_type, query_from, query_to, finished_at)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
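The SKIP LOCKED cleanup described above might look roughly like this (schema and column names assumed from the surrounding commits; the `batch` table and `finished_at` appear in them, the 5-minute window comes from the bullet above):

```typescript
// Assumed query shape for cleanupUnfinishedBatches: delete only batches
// that are unfinished, older than 5 minutes, AND not row-locked by a
// live importer pod (FOR UPDATE SKIP LOCKED skips locked rows entirely).
const cleanupUnfinishedBatchesSql = `
  DELETE FROM batch
  WHERE id IN (
    SELECT id FROM batch
    WHERE finished_at IS NULL
      AND created_at < now() - interval '5 minutes'
    FOR UPDATE SKIP LOCKED
  );
`;
```

During a rolling restart, the old pod still holds its row lock, so the new pod's cleanup skips that batch instead of deleting in-flight work.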
…o syntax, comments

- Add migration 0011 for batch_cleanup_idx composite index
- Fix pino logger syntax: pass object-first in sendDailyDigest
- Add clarifying comment on intentional open-access when TELEGRAM_ADMIN_IDS unset
- Document global scope of hasUnfinishedBatches vs day-scoped cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix(openalex-importer): break single-day tx into per-chunk commits
Today's ipfs.desci.com outage caused a second-order outage in
desci-server: every replica crashed in lockstep on each request that
hit the IPFS gateway during the ~3 minute window it was down. Restart
counts on prod-desci-server pods hit 3 in 18 minutes.
Root cause:
1. controllers/raw/resolve.ts had two unprotected `axios.get` calls to
${ipfsResolver}/${cidString} (the version-by-cid/index branch and
the zip-streaming branch). When ipfs.desci.com returned 503, axios
threw, the rejection was unhandled in the request handler, and...
2. ...index.ts wired BOTH `uncaughtException` and `unhandledRejection`
into the same `handleFatalError` path, which calls cleanup() and
process.exit(1). So a single failing upstream HTTP call took down
the entire pod.
This was a design bug masquerading as a probe failure. A long-running
web server must survive a bad request — the offending handler should
be fixed, but the process must keep serving traffic.
Changes:
- index.ts: split unhandledRejection off from handleFatalError into a
log-only handler. uncaughtException still exits (correct: the process
state is unknown after an uncaught sync exception). Sentry still
picks up unhandled rejections via its global integration, so we
don't lose visibility.
- controllers/raw/resolve.ts: wrap the two unprotected axios.get calls
in try/catch and return a 502 to the client when the IPFS uplink
fails, matching the existing pattern in the latest-version branch
at line 75.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- resolve.ts: replace bare response.data.pipe(res) with stream.pipeline so mid-stream errors (source aborts, client disconnects) are forwarded and both streams are torn down. Add a headersSent guard so we don't attempt to send a JSON 502 after streaming has already begun.
- index.ts: add a .catch() to server.ready() that calls handleFatalError. Now that unhandledRejection is non-fatal, a startup failure here would otherwise be silently logged and the process would limp along half-initialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
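The headersSent guard can be sketched in isolation (a simplified response shape; the real handler uses Express's Response and stream.pipeline's error callback):

```typescript
// Simplified stand-in for Express's Response; only the fields the
// guard touches are modeled here.
interface ResLike {
  headersSent: boolean;
  statusCode: number;
  body?: unknown;
}

// Send a JSON 502 only if streaming hasn't begun. Once headers are out,
// writing a JSON error would corrupt the response; the only safe option
// is to let the stream teardown (stream.pipeline in the real code) run.
function sendUplinkError(res: ResLike): boolean {
  if (res.headersSent) return false;
  res.statusCode = 502;
  res.body = { error: "IPFS uplink failed" };
  return true;
}
```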
…o-crash fix(desci-server): don't crash the pod on unhandled promise rejections
Chore: update prisma
Chore: update prisma sentry
Root cause: /v1/pub/versions calls dpid.org/api/v2/query/history which
takes 5-8 seconds per request (queries Ceramic for version anchoring).
This blocks every dpid page SSR render.
Fix: wrap getIndexedResearchObjects in getOrCache with 1-day TTL.
- First request: 5-8s (cold, fetches from dpid.org)
- Subsequent requests: <100ms (Redis hit)
- Cache invalidated on publish (delFromCache in publish controller)
Uses existing Redis infrastructure (getOrCache, ONE_DAY_TTL).
Trace data:
POST dpid.org/api/v2/query/history {ids:["1077"]} = 7.06s TTFB
GET /v1/pub/versions/{uuid} = 5.27s TTFB (99% is the above call)
All other SSR steps combined = <500ms
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Queries all published nodes from DB and calls getIndexedResearchObjects
for each, populating the indexed-versions-{uuid} cache keys.
Usage:
npx ts-node src/scripts/warm-versions-cache.ts
CONCURRENCY=5 npx ts-node src/scripts/warm-versions-cache.ts
Runs with concurrency=3 by default to avoid overwhelming dpid.org.
Can be run as a one-time migration after deploy, or scheduled as a
daily cron to keep caches warm.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chore: remove legacy desci-server load balancers
perf: cache /v1/pub/versions in Redis (7s → <100ms)
Same pattern as #1281 (which cached /v1/pub/versions). The dpid SSR path in nodes-web-v2 hits both endpoints on every cold render:

- resolvePublishedManifest → GET /<uuid>/<version> (handled by raw/resolve.ts)
- loadDriveTree → GET /v1/data/pubTree/... (handled by data/retrieve.ts)

The first is the slow one — it calls getIndexedResearchObjects (theGraph, ~5-8s) plus an IPFS gateway fetch. The second varies but is typically 1-3s. Together they dominate the dpid cold-load TTFB.

Both responses are content-addressed:

- pubTree by (uuid, manifestCid, rootCid, dataPath, depth) — manifestCid is itself a content hash, so the tuple is fully immutable. No invalidation needed; new publishes mint a new manifestCid and write a fresh cache entry.
- resolve by (uuid, firstParam) where firstParam is "" / index / CID. Index- and CID-keyed entries are immutable. The "latest" key is invalidated in publish.ts alongside the existing `indexed-versions` invalidation.

Cache safety: only success paths cache. Component (PDF/code) responses in resolve.ts and 4xx/5xx responses are NOT cached.

Combined with the Vercel edge cache shipped in nodes-web-v2#1540, cold SSR drops from 12-16s → ~1.5s; warm SSR is unchanged at ~150ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
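The immutability argument for the pubTree key can be made concrete with a key builder. The key format here is hypothetical (the actual format in data/retrieve.ts may differ); the point is that every component is either a content hash or derived from one.

```typescript
// Hypothetical key shape for the pubTree cache entry. Because manifestCid
// is itself a content hash, any change to the tree mints a new manifestCid
// and thus a new key: old entries never go stale, so no invalidation runs.
function pubTreeCacheKey(
  uuid: string,
  manifestCid: string,
  rootCid: string,
  dataPath: string,
  depth: number,
): string {
  return `pubtree-${uuid}-${manifestCid}-${rootCid}-${dataPath}-${depth}`;
}
```

A republish changes manifestCid, so the new render writes a fresh entry while the old one simply ages out of Redis.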
perf: cache /v1/data/pubTree and manifest resolve in Redis
promote main