
promote main #1284

Open
hubsmoke wants to merge 43 commits into main from develop

Conversation

@hubsmoke
Member

promote main

hubsmoke and others added 30 commits April 4, 2026 05:42
The importer was crash-looping because OpenAlex returned mesh entries
with null primary key columns. After dedup filtering removed all rows,
an empty array was passed to pgp.helpers.insert() which throws.

- Add empty-array guards after dedup in updateWorksMesh, updateWorksConcepts,
  updateWorksTopics
- Make cron error handler resilient (log + retry instead of crash)
- Add npm test gate to Dockerfile so broken code can't deploy
- Add realistic Work fixtures covering null PKs, empty arrays, missing
  locations, and other real API edge cases
- Add transformer tests (13) and saveData integration tests (8)
- Add dedup edge-case tests for all-null-PK scenario (3)

Made-with: Cursor
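
For illustration, a minimal TypeScript sketch of the guard described above (the table and column names are assumptions, not the importer's real schema):

  import pgPromise, { ITask } from 'pg-promise';

  const pgp = pgPromise();

  // Illustrative column set; the real importer schema may differ.
  const meshColumns = new pgp.helpers.ColumnSet(
    ['work_id', 'descriptor_ui', 'descriptor_name'],
    { table: { schema: 'openalex', table: 'works_mesh' } },
  );

  async function updateWorksMesh(tx: ITask<object>, rows: Record<string, unknown>[]) {
    // Drop rows whose primary-key columns came back null from the API.
    const deduped = rows.filter((r) => r.work_id != null && r.descriptor_ui != null);

    // Guard: pgp.helpers.insert() throws when given an empty array.
    if (deduped.length === 0) return;

    await tx.none(pgp.helpers.insert(deduped, meshColumns));
  }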
…Data tests

Remove dead insertCalls array and unnecessary mock wrapper per CR feedback.
The tests use the real pgp.helpers for SQL generation with a mock transaction.

Made-with: Cursor
- Daily digest cron (default 9:00 UTC) reports sync position, days
  imported, duration, failed batches, and days-behind status
- Error notifications rate-limited to 1/hour to avoid crash-loop spam
- "Caught up" notification fires only on state transition (importing →
  idle), not on every cron tick
- Digest queries the batch table directly so stats survive pod restarts
- Gracefully no-ops when TELEGRAM_BOT_TOKEN / TELEGRAM_CHAT_ID are not
  set (opt-in via SOPS secret)
- 19 new tests covering send, rate limiting, dedup, error fallback

Env vars to configure:
  TELEGRAM_BOT_TOKEN  — Bot API token from @BotFather
  TELEGRAM_CHAT_ID    — Chat/group ID (e.g. -1002207868111)
  TELEGRAM_THREAD_ID  — Optional topic thread ID
  DIGEST_SCHEDULE     — Cron expression (default: '0 9 * * *')
Uses long-polling (getUpdates) — outbound HTTPS only, no ingress needed.
Bot responds to /status with sync position, last 24h stats, pod uptime.
Shared buildDigestMessage between daily digest and /status command.
Polling stops cleanly on SIGTERM/SIGINT via AbortController.
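
A rough sketch of the polling loop under those constraints (function and variable names here are illustrative, not the actual module):

  // Stops cleanly when SIGTERM/SIGINT abort the controller.
  const controller = new AbortController();
  process.once('SIGTERM', () => controller.abort());
  process.once('SIGINT', () => controller.abort());

  async function pollTelegram(token: string, onText: (text: string, chatId: number) => void) {
    let offset = 0;
    while (!controller.signal.aborted) {
      try {
        const res = await fetch(
          `https://api.telegram.org/bot${token}/getUpdates?offset=${offset}&timeout=30`,
          { signal: controller.signal },
        );
        const body = (await res.json()) as { result?: Array<any> };
        for (const update of body.result ?? []) {
          offset = update.update_id + 1;
          if (update.message?.text) onText(update.message.text, update.message.chat.id);
        }
      } catch {
        if (controller.signal.aborted) break;           // clean shutdown, not an error
        await new Promise((r) => setTimeout(r, 5_000)); // back off on transient failures
      }
    }
  }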
- Add explicit radix to parseInt(threadId, 10)
- Remove duplicate JSDoc on buildDigestMessage
- Log error in /status command catch block
- Fix cooldown test: use Date.now spy instead of resetModules
…k-crash

fix(openalex-importer): guard against empty arrays after dedup filtering
Builds Docker image on push to main/develop, pushes to ECR with
sha-timestamp tags for Flux image automation.
…ings

vitest 4.x uses rolldown which requires Node ^20.19.0 || >=22.12.0.
The previous 20.18.1 caused npm ci to skip the platform-specific
optional dependency, failing the test step in Docker builds.
…ehind

Startup message now includes full status digest (sync date, days behind,
last successful import timestamp, 24h stats, pod uptime).
/status command shows the same enriched data.
… queries

- Total works via pg_class.reltuples (instant, no table scan)
- 24h works via works_batch + batch join (small table, indexed)
- Avoids COUNT(*) on openalex.works which would full-scan millions of rows
- Records section now shows 24h / 30d / total breakdown
- Last import shows both timestamp and relative time (e.g. "3 hours ago")
- Total works prefixed with ~ to indicate approximate count
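
A hypothetical shape of those queries (table and column names assumed from the description above):

  import { IDatabase } from 'pg-promise';

  async function fetchWorkCounts(db: IDatabase<object>) {
    // Planner estimate from pg_class: instant, but approximate (hence the ~ prefix).
    const { estimate } = await db.one<{ estimate: string }>(
      `SELECT reltuples::bigint AS estimate
         FROM pg_class
        WHERE oid = 'openalex.works'::regclass`,
    );

    // Exact 24h figure from the small, indexed batch tables; no scan of works.
    const { recent } = await db.one<{ recent: number }>(
      `SELECT COUNT(*)::int AS recent
         FROM openalex.works_batch wb
         JOIN openalex.batch b ON b.id = wb.batch_id
        WHERE b.finished_at > now() - interval '24 hours'`,
    );

    return { approximateTotal: Number(estimate), last24h: recent };
  }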
…eptions

- Unified shutdown handler sends reason to Telegram before exiting
- Uncaught exceptions now send error + shutdown notification
- SIGTERM shows "K8s rollout or scale-down" so deploys are visible
- All Telegram sends use .catch() to never block shutdown
- SIGTERM/SIGINT exit with 0 (clean shutdown, not error)
- Add 5s timeout on all shutdown Telegram sends and pool.end()
  so a hung network call doesn't stall until SIGKILL
- Add unhandledRejection handler (Node 15+ exits on these)
- Remove dead beforeExit handler (never fires with cron jobs)
- Make SIGTERM/SIGINT handlers synchronous, fire-and-forget
  shutdown via void (avoids fragile async signal handlers)
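
Roughly, the shutdown pattern looks like this (pool and notifyTelegram stand in for the importer's real PG pool and Telegram client):

  import { Pool } from 'pg';

  const pool = new Pool();
  const notifyTelegram = async (text: string): Promise<void> => { /* Bot API send, omitted */ };

  // Cap any awaited step at 5s so a hung network call can't stall until SIGKILL.
  const withTimeout = <T>(p: Promise<T>, ms = 5_000) =>
    Promise.race([p, new Promise<undefined>((resolve) => setTimeout(resolve, ms))]);

  async function shutdown(reason: string, exitCode: number) {
    await withTimeout(notifyTelegram(`Shutting down: ${reason}`).catch(() => undefined));
    await withTimeout(pool.end());
    process.exit(exitCode);
  }

  // Handlers stay synchronous; shutdown is fired and forgotten via `void`.
  process.once('SIGTERM', () => void shutdown('K8s rollout or scale-down', 0));
  process.once('SIGINT', () => void shutdown('interrupted', 0));
  process.once('uncaughtException', (err) => void shutdown(`uncaught exception: ${err.message}`, 1));
  process.once('unhandledRejection', (reason) => void shutdown(`unhandled rejection: ${String(reason)}`, 1));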
- Point PUBLIC_IPFS_RESOLVER to test IPFS node instead of pub.desci.com
- Test now adds content to test IPFS first, then uses that CID
- Eliminates dependency on external IPFS gateway availability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a /pipelines Telegram bot command that queries the Prefect API
and export_metadata to show real-time health of all downstream
pipelines (ES, novelty, Qdrant). Shows overall verdict, per-pipeline
status (healthy/lagging/stalled/failing), batch progress with
percentages, and stall warnings. Gracefully degrades if Prefect is
unreachable.

Made-with: Cursor
feat(openalex-importer): add /pipelines command for downstream health
fix(openalex-importer): simplify pipeline health — healthy if ran + d…
…ommits

The entire day's import was wrapped in one PG transaction. For days with
millions of updated works (e.g. Jan 13), this runs for 10+ hours and if
the pod crashes, all progress rolls back and it restarts from scratch.

- Each 1000-work chunk now saves in its own transaction
- Batch record created/finalized independently outside the stream
- getNextDayToImport only considers finalized batches (finished_at IS NOT NULL)
- Cleanup of orphaned batch records on crash recovery
- Startup message distinguishes crash recovery (🔄) from clean start (🟢)
- Daily digest prefixed with 📅 Daily Update
- /stopupdate and /startupdate commands to pause/resume daily digest

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
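
Schematically (table and column names are assumptions, and saveChunk stands in for the real per-chunk writer):

  import { IDatabase, ITask } from 'pg-promise';

  async function importDay(
    db: IDatabase<object>,
    day: string,
    chunks: AsyncIterable<unknown[]>,
    saveChunk: (tx: ITask<object>, chunk: unknown[], batchId: number) => Promise<void>,
  ) {
    // Batch record created up front, outside the stream.
    const { id: batchId } = await db.one<{ id: number }>(
      `INSERT INTO batch (query_type, query_from, query_to)
       VALUES ('updated', $1, $1) RETURNING id`,
      [day],
    );

    for await (const chunk of chunks) {
      // One transaction per ~1000-work chunk: a crash loses at most the chunk in flight.
      await db.tx((tx) => saveChunk(tx, chunk, batchId));
    }

    // Finalize; getNextDayToImport only counts batches with finished_at IS NOT NULL.
    await db.none(`UPDATE batch SET finished_at = now() WHERE id = $1`, [batchId]);
  }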
…ilure

- Log batchId and queryInfo when the stream pipeline fails, making it
  easier to trace which batch was left unfinished
- Add comment documenting intentional per-day scope of cleanupUnfinishedBatches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eanup safety, batch index

- Add TELEGRAM_ADMIN_IDS auth check on /stopupdate and /startupdate commands
- Use FOR UPDATE SKIP LOCKED + 5-min staleness guard in cleanupUnfinishedBatches
  to prevent deleting live batches during rolling restarts
- Add composite index on batch(query_type, query_from, query_to, finished_at)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
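
A rough sketch of the guarded cleanup (the 5-minute window comes from the description above; the created_at column and exact schema are assumptions):

  import { IDatabase } from 'pg-promise';

  async function cleanupUnfinishedBatches(db: IDatabase<object>, day: string) {
    // SKIP LOCKED leaves rows another pod currently holds; the staleness window
    // keeps a just-created batch from being deleted during a rolling restart.
    await db.none(
      `DELETE FROM batch
        WHERE id IN (
          SELECT id FROM batch
           WHERE finished_at IS NULL
             AND query_from = $1
             AND created_at < now() - interval '5 minutes'
             FOR UPDATE SKIP LOCKED
        )`,
      [day],
    );
  }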
…o syntax, comments

- Add migration 0011 for batch_cleanup_idx composite index
- Fix pino logger syntax: pass object-first in sendDailyDigest
- Add clarifying comment on intentional open-access when TELEGRAM_ADMIN_IDS unset
- Document global scope of hasUnfinishedBatches vs day-scoped cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix(openalex-importer): break single-day tx into per-chunk commits
Today's ipfs.desci.com outage caused a second-order outage in
desci-server: every replica crashed in lockstep on each request that
hit the IPFS gateway during the ~3 minute window it was down. Restart
counts on prod-desci-server pods hit 3 in 18 minutes.

Root cause:

1. controllers/raw/resolve.ts had two unprotected `axios.get` calls to
   ${ipfsResolver}/${cidString} (the version-by-cid/index branch and
   the zip-streaming branch). When ipfs.desci.com returned 503, axios
   threw, the rejection was unhandled in the request handler, and...

2. ...index.ts wired BOTH `uncaughtException` and `unhandledRejection`
   into the same `handleFatalError` path, which calls cleanup() and
   process.exit(1). So a single failing upstream HTTP call took down
   the entire pod.

This was a design bug masquerading as a probe failure. A long-running
web server must survive a bad request — the offending handler should
be fixed, but the process must keep serving traffic.

Changes:

- index.ts: split unhandledRejection off from handleFatalError into a
  log-only handler. uncaughtException still exits (correct: the process
  state is unknown after an uncaught sync exception). Sentry still
  picks up unhandled rejections via its global integration, so we
  don't lose visibility.
- controllers/raw/resolve.ts: wrap the two unprotected axios.get calls
  in try/catch and return a 502 to the client when the IPFS uplink
  fails, matching the existing pattern in the latest-version branch
  at line 75.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
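
In outline (route shape and variable names are assumed from the description; only the pattern matters):

  import axios from 'axios';
  import pino from 'pino';
  import type { Request, Response } from 'express';

  const logger = pino();

  // index.ts side: unhandled rejections are logged, not fatal; uncaughtException still exits.
  process.on('unhandledRejection', (reason) => {
    logger.error({ reason }, 'unhandledRejection (non-fatal, process keeps serving)');
  });

  // resolve.ts side: the gateway call is wrapped so an upstream 503 becomes a 502
  // to the client instead of an unhandled rejection that kills the pod.
  export async function resolveByCid(req: Request, res: Response) {
    const ipfsResolver = process.env.PUBLIC_IPFS_RESOLVER;
    const cidString = req.params.cid;
    try {
      const response = await axios.get(`${ipfsResolver}/${cidString}`, { responseType: 'stream' });
      response.data.pipe(res); // the follow-up commit hardens this with stream.pipeline
    } catch (err) {
      logger.error({ err, cidString }, 'IPFS uplink failed');
      res.status(502).json({ error: 'IPFS gateway unavailable' });
    }
  }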
- resolve.ts: replace bare response.data.pipe(res) with
  stream.pipeline so mid-stream errors (source aborts, client
  disconnects) are forwarded and both streams are torn down. Add a
  headersSent guard so we don't attempt to send a JSON 502 after
  streaming has already begun.
- index.ts: add a .catch() to server.ready() that calls
  handleFatalError. Now that unhandledRejection is non-fatal, a
  startup failure here would otherwise be silently logged and the
  process would limp along half-initialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
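
A sketch of that streaming path (handler name and route assumed):

  import { pipeline } from 'node:stream';
  import axios from 'axios';
  import type { Request, Response } from 'express';

  export async function streamFromGateway(req: Request, res: Response) {
    try {
      const upstream = await axios.get(
        `${process.env.PUBLIC_IPFS_RESOLVER}/${req.params.cid}`,
        { responseType: 'stream' },
      );
      // pipeline forwards mid-stream errors and tears down both streams.
      pipeline(upstream.data, res, (err) => {
        if (err && !res.headersSent) {
          res.status(502).json({ error: 'IPFS stream failed' });
        }
      });
    } catch (err) {
      // headersSent guard: never try to send JSON after streaming has begun.
      if (!res.headersSent) res.status(502).json({ error: 'IPFS uplink failed' });
    }
  }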
…o-crash

fix(desci-server): don't crash the pod on unhandled promise rejections
ogbanugot and others added 11 commits April 14, 2026 18:08
Root cause: /v1/pub/versions calls dpid.org/api/v2/query/history which
takes 5-8 seconds per request (queries Ceramic for version anchoring).
This blocks every dpid page SSR render.

Fix: wrap getIndexedResearchObjects in getOrCache with 1-day TTL.
- First request: 5-8s (cold, fetches from dpid.org)
- Subsequent requests: <100ms (Redis hit)
- Cache invalidated on publish (delFromCache in publish controller)

Uses existing Redis infrastructure (getOrCache, ONE_DAY_TTL).

Trace data:
  POST dpid.org/api/v2/query/history {ids:["1077"]} = 7.06s TTFB
  GET /v1/pub/versions/{uuid} = 5.27s TTFB (99% is the above call)
  All other SSR steps combined = <500ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
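
The repo's getOrCache/ONE_DAY_TTL helpers are referenced above; a minimal stand-in showing the shape of the wrapper (the real signature may differ):

  import Redis from 'ioredis';

  const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
  const ONE_DAY_TTL = 60 * 60 * 24; // seconds

  async function getOrCache<T>(key: string, fn: () => Promise<T>, ttl = ONE_DAY_TTL): Promise<T> {
    const hit = await redis.get(key);
    if (hit) return JSON.parse(hit) as T; // warm path: <100ms
    const value = await fn();             // cold path: 5-8s dpid.org history query
    await redis.set(key, JSON.stringify(value), 'EX', ttl);
    return value;
  }

  // Versions controller usage (per the message above); publish invalidates the key:
  //   const history = await getOrCache(`indexed-versions-${uuid}`, () => getIndexedResearchObjects([uuid]));
  //   await redis.del(`indexed-versions-${uuid}`);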
Queries all published nodes from DB and calls getIndexedResearchObjects
for each, populating the indexed-versions-{uuid} cache keys.

Usage:
  npx ts-node src/scripts/warm-versions-cache.ts
  CONCURRENCY=5 npx ts-node src/scripts/warm-versions-cache.ts

Runs with concurrency=3 by default to avoid overwhelming dpid.org.
Can be run as a one-time migration after deploy, or scheduled as a
daily cron to keep caches warm.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
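
A rough outline of the script's concurrency loop (helper names are illustrative):

  const CONCURRENCY = Number(process.env.CONCURRENCY ?? 3);

  async function warmAll(uuids: string[], warmOne: (uuid: string) => Promise<void>) {
    for (let i = 0; i < uuids.length; i += CONCURRENCY) {
      // Each slice runs in parallel; individual failures are tolerated so one bad
      // node doesn't abort the whole warm-up run.
      await Promise.allSettled(uuids.slice(i, i + CONCURRENCY).map(warmOne));
    }
  }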
Chore: remove legacy desci-server load balancers
perf: cache /v1/pub/versions in Redis (7s → <100ms)
@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ba10c7d-a553-4eef-8def-095e69475c25

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

hubsmoke and others added 2 commits May 2, 2026 21:41
Same pattern as #1281 (which cached /v1/pub/versions). The dpid SSR path
in nodes-web-v2 hits both endpoints on every cold render:

  resolvePublishedManifest → GET /<uuid>/<version>     (handled by raw/resolve.ts)
  loadDriveTree           → GET /v1/data/pubTree/...  (handled by data/retrieve.ts)

The first is the slow one — it calls getIndexedResearchObjects (theGraph,
~5-8s) plus an IPFS gateway fetch. The second varies but is typically
1-3s. Together they dominate the dpid cold-load TTFB.

Both responses are content-addressed:
- pubTree by (uuid, manifestCid, rootCid, dataPath, depth) — manifestCid
  is itself a content hash, so the tuple is fully immutable. No
  invalidation needed; new publishes mint a new manifestCid and write
  a fresh cache entry.
- resolve by (uuid, firstParam) where firstParam is "" / index / CID.
  Index- and CID-keyed entries are immutable. The "latest" key is
  invalidated in publish.ts alongside the existing `indexed-versions`
  invalidation.

Cache safety: only success paths cache. Component (PDF/code) responses
in resolve.ts and 4xx/5xx responses are NOT cached.

Combined with the Vercel edge cache shipped in nodes-web-v2#1540, cold
SSR drops from 12-16s → ~1.5s; warm SSR is unchanged at ~150ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
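
For the pubTree side, the cache key might be composed roughly like this (the exact key format is an assumption; the invariant that matters is that manifestCid makes the tuple immutable):

  const pubTreeCacheKey = (
    uuid: string,
    manifestCid: string,
    rootCid: string,
    dataPath: string,
    depth?: number,
  ) =>
    // manifestCid is itself a content hash, so this tuple never needs invalidation;
    // a new publish mints a new manifestCid and therefore a fresh key.
    `pub-tree-${uuid}-${manifestCid}-${rootCid}-${dataPath}-${depth ?? 'full'}`;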
perf: cache /v1/data/pubTree and manifest resolve in Redis
