[Prod DB Integration] Article Synchronization Orchestrator Script #70

@Enniwhere

Description:
Create the main synchronization script that coordinates sitemap fetching, article comparison, JSON fetching, database updates, deletion of missing articles, and vector index updates.
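
The comparison and deletion steps named above can be pictured as a pure function over two URL-to-timestamp mappings. This is only a minimal sketch, not the script's actual design: `SyncPlan` and `plan_sync` are hypothetical names, and it assumes that sitemap `lastmod` and database `changed_at` values are both available as timezone-aware `datetime` objects keyed by article URL.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class SyncPlan(NamedTuple):
    """Hypothetical container for the outcome of the comparison step."""
    new: set[str]      # in a sitemap, not in the database
    updated: set[str]  # in both, sitemap lastmod is newer than changed_at
    deleted: set[str]  # in the database, missing from all sitemaps


def plan_sync(sitemap: dict[str, datetime], db: dict[str, datetime]) -> SyncPlan:
    """Compare sitemap lastmod values against database changed_at values.

    Assumption: all datetimes are timezone-aware. Comparing naive and
    aware datetimes raises TypeError in Python.
    """
    new = set(sitemap.keys() - db.keys())
    deleted = set(db.keys() - sitemap.keys())
    updated = {
        url for url in sitemap.keys() & db.keys()
        if sitemap[url] > db[url]  # strictly newer counts as updated
    }
    return SyncPlan(new, updated, deleted)
```

Keeping this step free of I/O would make it easy to unit test and to reuse in dry-run mode, since it only decides what should happen without touching the database.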

Acceptance criteria:

  • Script src/scripts/sync_articles.py with CLI arguments: --dry-run, --batch-size (default: 100), --debug
  • Fetches all sitemaps and builds a set of current article URLs
  • Compares sitemap lastmod against database changed_at to identify new/updated articles
  • Fetches article JSON from {url}.json for each changed article with retry logic (3 attempts, exponential backoff)
  • Processes updates in batches with per-article error handling (log error, skip article, continue batch)
  • Identifies articles that are in the database but missing from all sitemaps, and deletes them
  • After all article updates complete, triggers vector index update
  • Logs summary statistics: articles checked, new, updated, deleted, failed, vector indexes updated
  • Dry-run mode logs actions without database modifications
  • Exit code 0 on success, 1 on critical failure (e.g., cannot connect to DB)
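
The retry and batching criteria above could look roughly like the sketch below. It is an illustration under assumptions, not the required implementation: `with_retry`, `batched`, and `process_articles` are hypothetical names, the `fetch` callable stands in for whatever HTTP client the script ends up using, the 1-second base delay is a guess (the issue only specifies 3 attempts with exponential backoff), and the `sleep` parameter exists only so the backoff can be tested without waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def with_retry(func: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # real code should catch narrower, transient errors
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller handle it
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, ... with the defaults
    raise AssertionError("unreachable")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items."""
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_articles(urls: Iterable[str], fetch: Callable[[str], dict],
                     batch_size: int = 100,
                     sleep: Callable[[float], None] = time.sleep):
    """Fetch {url}.json for each URL; a failing article is skipped, not fatal."""
    fetched, failed = [], []
    for batch in batched(urls, batch_size):
        for url in batch:
            try:
                article = with_retry(lambda: fetch(url + ".json"), sleep=sleep)
            except Exception:
                failed.append(url)  # log the error here, then continue the batch
                continue
            fetched.append(article)
        # a real script would write this batch to the database here
    return fetched, failed
```

With the defaults, an article that never succeeds costs three fetch calls and two sleeps (1 s, then 2 s) before it is counted as failed and the batch moves on.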

Technical details:

  • Use structured logging with article_id, url, and error details
  • Add progress logging every N articles (e.g., "Processed 500/2000 articles...")
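
As one possible reading of the two logging points above, the sketch below uses only the standard `logging` module. `JsonFormatter` and `log_progress` are hypothetical names, JSON output is an assumption (the issue asks for structured logging but does not name a format), and the default interval of 500 is taken from the example message rather than from any stated requirement.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including the structured fields."""

    FIELDS = ("article_id", "url", "error")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            # values passed via `extra=` become attributes on the record
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def log_progress(logger: logging.Logger, done: int, total: int, every: int = 500) -> None:
    """Log progress every `every` articles and once more at the end."""
    if done % every == 0 or done == total:
        logger.info("Processed %d/%d articles...", done, total)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("sync_articles")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    # per-article failure with the fields named in the issue
    logger.error("fetch failed", extra={
        "article_id": 42, "url": "https://example.com/a", "error": "timeout",
    })
    for done in range(1, 2001):
        log_progress(logger, done, 2000)
```

Passing the fields through `extra=` keeps them out of the message string, so they stay machine-readable regardless of how the message is worded.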

