Description:
Create the main synchronization script that coordinates sitemap fetching, article comparison, JSON fetching, database updates, deletion of missing articles, and vector index updates.
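The comparison and deletion steps amount to a diff between the sitemap entries and the database rows. A minimal sketch, assuming both sides are already loaded as dicts mapping article URL to a timezone-aware timestamp (sitemap `lastmod` on one side, database `changed_at` on the other). The names `SyncPlan` and `plan_sync` are hypothetical, and treating only a strictly newer `lastmod` as an update is an assumption the issue does not spell out.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class SyncPlan(NamedTuple):
    """Hypothetical container for the three URL sets the sync acts on."""
    new: set[str]
    updated: set[str]
    deleted: set[str]


def plan_sync(sitemap: dict[str, datetime], db: dict[str, datetime]) -> SyncPlan:
    """Classify URLs by diffing sitemap lastmod against database changed_at."""
    # In the sitemaps but not in the database: new articles.
    new = set(sitemap.keys() - db.keys())
    # In the database but missing from all sitemaps: candidates for deletion.
    deleted = set(db.keys() - sitemap.keys())
    # Present on both sides: updated only if the sitemap is strictly newer (assumption).
    updated = {url for url in sitemap.keys() & db.keys() if sitemap[url] > db[url]}
    return SyncPlan(new, updated, deleted)


# Illustrative data only; the URLs are placeholders.
t_old = datetime(2024, 1, 1, tzinfo=timezone.utc)
t_new = datetime(2024, 2, 1, tzinfo=timezone.utc)
plan = plan_sync(
    sitemap={"https://example.com/a": t_new, "https://example.com/b": t_old, "https://example.com/c": t_old},
    db={"https://example.com/a": t_old, "https://example.com/b": t_old, "https://example.com/d": t_old},
)
```

Here `a` is classified as updated, `c` as new, `d` as deleted, and `b` is left untouched.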
Acceptance criteria:
- Script src/scripts/sync_articles.py with CLI arguments: --dry-run, --batch-size (default: 100), --debug
- Fetches all sitemaps and builds set of current article URLs
- Compares sitemap lastmod against database changed_at to identify new/updated articles
- Fetches article JSON from {url}.json for each changed article with retry logic (3 attempts, exponential backoff)
- Processes updates in batches with per-article error handling (log error, skip article, continue batch)
- Identifies articles in database but missing from all sitemaps and deletes them
- After all article updates complete, triggers vector index update
- Logs summary statistics: articles checked, new, updated, deleted, failed, vector indexes updated
- Dry-run mode logs actions without database modifications
- Exit code 0 on success, 1 on critical failure (e.g., cannot connect to DB)
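The CLI, retry, and batching criteria above could be sketched roughly as follows. This is an illustration under assumptions, not the intended implementation: the helper names (`build_parser`, `with_retry`, `batches`, `process_batch`) are hypothetical, the 1-second base delay is a guess since the issue only says "exponential backoff", and `fetch_json` / `apply_update` stand in for whatever HTTP and database code the project actually uses.

```python
import argparse
import logging
import time
from typing import Any, Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
log = logging.getLogger("sync_articles")


def build_parser() -> argparse.ArgumentParser:
    """CLI flags named in the acceptance criteria."""
    parser = argparse.ArgumentParser(prog="sync_articles.py")
    parser.add_argument("--dry-run", action="store_true", help="log actions without modifying the database")
    parser.add_argument("--batch-size", type=int, default=100, help="articles per batch")
    parser.add_argument("--debug", action="store_true", help="enable debug logging")
    return parser


def with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn up to `attempts` times, sleeping base_delay * 2**n between tries."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts - 1):
        try:
            return fn()
        except Exception:
            sleep(base_delay * 2 ** attempt)
    return fn()  # final attempt: any exception propagates to the caller


def batches(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_batch(urls: Sequence[str], fetch_json: Callable[[str], Any],
                  apply_update: Callable[[str, Any], None], dry_run: bool) -> list[str]:
    """Process one batch; log and skip failing articles. Returns the failed URLs."""
    failed: list[str] = []
    for url in urls:
        try:
            payload = fetch_json(f"{url}.json")  # caller may wrap fetch_json in with_retry
            if not dry_run:
                apply_update(url, payload)
        except Exception as exc:
            log.error("article sync failed", extra={"url": url, "error": str(exc)})
            failed.append(url)
    return failed
```

Injecting `sleep` into `with_retry` keeps the backoff testable without real delays. Exit-code handling (0 on success, 1 on critical failure) would sit in the entry point that calls these helpers.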
Technical details:
- Use structured logging with article_id, url, and error details
- Add progress logging every N articles (e.g., "Processed 500/2000 articles...")
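One possible shape for the logging requirements, using only the standard library. The `JsonFormatter` class and the two progress helpers are hypothetical names, and emitting JSON lines is an assumption: the issue asks for structured logging with `article_id`, `url`, and error details but does not name a format or library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, carrying selected `extra` fields."""

    FIELDS = ("article_id", "url", "error")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            # Fields passed via `extra=` become attributes on the record.
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def should_log_progress(done: int, total: int, every: int) -> bool:
    """True on every `every`-th article and on the final one."""
    return done % every == 0 or done == total


def progress_message(done: int, total: int) -> str:
    return f"Processed {done}/{total} articles..."


# Example wiring; the logger name is a placeholder.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("sync_articles.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.error("fetch failed", extra={"article_id": 42, "url": "https://example.com/a", "error": "timeout"})
```

Keeping the field list explicit avoids leaking unrelated record attributes into the output. Calling `should_log_progress` once per processed article keeps the "every N" check out of the main loop body.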
Design:
Optional details on design for context.