feat(dedupe): URL canonicalization for cross-source linking #75

Open
phjlljp wants to merge 1 commit into mvanhorn:main from phjlljp:refactor/url-canonicalize-dedupe

phjlljp (Contributor) commented Mar 20, 2026

Summary

Problem: Same article shared across sources isn't cross-linked

cross_source_link() detects when the same story is discussed on multiple platforms (e.g., Reddit + HN + Bluesky) and annotates items with [also on: ...] tags. Currently it relies entirely on text similarity (Jaccard on character trigrams and word tokens). This misses a common case: the same URL shared across platforms with different tracking parameters.

For example, these three URLs all point to the same article but would not be cross-linked today:

  • Reddit: https://www.example.com/article?utm_source=reddit&ref=share
  • HN: https://example.com/article
  • Bluesky: https://www.example.com/article?fbclid=abc123

The titles may also differ (Reddit editorializes, HN uses the original, Bluesky truncates), so text similarity alone doesn't reliably catch these.
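
To illustrate why text similarity alone can miss these, here is a minimal sketch of Jaccard similarity over character trigrams and word tokens. The helper names and both titles are hypothetical, and the project's actual tokenization and thresholds are not shown in this PR.

```python
def char_trigrams(text: str) -> set[str]:
    """Character trigrams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}


def word_tokens(text: str) -> set[str]:
    """Lowercased whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical titles for the same article on two platforms.
original = "Example Corp releases new database engine"
editorialized = "This changes everything for backend devs"

trigram_score = jaccard(char_trigrams(original), char_trigrams(editorialized))
token_score = jaccard(word_tokens(original), word_tokens(editorialized))
print(f"trigram={trigram_score:.2f} token={token_score:.2f}")
```

Both scores are low for this pair even though the two posts would link to the same article, which is the gap the URL check is meant to close.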

Solution: URL canonicalization as a fast-path check

This PR adds a canonicalize_url() function that normalizes URLs before comparison:

  • Strips tracking/analytics query parameters (UTM, fbclid, gclid, msclkid, igshid, etc.)
  • Removes www. prefix and lowercases the host
  • Strips trailing slashes and URL fragments
  • Rejects non-HTTP schemes (file://, javascript:, etc.)
  • Sorts remaining query parameters for consistent comparison
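
The steps above can be sketched roughly as follows. This is an illustration built on the stdlib `urllib.parse`, not the PR's actual code: the `TRACKING_PARAMS` set is a hypothetical subset, and including `ref` in it is an assumption inferred from the Reddit example URL above.

```python
from urllib.parse import parse_qsl, urlencode, urlparse

# Hypothetical subset of tracking parameters; the PR's real list is not shown.
# "ref" is assumed to be included because the Reddit example relies on it.
TRACKING_PARAMS = {"fbclid", "gclid", "msclkid", "igshid", "ref"}


def canonicalize_url(url):
    """Return a normalized URL string, or None for empty or non-HTTP(S) input."""
    if not url:
        return None
    parts = urlparse(url.strip())
    if parts.scheme not in ("http", "https"):  # urlparse lowercases the scheme
        return None

    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if not host:
        return None

    path = parts.path.rstrip("/")

    # Drop utm_* and known tracking params, then sort what remains.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
        and key.lower() not in TRACKING_PARAMS
    )

    canonical = f"{parts.scheme}://{host}{path}"
    if kept:
        canonical += "?" + urlencode(kept)
    return canonical  # the fragment is never re-attached


for raw in (
    "https://www.example.com/article?utm_source=reddit&ref=share",
    "https://example.com/article",
    "https://www.example.com/article?fbclid=abc123",
):
    print(canonicalize_url(raw))
```

Under these assumptions, all three example URLs reduce to `https://example.com/article`, while a functional parameter such as `?context=3` survives.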

cross_source_link() now pre-computes canonical URLs for all items (O(N), using stdlib urlparse) and checks URL identity first. If two items from different sources have the same canonical URL, they're cross-linked immediately without computing text similarity. This is both faster (string equality vs. Jaccard) and more accurate for URL-shared content.

The existing text similarity path remains as a fallback for cases where the same story is discussed but linked differently (e.g., a Reddit self-post discussing an HN thread).
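
A rough sketch of the fast path with its fallback is below. The item shape (`source`, `url`, `title`, `also_on`) and the injected `canonicalize` and `is_similar` callables are assumptions made for illustration; the real `cross_source_link()` signature and data model are not shown in this description.

```python
def cross_source_link(items, canonicalize, is_similar):
    """Annotate each item with the other sources that carry the same story."""
    # Pre-compute canonical URLs once per item (O(N)), outside the pairwise loop.
    canon = [canonicalize(item.get("url") or "") for item in items]
    for item in items:
        item.setdefault("also_on", set())

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if a["source"] == b["source"]:
                continue
            # Fast path: string equality on canonical URLs.
            same_url = canon[i] is not None and canon[i] == canon[j]
            # Fallback: text similarity, evaluated only when URLs differ.
            if same_url or is_similar(a["title"], b["title"]):
                a["also_on"].add(b["source"])
                b["also_on"].add(a["source"])
    return items


def strip_query_and_www(url):
    """Crude stand-in canonicalizer for this demo only."""
    if not url.startswith(("http://", "https://")):
        return None
    return url.split("?")[0].replace("://www.", "://")


items = [
    {"source": "reddit", "title": "Editorialized title",
     "url": "https://www.example.com/article?utm_source=reddit&ref=share"},
    {"source": "hn", "title": "Original title",
     "url": "https://example.com/article"},
    {"source": "bluesky", "title": "Truncated ti",
     "url": "https://www.example.com/article?fbclid=abc123"},
]

# Text similarity is disabled here so that only the URL path can link items.
linked = cross_source_link(items, strip_query_and_www, lambda a, b: False)
for item in linked:
    print(item["source"], sorted(item["also_on"]))
```

Because `or` short-circuits, the similarity function is never called for a pair whose canonical URLs already match.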

Design decisions

  • Conservative tracking param list: Only strips params that are unambiguously tracking-related. Generic params like source, context, s, t are intentionally kept because they serve functional purposes on many sites (e.g., Reddit's ?context=3 for comment threading).
  • Pre-computed, not inline: Canonical URLs are computed once per item before the O(N²) pairwise loop, not inside it.
  • Scheme restriction: Non-HTTP URLs return None and are excluded from URL matching (but still participate in text similarity).

Test plan

  • Existing test suite passes (461/465; the 4 failures are pre-existing)
  • URLs with UTM/fbclid params canonicalize to the same string as bare URLs
  • ?context=3 (Reddit) and ?source=rss (functional) are preserved after canonicalization
  • file://, javascript://, and ftp:// URLs return None
  • Cross-source linking correctly links Reddit, HN, and Bluesky items sharing the same article URL with different tracking params (verified bidirectional cross-refs)

🤖 Generated with Claude Code

Add canonicalize_url() that normalizes URLs before cross-source
comparison: strips tracking params (UTM, fbclid, etc.), removes
www. prefix, normalizes scheme/host case, and drops fragments.

cross_source_link() now checks URL identity first (fast path) before
falling back to text similarity. This catches cases where the same
article is shared on Reddit, HN, and Bluesky with different tracking
parameters or mobile vs desktop URLs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chidev added a commit to chidev/last30days-skill that referenced this pull request Mar 25, 2026
mvanhorn (Owner) commented

See my reply on #76 covering all three of your PRs. Can't commit to anything right now with the v3.0 refactor underway, but I'll consider each one once it lands.
