feat(dedupe): URL canonicalization for cross-source linking #75
Open
phjlljp wants to merge 1 commit into mvanhorn:main
Add canonicalize_url() that normalizes URLs before cross-source comparison: strips tracking params (UTM, fbclid, etc.), removes the www. prefix, normalizes scheme/host case, and drops fragments. cross_source_link() now checks URL identity first (fast path) before falling back to text similarity.

This catches cases where the same article is shared on Reddit, HN, and Bluesky with different tracking parameters or mobile vs. desktop URLs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chidev added a commit to chidev/last30days-skill that referenced this pull request Mar 25, 2026
Owner: See my reply on #76 covering all three of your PRs. Can't commit to anything right now with the v3.0 refactor underway, but I'll consider each one once it lands.
Summary
Problem: Same article shared across sources isn't cross-linked
cross_source_link() detects when the same story is discussed on multiple platforms (e.g., Reddit + HN + Bluesky) and annotates items with [also on: ...] tags. Currently it relies entirely on text similarity (Jaccard on character trigrams and word tokens). This misses a common case: the same URL shared across platforms with different tracking parameters.

For example, these three URLs all point to the same article but would not be cross-linked today:
- https://www.example.com/article?utm_source=reddit&ref=share
- https://example.com/article
- https://www.example.com/article?fbclid=abc123

The titles may also differ (Reddit editorializes, HN uses the original, Bluesky truncates), so text similarity alone doesn't reliably catch these.
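To make the limitation concrete, here is a minimal sketch of Jaccard similarity on character trigrams. It is not the project's actual implementation: the function names `char_trigrams`, `jaccard`, and `title_similarity` are hypothetical, and the normalization (lowercasing, collapsing whitespace) is an assumption.

```python
def char_trigrams(text):
    """Return the set of character trigrams of a lowercased, whitespace-collapsed string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def title_similarity(title_a, title_b):
    """Hypothetical helper: trigram Jaccard between two titles."""
    return jaccard(char_trigrams(title_a), char_trigrams(title_b))


if __name__ == "__main__":
    original = "Example Corp announces new widget"
    editorialized = "Wow, this widget thing from Example Corp looks big"
    # Identical titles score 1.0; a reworded title scores lower even though
    # both posts link to the same article.
    print(title_similarity(original, original))
    print(title_similarity(original, editorialized))
```

Because the score depends only on the title text, two posts that link to the same article under differently worded titles can fall below whatever threshold the matcher uses.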
Solution: URL canonicalization as a fast-path check
This PR adds a canonicalize_url() function that normalizes URLs before comparison:

- Strips tracking parameters (UTM, fbclid, etc.)
- Removes the www. prefix and lowercases the host
- Drops fragments
- Returns None for non-HTTP schemes (file://, javascript:, etc.)

cross_source_link() now pre-computes canonical URLs for all items (O(N), using stdlib urlparse) and checks URL identity first. If two items from different sources have the same canonical URL, they're cross-linked immediately without computing text similarity. This is both faster (string equality vs. Jaccard) and more accurate for URL-shared content.

The existing text similarity path remains as a fallback for cases where the same story is discussed but linked differently (e.g., a Reddit self-post discussing an HN thread).
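The approach described above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: the tracking-parameter set is illustrative (only UTM, fbclid, and ref appear in the description), `link_by_url` and the item shape (dicts with `source` and `url` keys) are hypothetical, and grouping by canonical URL is one possible way to realize the fast path.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, illustrative list; the PR's actual list is not shown in the description.
TRACKING_PARAMS = {"fbclid", "ref"}
TRACKING_PREFIXES = ("utm_",)


def _is_tracking(name):
    lowered = name.lower()
    return lowered in TRACKING_PARAMS or lowered.startswith(TRACKING_PREFIXES)


def canonicalize_url(url):
    """Return a normalized http(s) URL string, or None if the URL is unusable."""
    if not url:
        return None
    try:
        parts = urlsplit(url.strip())
        host, port = parts.hostname, parts.port  # hostname is already lowercased
    except ValueError:
        return None  # malformed host or port
    scheme = parts.scheme.lower()
    if scheme not in ("http", "https") or not host:
        return None
    if host.startswith("www."):
        host = host[4:]
    netloc = f"{host}:{port}" if port else host
    # Keep every query parameter that is not a known tracking parameter.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not _is_tracking(k)]
    # An empty final component drops the fragment.
    return urlunsplit((scheme, netloc, parts.path, urlencode(kept), ""))


def link_by_url(items):
    """Map item index -> sorted list of other sources sharing its canonical URL."""
    by_url = defaultdict(list)
    for index, item in enumerate(items):
        canonical = canonicalize_url(item.get("url"))
        if canonical is not None:  # items without a usable URL skip the fast path
            by_url[canonical].append(index)
    links = {}
    for indices in by_url.values():
        sources = {items[i]["source"] for i in indices}
        if len(sources) > 1:
            for i in indices:
                links[i] = sorted(sources - {items[i]["source"]})
    return links
```

With the three example URLs from the problem statement, each call to `canonicalize_url()` in this sketch yields `https://example.com/article`, so the items land in the same group. Items left out of `links` would still go through the text similarity fallback.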
Design decisions
- The parameters source, context, s, and t are intentionally kept because they serve functional purposes on many sites (e.g., Reddit's ?context=3 for comment threading).
- Items whose URL canonicalizes to None are excluded from URL matching (but still participate in text similarity).

Test plan
- ?context=3 (Reddit) and ?source=rss (functional) are preserved after canonicalization
- file://, javascript://, and ftp:// URLs return None

🤖 Generated with Claude Code