Adding check duplicate script, in pre commit hook#59
Adding check duplicate script, in pre commit hook#59milosbjelceviczuhlke wants to merge 7 commits intomainfrom
Conversation
|
This is very cool. Thanks @milosbjelceviczuhlke. Could you also please add the duplicate check to run as part of the PR Checks here: |
There was a problem hiding this comment.
Pull Request Overview
This PR adds a duplicate topic detection script and integrates it into both the pre-commit hooks and CI workflow to prevent duplicate topics from being added to the repository.
Key Changes:
- New Python script to detect duplicate slugs, duplicate titles, and similar titles using fuzzy matching
- Integration into pre-commit hooks for local validation
- Integration into GitHub Actions PR checks workflow
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| scripts/check_duplicate_topics.py | New duplicate detection script that checks for duplicate slugs, titles, and fuzzy-matches similar titles |
| justfile | Adds check-duplicates command to run the duplicate detection script |
| .pre-commit-config.yaml | Adds pre-commit hook to run duplicate topic check before commits |
| .github/workflows/pr-checks.yml | Adds duplicate check step to CI workflow to validate PRs |
pburls
left a comment
There was a problem hiding this comment.
Thanks for adding the PR check.
Please take a look at the suggestions from copilot.
* content: Create FSI Software Engineer achetype and transfer content into it (#42). Add content linked to external knowledge base and structured based on the Dreyfus model. Add topics for supporting material for the archetype. * Change Financial crimes to level-3 heading * content: Re-factored FSI Software Engineer archetype to conform to data architecture * Update frontmatter structure for FSI archetype across all topics to enhance consistency and usability; Implement new macros for rendering descriptions and skill stages in archetype and topic files. * Undo set shell in justfile
| norm_to_original[nt] = t["title"] | ||
|
|
There was a problem hiding this comment.
Potential issue: The norm_to_original dictionary may not always map normalized titles correctly when multiple topics have the same normalized title. If two topics normalize to the same value, only the first one's original title is stored (line 99), but when reporting similar topics (lines 131-132), both could reference the same original title instead of their actual original titles.
Consider changing the data structure to track all original titles for each normalized title:
norm_to_original = {}
for t in topics:
nt = normalize_title(t["title"])
if nt not in norm_to_original:
norm_to_original[nt] = []
if t["title"] not in norm_to_original[nt]:
norm_to_original[nt].append(t["title"])Then when displaying, use the first occurrence:
f" - '{norm_to_original[a][0]}'\n"| norm_to_original[nt] = t["title"] | |
| norm_to_original[nt] = [] | |
| if t["title"] not in norm_to_original[nt]: | |
| norm_to_original[nt].append(t["title"]) |
|
|
||
| """Convert title to comparable simplified form. | ||
| Removes known acronyms (as standalone words) for duplicate detection. | ||
| Expand ACRONYMS_TO_REMOVE as needed. | ||
| """ | ||
|
|
||
| if not t: | ||
| return "" | ||
| t = t.lower().strip() | ||
| t = t.replace("(", "").replace(")", "") | ||
| t = t.replace("-", " ") | ||
| # Remove known acronyms as standalone words | ||
| words = t.split() | ||
| words = [w for w in words if w not in ACRONYMS_TO_REMOVE] | ||
| return " ".join(words) | ||
|
|
There was a problem hiding this comment.
Inconsistent indentation: The function body is incorrectly indented with 8 spaces instead of the standard 4 spaces. The docstring and function implementation should align with standard Python indentation conventions.
def normalize_title(t: str):
"""Convert title to comparable simplified form.
Removes known acronyms (as standalone words) for duplicate detection.
Expand ACRONYMS_TO_REMOVE as needed.
"""
if not t:
return ""
t = t.lower().strip()
t = t.replace("(", "").replace(")", "")
t = t.replace("-", " ")
# Remove known acronyms as standalone words
words = t.split()
words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
return " ".join(words)| """Convert title to comparable simplified form. | |
| Removes known acronyms (as standalone words) for duplicate detection. | |
| Expand ACRONYMS_TO_REMOVE as needed. | |
| """ | |
| if not t: | |
| return "" | |
| t = t.lower().strip() | |
| t = t.replace("(", "").replace(")", "") | |
| t = t.replace("-", " ") | |
| # Remove known acronyms as standalone words | |
| words = t.split() | |
| words = [w for w in words if w not in ACRONYMS_TO_REMOVE] | |
| return " ".join(words) | |
| """Convert title to comparable simplified form. | |
| Removes known acronyms (as standalone words) for duplicate detection. | |
| Expand ACRONYMS_TO_REMOVE as needed. | |
| """ | |
| if not t: | |
| return "" | |
| t = t.lower().strip() | |
| t = t.replace("(", "").replace(")", "") | |
| t = t.replace("-", " ") | |
| # Remove known acronyms as standalone words | |
| words = t.split() | |
| words = [w for w in words if w not in ACRONYMS_TO_REMOVE] | |
| return " ".join(words) | |
| TOPIC_ROOT = Path(__file__).parent.parent / "src" / "topics" | ||
|
|
||
|
|
||
| def extract_frontmatter_and_content(text: str): |
There was a problem hiding this comment.
Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a tuple of dict | None and str. For consistency with the existing codebase (see extract_h1_title in scripts/add_frontmatter_to_topics.py), add return type annotation:
def extract_frontmatter_and_content(text: str) -> tuple[dict | None, str]:| def extract_frontmatter_and_content(text: str): | |
| def extract_frontmatter_and_content(text: str) -> tuple[dict | None, str]: |
| return frontmatter, content | ||
|
|
||
|
|
||
| def extract_h1_title(content: str): |
There was a problem hiding this comment.
Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a str | None. For consistency with the existing codebase pattern, add return type annotation:
def extract_h1_title(content: str) -> str | None:| def extract_h1_title(content: str): | |
| def extract_h1_title(content: str) -> str | None: |
| return m.group(1).strip() if m else None | ||
|
|
||
|
|
||
| def load_all_topics(): |
There was a problem hiding this comment.
Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a list[dict]. For consistency with the existing codebase pattern, add return type annotation:
def load_all_topics() -> list[dict]:| def load_all_topics(): | |
| def load_all_topics() -> list[dict]: |
|
|
||
| # Known acronyms to remove from topic titles (as standalone words) | ||
| ACRONYMS_TO_REMOVE = {"xp", "tdd"} | ||
| def normalize_title(t: str): |
There was a problem hiding this comment.
Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a str. For consistency with the existing codebase pattern, add return type annotation:
def normalize_title(t: str) -> str:| def normalize_title(t: str): | |
| def normalize_title(t: str) -> str: |
Check duplicate script for topics, also added in pre commit hook