Skip to content
This repository was archived by the owner on Dec 12, 2025. It is now read-only.

Adding check duplicate script, in pre commit hook#59

Open
milosbjelceviczuhlke wants to merge 7 commits intomainfrom
feature/check-duplicate-script
Open

Adding check duplicate script, in pre commit hook#59
milosbjelceviczuhlke wants to merge 7 commits intomainfrom
feature/check-duplicate-script

Conversation

@milosbjelceviczuhlke
Copy link
Copy Markdown
Collaborator

Check duplicate script for topics, also added in pre commit hook

@pburls
Copy link
Copy Markdown
Contributor

pburls commented Nov 19, 2025

This is very cool. Thanks @milosbjelceviczuhlke.

Could you also please add the duplicate check to run as part of the PR Checks here:
https://github.com/Zuehlke/archetypes/blob/main/.github/workflows/pr-checks.yml

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a duplicate topic detection script and integrates it into both the pre-commit hooks and CI workflow to prevent duplicate topics from being added to the repository.

Key Changes:

  • New Python script to detect duplicate slugs, duplicate titles, and similar titles using fuzzy matching
  • Integration into pre-commit hooks for local validation
  • Integration into GitHub Actions PR checks workflow

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
scripts/check_duplicate_topics.py New duplicate detection script that checks for duplicate slugs, titles, and fuzzy-matches similar titles
justfile Adds check-duplicates command to run the duplicate detection script
.pre-commit-config.yaml Adds pre-commit hook to run duplicate topic check before commits
.github/workflows/pr-checks.yml Adds duplicate check step to CI workflow to validate PRs

Copy link
Copy Markdown
Contributor

@pburls pburls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the PR check.

Please take a look at the suggestions from copilot.

MilosBjelcevic167 and others added 4 commits November 21, 2025 18:24
* content: Create FSI Software Engineer achetype and transfer content into it (#42).
Add content linked to external knowledge base and structured based on the Dreyfus model.
Add topics for supporting material for the archetype.

* Change Financial crimes to level-3 heading

* content: Re-factored FSI Software Engineer archetype to conform to data architecture

* Update frontmatter structure for FSI archetype across all topics to enhance consistency and usability; Implement new macros for rendering descriptions and skill stages in archetype and topic files.

* Undo set shell in justfile
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Comment on lines +99 to +100
norm_to_original[nt] = t["title"]

Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential issue: The norm_to_original dictionary may not always map normalized titles correctly when multiple topics have the same normalized title. If two topics normalize to the same value, only the first one's original title is stored (line 99), but when reporting similar topics (lines 131-132), both could reference the same original title instead of their actual original titles.

Consider changing the data structure to track all original titles for each normalized title:

norm_to_original = {}
for t in topics:
    nt = normalize_title(t["title"])
    if nt not in norm_to_original:
        norm_to_original[nt] = []
    if t["title"] not in norm_to_original[nt]:
        norm_to_original[nt].append(t["title"])

Then when displaying, use the first occurrence:

f"   - '{norm_to_original[a][0]}'\n"
Suggested change
norm_to_original[nt] = t["title"]
norm_to_original[nt] = []
if t["title"] not in norm_to_original[nt]:
norm_to_original[nt].append(t["title"])

Copilot uses AI. Check for mistakes.
Comment on lines +71 to +86

"""Convert title to comparable simplified form.
Removes known acronyms (as standalone words) for duplicate detection.
Expand ACRONYMS_TO_REMOVE as needed.
"""

if not t:
return ""
t = t.lower().strip()
t = t.replace("(", "").replace(")", "")
t = t.replace("-", " ")
# Remove known acronyms as standalone words
words = t.split()
words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
return " ".join(words)

Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent indentation: The function body is incorrectly indented with 8 spaces instead of the standard 4 spaces. The docstring and function implementation should align with standard Python indentation conventions.

def normalize_title(t: str):
    """Convert title to comparable simplified form.
    Removes known acronyms (as standalone words) for duplicate detection.
    Expand ACRONYMS_TO_REMOVE as needed.
    """
    
    if not t:
        return ""
    t = t.lower().strip()
    t = t.replace("(", "").replace(")", "")
    t = t.replace("-", " ")
    # Remove known acronyms as standalone words
    words = t.split()
    words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
    return " ".join(words)
Suggested change
"""Convert title to comparable simplified form.
Removes known acronyms (as standalone words) for duplicate detection.
Expand ACRONYMS_TO_REMOVE as needed.
"""
if not t:
return ""
t = t.lower().strip()
t = t.replace("(", "").replace(")", "")
t = t.replace("-", " ")
# Remove known acronyms as standalone words
words = t.split()
words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
return " ".join(words)
"""Convert title to comparable simplified form.
Removes known acronyms (as standalone words) for duplicate detection.
Expand ACRONYMS_TO_REMOVE as needed.
"""
if not t:
return ""
t = t.lower().strip()
t = t.replace("(", "").replace(")", "")
t = t.replace("-", " ")
# Remove known acronyms as standalone words
words = t.split()
words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
return " ".join(words)

Copilot uses AI. Check for mistakes.
TOPIC_ROOT = Path(__file__).parent.parent / "src" / "topics"


def extract_frontmatter_and_content(text: str):
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a tuple of dict | None and str. For consistency with the existing codebase (see extract_h1_title in scripts/add_frontmatter_to_topics.py), add return type annotation:

def extract_frontmatter_and_content(text: str) -> tuple[dict | None, str]:
Suggested change
def extract_frontmatter_and_content(text: str):
def extract_frontmatter_and_content(text: str) -> tuple[dict | None, str]:

Copilot uses AI. Check for mistakes.
return frontmatter, content


def extract_h1_title(content: str):
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a str | None. For consistency with the existing codebase pattern, add return type annotation:

def extract_h1_title(content: str) -> str | None:
Suggested change
def extract_h1_title(content: str):
def extract_h1_title(content: str) -> str | None:

Copilot uses AI. Check for mistakes.
return m.group(1).strip() if m else None


def load_all_topics():
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a list[dict]. For consistency with the existing codebase pattern, add return type annotation:

def load_all_topics() -> list[dict]:
Suggested change
def load_all_topics():
def load_all_topics() -> list[dict]:

Copilot uses AI. Check for mistakes.

# Known acronyms to remove from topic titles (as standalone words)
ACRONYMS_TO_REMOVE = {"xp", "tdd"}
def normalize_title(t: str):
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a str. For consistency with the existing codebase pattern, add return type annotation:

def normalize_title(t: str) -> str:
Suggested change
def normalize_title(t: str):
def normalize_title(t: str) -> str:

Copilot uses AI. Check for mistakes.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants