Adding check duplicate script, in pre commit hook by milosbjelceviczuhlke · Pull Request #59 · Zuehlke/archetypes

milosbjelceviczuhlke · 2025-11-19T09:18:00Z

Adds a script that checks topics for duplicates and runs it from the pre-commit hook.

pburls · 2025-11-19T11:07:22Z

This is very cool. Thanks @milosbjelceviczuhlke.

Could you also please add the duplicate check to run as part of the PR Checks here:
https://github.com/Zuehlke/archetypes/blob/main/.github/workflows/pr-checks.yml

Copilot

Pull Request Overview

This PR adds a duplicate topic detection script and integrates it into both the pre-commit hooks and CI workflow to prevent duplicate topics from being added to the repository.

Key Changes:

- New Python script to detect duplicate slugs, duplicate titles, and similar titles using fuzzy matching
- Integration into pre-commit hooks for local validation
- Integration into GitHub Actions PR checks workflow
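
As a rough illustration of the three checks listed above, the sketch below flags exact duplicate slugs and titles and then compares the remaining titles pairwise with `difflib.SequenceMatcher`. This page does not show which fuzzy-matching library or threshold the script uses, so `difflib`, the `0.85` threshold, and the name `find_duplicates` are assumptions for illustration only.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical threshold; the value used by the real script is not shown in this PR.
SIMILARITY_THRESHOLD = 0.85


def find_duplicates(topics: list[dict]) -> dict:
    """Return exact duplicate slugs/titles and pairs of similar titles."""
    slug_counts = Counter(t["slug"] for t in topics)
    title_counts = Counter(t["title"].lower() for t in topics)

    # Compare each distinct title with every other distinct title once.
    unique_titles = sorted(title_counts)
    similar = [
        (a, b)
        for a, b in combinations(unique_titles, 2)
        if SequenceMatcher(None, a, b).ratio() >= SIMILARITY_THRESHOLD
    ]
    return {
        "duplicate_slugs": sorted(s for s, n in slug_counts.items() if n > 1),
        "duplicate_titles": sorted(t for t, n in title_counts.items() if n > 1),
        "similar_titles": similar,
    }


topics = [
    {"slug": "tdd", "title": "Test Driven Development"},
    {"slug": "tdd", "title": "Test-Driven Development"},
    {"slug": "pairing", "title": "Pair Programming"},
]
report = find_duplicates(topics)
```

Here the two TDD topics share a slug and differ only by a hyphen in the title, so they show up both as a duplicate slug and as a similar-title pair, while "Pair Programming" is left alone.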

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| scripts/check_duplicate_topics.py | New duplicate detection script that checks for duplicate slugs, titles, and fuzzy-matches similar titles |
| justfile | Adds `check-duplicates` command to run the duplicate detection script |
| .pre-commit-config.yaml | Adds pre-commit hook to run duplicate topic check before commits |
| .github/workflows/pr-checks.yml | Adds duplicate check step to CI workflow to validate PRs |

pburls

Thanks for adding the PR check.

Please take a look at the suggestions from copilot.

* content: Create FSI Software Engineer archetype and transfer content into it (#42). Add content linked to external knowledge base and structured based on the Dreyfus model. Add topics for supporting material for the archetype.
* Change Financial crimes to level-3 heading
* content: Refactored FSI Software Engineer archetype to conform to data architecture
* Update frontmatter structure for FSI archetype across all topics to enhance consistency and usability; implement new macros for rendering descriptions and skill stages in archetype and topic files.
* Undo set shell in justfile

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Copilot · 2025-12-10T08:33:34Z

scripts/check_duplicate_topics.py

+            norm_to_original[nt] = t["title"]
+


Potential issue: The norm_to_original dictionary may not always map normalized titles correctly when multiple topics have the same normalized title. If two topics normalize to the same value, only the first one's original title is stored (line 99), but when reporting similar topics (lines 131-132), both could reference the same original title instead of their actual original titles.

Consider changing the data structure to track all original titles for each normalized title:

norm_to_original = {}
for t in topics:
    nt = normalize_title(t["title"])
    if nt not in norm_to_original:
        norm_to_original[nt] = []
    if t["title"] not in norm_to_original[nt]:
        norm_to_original[nt].append(t["title"])

Then when displaying, use the first occurrence:

f" - '{norm_to_original[a][0]}'\n"

Suggested change

-            norm_to_original[nt] = t["title"]
+            norm_to_original[nt] = []
+        if t["title"] not in norm_to_original[nt]:
+            norm_to_original[nt].append(t["title"])
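
To make the concern concrete, here is a minimal sketch of the list-based mapping, using `collections.defaultdict` in place of the explicit `if nt not in norm_to_original` check from the suggestion. The simplified `normalize_title` stub, the helper name `group_originals`, and the sample topics are hypothetical stand-ins, not code from the PR.

```python
from collections import defaultdict


def normalize_title(title: str) -> str:
    # Simplified stand-in for the script's normalizer (assumption).
    return title.lower().replace("-", " ").strip()


def group_originals(topics: list[dict]) -> dict:
    """Map each normalized title to every distinct original title."""
    norm_to_original = defaultdict(list)
    for t in topics:
        nt = normalize_title(t["title"])
        # Keep each original spelling once, in first-seen order.
        if t["title"] not in norm_to_original[nt]:
            norm_to_original[nt].append(t["title"])
    return dict(norm_to_original)


topics = [
    {"title": "Test-Driven Development"},
    {"title": "Test Driven Development"},
    {"title": "Pair Programming"},
]
grouped = group_originals(topics)
```

With a plain string-valued dictionary, only one of the two TDD spellings could be stored under the shared normalized key; with lists, both survive and a report can name each one.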

Copilot · 2025-12-10T08:33:34Z

scripts/check_duplicate_topics.py

+
+        """Convert title to comparable simplified form.
+        Removes known acronyms (as standalone words) for duplicate detection.
+        Expand ACRONYMS_TO_REMOVE as needed.
+        """
+
+        if not t:
+            return ""
+        t = t.lower().strip()
+        t = t.replace("(", "").replace(")", "")
+        t = t.replace("-", " ")
+        # Remove known acronyms as standalone words
+        words = t.split()
+        words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
+        return " ".join(words)
+


Inconsistent indentation: The function body is incorrectly indented with 8 spaces instead of the standard 4 spaces. The docstring and function implementation should align with standard Python indentation conventions.

def normalize_title(t: str):
    """Convert title to comparable simplified form.
    Removes known acronyms (as standalone words) for duplicate detection.
    Expand ACRONYMS_TO_REMOVE as needed.
    """
    if not t:
        return ""
    t = t.lower().strip()
    t = t.replace("(", "").replace(")", "")
    t = t.replace("-", " ")
    # Remove known acronyms as standalone words
    words = t.split()
    words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
    return " ".join(words)

Suggested change

-        """Convert title to comparable simplified form.
-        Removes known acronyms (as standalone words) for duplicate detection.
-        Expand ACRONYMS_TO_REMOVE as needed.
-        """
-        if not t:
-            return ""
-        t = t.lower().strip()
-        t = t.replace("(", "").replace(")", "")
-        t = t.replace("-", " ")
-        # Remove known acronyms as standalone words
-        words = t.split()
-        words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
-        return " ".join(words)
+    """Convert title to comparable simplified form.
+    Removes known acronyms (as standalone words) for duplicate detection.
+    Expand ACRONYMS_TO_REMOVE as needed.
+    """
+    if not t:
+        return ""
+    t = t.lower().strip()
+    t = t.replace("(", "").replace(")", "")
+    t = t.replace("-", " ")
+    # Remove known acronyms as standalone words
+    words = t.split()
+    words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
+    return " ".join(words)
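
As a quick check of what the re-indented function does, the snippet below runs it on a sample title. The function body and the `ACRONYMS_TO_REMOVE` set are copied from the diff in this PR, and the `-> str` annotation comes from another comment in this review; the sample title is made up for illustration.

```python
# Known acronyms to remove from topic titles (as standalone words)
ACRONYMS_TO_REMOVE = {"xp", "tdd"}


def normalize_title(t: str) -> str:
    """Convert title to comparable simplified form.
    Removes known acronyms (as standalone words) for duplicate detection.
    Expand ACRONYMS_TO_REMOVE as needed.
    """
    if not t:
        return ""
    t = t.lower().strip()
    t = t.replace("(", "").replace(")", "")
    t = t.replace("-", " ")
    # Remove known acronyms as standalone words
    words = t.split()
    words = [w for w in words if w not in ACRONYMS_TO_REMOVE]
    return " ".join(words)


# Hypothetical sample title: parentheses, hyphen, and the acronym are all stripped.
example = normalize_title("Test-Driven Development (TDD)")
print(example)  # test driven development
```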

Copilot · 2025-12-10T08:33:34Z

scripts/check_duplicate_topics.py

+TOPIC_ROOT = Path(__file__).parent.parent / "src" / "topics"
+
+
+def extract_frontmatter_and_content(text: str):


Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a tuple of dict | None and str. For consistency with the existing codebase (see extract_h1_title in scripts/add_frontmatter_to_topics.py), add return type annotation:

def extract_frontmatter_and_content(text: str) -> tuple[dict | None, str]:

Suggested change

-def extract_frontmatter_and_content(text: str):
+def extract_frontmatter_and_content(text: str) -> tuple[dict | None, str]:

Copilot · 2025-12-10T08:33:34Z

scripts/check_duplicate_topics.py

+    return frontmatter, content
+
+
+def extract_h1_title(content: str):


Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a str | None. For consistency with the existing codebase pattern, add return type annotation:

def extract_h1_title(content: str) -> str | None:

Suggested change

-def extract_h1_title(content: str):
+def extract_h1_title(content: str) -> str | None:

Copilot · 2025-12-10T08:33:35Z

scripts/check_duplicate_topics.py

+    return m.group(1).strip() if m else None
+
+
+def load_all_topics():


Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a list[dict]. For consistency with the existing codebase pattern, add return type annotation:

def load_all_topics() -> list[dict]:

Suggested change

-def load_all_topics():
+def load_all_topics() -> list[dict]:
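
For context, a function with this return type might look roughly like the sketch below. The real `load_all_topics` takes no arguments and presumably reads from `TOPIC_ROOT`; this version takes a hypothetical `root` parameter so it can run standalone, and the file layout (`*.md` files, slug taken from the file stem, title from the first H1 line) is assumed rather than taken from the PR.

```python
import tempfile
from pathlib import Path


def load_all_topics(root: Path) -> list[dict]:
    """Collect one dict per Markdown file found under root.

    Assumption: the slug is the file stem and the title is the first H1 line.
    """
    topics = []
    for path in sorted(root.rglob("*.md")):
        title = None
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith("# "):
                title = line[2:].strip()
                break
        topics.append({"slug": path.stem, "title": title, "path": path})
    return topics


# Usage with a throwaway directory standing in for src/topics.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pairing.md").write_text("# Pair Programming\n", encoding="utf-8")
    (root / "tdd.md").write_text("# Test-Driven Development\n", encoding="utf-8")
    loaded = load_all_topics(root)
    slugs = [t["slug"] for t in loaded]
    titles = [t["title"] for t in loaded]
```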

Copilot · 2025-12-10T08:33:35Z

scripts/check_duplicate_topics.py

+
+# Known acronyms to remove from topic titles (as standalone words)
+ACRONYMS_TO_REMOVE = {"xp", "tdd"}
+def normalize_title(t: str):


Missing return type annotation: The function lacks a return type annotation. Based on the implementation, it returns a str. For consistency with the existing codebase pattern, add return type annotation:

def normalize_title(t: str) -> str:

Suggested change

-def normalize_title(t: str):
+def normalize_title(t: str) -> str:

Adding check duplicate script, in pre commit hook

7e44099

merge main and adding duplicate into pr checks

b1f33b9

milosbjelceviczuhlke requested a review from pburls November 19, 2025 17:56

Merge branch 'main' into feature/check-duplicate-script

cb6c12a

pburls requested a review from Copilot November 20, 2025 09:59

Copilot started reviewing on behalf of pburls November 20, 2025 09:59

Copilot finished reviewing on behalf of pburls November 20, 2025 10:02

Copilot AI reviewed Nov 20, 2025

pburls reviewed Nov 21, 2025

MilosBjelcevic167 and others added 4 commits November 21, 2025 18:24

pr review resolved

2a83d81

merge main

314fe1c

Merge branch 'main' into feature/check-duplicate-script

8060678

milosbjelceviczuhlke requested a review from pburls November 21, 2025 17:26

pburls requested a review from Copilot December 10, 2025 08:29

Copilot started reviewing on behalf of pburls December 10, 2025 08:30

Copilot AI reviewed Dec 10, 2025
