feat(extract): add Markdown structural extraction (.md/.mdx) + sync collect_files extensions with _DISPATCH #711
Merged
safishamsi merged 3 commits into safishamsi:v7 on May 6, 2026
…s extensions

1. NEW: extract_markdown() structurally indexes .md/.mdx files into the knowledge graph. Headings become nodes, code blocks become nodes with language tags, and nesting produces 'contains' edges. Zero new deps (pure regex/line-by-line parsing, no tree-sitter needed).
2. FIX: collect_files() _EXTENSIONS was hardcoded and missing 18 extensions that _DISPATCH already supported (.jsx, .mjs, .ex, .exs, .jl, .vue, .svelte, .dart, .v, .sv, .sql, .f, .F, .f90, etc). Now uses set(_DISPATCH.keys()) to stay automatically in sync.
3. Added deploy_guide.md test fixture and 6 new test cases.
4. Updated test_collect_files_from_dir to use dynamic extension set.
Summary
Two improvements to the extraction pipeline:
1. Markdown Structural Extraction (NEW)
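Added extract_markdown(), a lightweight, zero-dependency extractor that structurally indexes .md and .mdx files into the knowledge graph.
What gets extracted:
- Headings become nodes.
- Code blocks become nodes with language tags.
- Nesting produces 'contains' edges.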
Why: Markdown files (READMEs, deploy guides, ADRs) contain critical architectural knowledge that was previously invisible to the graph. graphify query "deploy" would miss deploy.md entirely.
Implementation: Pure regex/line-by-line parsing - no tree-sitter dependency, no new packages.
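To make the line-by-line approach concrete, here is a minimal sketch of how such an extractor could work. It is not the code from this PR: the function name extract_markdown_sketch, the node and edge shapes, and the id scheme are all assumptions made for illustration. The sample uses ~~~ fences only so that it can sit inside this code block.

```python
import re

# Hypothetical sketch, not the PR's extract_markdown(): the node/edge
# shapes and id scheme below are assumptions for illustration only.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")
FENCE_RE = re.compile(r"^(`{3,}|~{3,})\s*([\w+-]*)")


def extract_markdown_sketch(text, file_id="doc"):
    """Return (nodes, edges) for headings and fenced code blocks."""
    nodes = [{"id": file_id, "kind": "file", "label": file_id}]
    edges = []
    stack = [(0, file_id)]  # (heading level, node id); the file is level 0
    fence = None            # marker of the currently open fence, if any

    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()

        if fence is not None:
            # Inside a code block: only look for the closing fence, so
            # '#' comments in code are never mistaken for headings.
            if stripped.startswith(fence) and set(stripped) == {fence[0]}:
                fence = None
            continue

        opener = FENCE_RE.match(stripped)
        if opener:
            fence = opener.group(1)
            node_id = f"{file_id}:code:{lineno}"
            nodes.append({"id": node_id, "kind": "code_block",
                          "lang": opener.group(2) or None})
            edges.append((stack[-1][1], "contains", node_id))
            continue

        heading = HEADING_RE.match(line)
        if heading:
            level = len(heading.group(1))
            # Pop back to the nearest shallower heading: that is the parent.
            while stack[-1][0] >= level:
                stack.pop()
            node_id = f"{file_id}:h:{lineno}"
            nodes.append({"id": node_id, "kind": "heading",
                          "level": level, "label": heading.group(2)})
            edges.append((stack[-1][1], "contains", node_id))
            stack.append((level, node_id))

    return nodes, edges


SAMPLE = (
    "# Deploy Guide\n\nIntro.\n\n## Steps\n\n"
    "~~~bash\n# not a heading\nmake deploy\n~~~\n\n## Rollback\n"
)
nodes, edges = extract_markdown_sketch(SAMPLE)
```

The heading stack is what turns nesting into 'contains' edges: each new heading or code block attaches to whatever heading is currently on top of the stack.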
2. Sync collect_files Extensions with _DISPATCH (BUG FIX)
collect_files() had a hardcoded _EXTENSIONS set that was missing 18 extensions already supported by _DISPATCH:
.jsx, .mjs, .ex, .exs, .jl, .vue, .svelte, .dart, .v, .sv, .sql, .f, .F, .f90, .F90, .f95, .F95, .f03, .F03, .f08, .F08
Files with these extensions were silently skipped during indexing even though extractors existed for them.
Fix: Replaced the hardcoded set with set(_DISPATCH.keys()) so it stays automatically in sync as new languages are added.
Tests
Files Changed