
feat(extract): add Markdown structural extraction (.md/.mdx) + sync collect_files extensions with _DISPATCH #711

Merged
safishamsi merged 3 commits into safishamsi:v7 from imfarhanm:feat/md-extraction-and-extensions-sync
May 6, 2026


@imfarhanm
Contributor

Summary

Two improvements to the extraction pipeline:

1. Markdown Structural Extraction (NEW)

Added extract_markdown() - a lightweight, zero-dependency extractor that structurally indexes .md and .mdx files into the knowledge graph.

What gets extracted:

  • Headings (#, ##, ###, etc.) -> become graph nodes
  • Fenced code blocks (```) -> become nodes with language tags (e.g., code:bash, code:python)
  • Nesting hierarchy -> heading -> sub-heading and heading -> code-block produce contains edges

Why: Markdown files (READMEs, deploy guides, ADRs) contain critical architectural knowledge that was previously invisible to the graph. For example, graphify query "deploy" would miss deploy.md entirely.

Implementation: Pure regex/line-by-line parsing - no tree-sitter dependency, no new packages.
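The line-by-line approach described above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the function name extract_markdown_sketch, the node and edge dictionary shapes, and the ID format are all assumptions made for the example, and the real extract_markdown() may differ.

```python
import re

# Assumed patterns: ATX headings (# to ######) and triple-backtick fences only.
_HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
_FENCE_RE = re.compile(r"^\s*```\s*([\w+-]*)")


def extract_markdown_sketch(text: str, source: str = "doc.md") -> dict:
    """Return {"nodes": [...], "edges": [...]} for one Markdown document.

    The node/edge shapes are hypothetical and chosen only for illustration.
    """
    nodes, edges = [], []
    stack = []        # (level, node_id) for each currently open heading
    in_fence = False  # True while inside a fenced code block

    for lineno, line in enumerate(text.splitlines(), start=1):
        if in_fence:
            # Ignore everything inside a block (e.g. "# comment" in bash)
            # until the closing fence.
            if line.strip().startswith("```"):
                in_fence = False
            continue

        fence = _FENCE_RE.match(line)
        if fence:
            in_fence = True
            node_id = f"{source}:L{lineno}"
            lang = fence.group(1) or "text"
            nodes.append({"id": node_id, "label": f"code:{lang}", "kind": "code"})
            if stack:  # attach the block to the nearest enclosing heading
                edges.append({"source": stack[-1][1], "target": node_id,
                              "relation": "contains"})
            continue

        heading = _HEADING_RE.match(line)
        if heading:
            level = len(heading.group(1))
            # Close headings at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            node_id = f"{source}:L{lineno}"
            nodes.append({"id": node_id, "label": heading.group(2), "kind": "heading"})
            if stack:
                edges.append({"source": stack[-1][1], "target": node_id,
                              "relation": "contains"})
            stack.append((level, node_id))

    return {"nodes": nodes, "edges": edges}
```

Tracking the fence state before testing for headings matters here: without it, shell comments inside a code block would be indexed as headings.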

2. Sync collect_files Extensions with _DISPATCH (BUG FIX)

collect_files() had a hardcoded _EXTENSIONS set that was missing 18 extensions already supported by _DISPATCH:

.jsx, .mjs, .ex, .exs, .jl, .vue, .svelte, .dart, .v, .sv, .sql, .f, .F, .f90, .F90, .f95, .F95, .f03, .F03, .f08, .F08

Files with these extensions were silently skipped during indexing even though extractors existed for them.

Fix: Replaced the hardcoded set with set(_DISPATCH.keys()) so it stays automatically in sync as new languages are added.
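A minimal sketch of the idea behind the fix, assuming _DISPATCH maps file suffixes to extractor callables. The abbreviated table, the _extract_stub placeholder, and collect_files_sketch are hypothetical stand-ins for illustration, not the actual graphify code.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def _extract_stub(path: Path) -> dict:
    """Placeholder extractor (hypothetical) so the table below has values."""
    return {"nodes": [], "edges": []}


# Abbreviated, assumed dispatch table: suffix -> extractor.
_DISPATCH: Dict[str, Callable[[Path], dict]] = {
    ".py": _extract_stub,
    ".jsx": _extract_stub,
    ".md": _extract_stub,
    ".F90": _extract_stub,
}


def collect_files_sketch(root: Path) -> List[Path]:
    """Collect every file whose suffix has a registered extractor."""
    # Derived from the dispatch table, so a newly registered suffix is
    # picked up without editing a second, hardcoded list.
    extensions = set(_DISPATCH.keys())
    return sorted(p for p in root.rglob("*")
                  if p.is_file() and p.suffix in extensions)


# Usage: only files with a registered suffix are returned.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("app.jsx", "deploy.md", "solver.F90", "notes.txt"):
        (root / name).write_text("")
    found = [p.name for p in collect_files_sketch(root)]
```

Note that the suffix comparison in this sketch is case-sensitive, which is why a table like this would need separate entries for variants such as .f90 and .F90.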

Tests

  • 6 new test cases for markdown extraction (all passing)
  • Updated test_collect_files_from_dir to use dynamic extension set
  • Full test suite: 153 passed, 0 regressions (1 pre-existing Fortran .F90 failure)

Files Changed

| File | Change |
| --- | --- |
| graphify/extract.py | Added extract_markdown(), registered .md/.mdx in _DISPATCH, replaced hardcoded _EXTENSIONS |
| tests/fixtures/deploy_guide.md | New test fixture |
| tests/test_languages.py | 6 new markdown tests |
| tests/test_extract.py | Updated test_collect_files_from_dir |

imfarhanm added 3 commits May 4, 2026 19:08
…s extensions

1. NEW: extract_markdown() — structurally indexes .md/.mdx files into the
   knowledge graph. Headings become nodes, code blocks become nodes with
   language tags, and nesting produces 'contains' edges. Zero new deps
   (pure regex/line-by-line parsing, no tree-sitter needed).

2. FIX: collect_files() _EXTENSIONS was hardcoded and missing 18 extensions
   that _DISPATCH already supported (.jsx, .mjs, .ex, .exs, .jl, .vue,
   .svelte, .dart, .v, .sv, .sql, .f, .F, .f90, etc). Now uses
   set(_DISPATCH.keys()) to stay automatically in sync.

3. Added deploy_guide.md test fixture and 6 new test cases.
4. Updated test_collect_files_from_dir to use dynamic extension set.