feat(parsers): tree-sitter AST grammars for Java, C, C++, C#, Ruby, PHP #1313

Merged
gfargo merged 1 commit into main from agent/coco-15-coco-1239-tree-sitter-ast-grammars-for-j on Jun 22, 2026
@gfargo-horizon-agent

Contributor

What

Upgrades the six regex-first structural fast-path parsers (Java, C, C++, C#, Ruby, PHP) to real tree-sitter AST extraction, matching the pattern already established for TypeScript, Python, Rust, and Go.

Why

Plane: COCO-15

How

Infrastructure (lazy-load pipeline)

  • cache.ts: extends LazyTreeSitterLanguageId with java | c | cpp | csharp | ruby | php
  • manifest.ts: adds SHA-256-pinned WASM manifest entries for all six grammars (versions pinned at edit time, hashes verified against CDN)
  • runtime.ts: wires the six new lazy cache paths into the language resolver
  • prefetch.ts: adds COCO_PREFETCH aliases — java, c, cpp/c++/cxx, cs/csharp/c#, rb/ruby, php
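
The alias handling described for prefetch.ts could look roughly like the sketch below. This is not the project's code: the comma-separated value format, the `PREFETCH_ALIASES` table, and `resolvePrefetchList` are assumptions made for illustration, and the union type is a local stand-in that lists only the six new members.

```typescript
// Local stand-in for the six members this PR adds to LazyTreeSitterLanguageId.
type LazyTreeSitterLanguageId = "java" | "c" | "cpp" | "csharp" | "ruby" | "php";

// Hypothetical alias table mirroring the aliases listed in the PR description.
// A Map is used so that inherited object keys such as "constructor" never match.
const PREFETCH_ALIASES = new Map<string, LazyTreeSitterLanguageId>([
  ["java", "java"],
  ["c", "c"],
  ["cpp", "cpp"],
  ["c++", "cpp"],
  ["cxx", "cpp"],
  ["cs", "csharp"],
  ["csharp", "csharp"],
  ["c#", "csharp"],
  ["rb", "ruby"],
  ["ruby", "ruby"],
  ["php", "php"],
]);

// Assumption: COCO_PREFETCH holds a comma-separated list such as "java,c++,rb".
// Unknown tokens are skipped, and duplicates collapse to a single language id.
function resolvePrefetchList(raw: string): LazyTreeSitterLanguageId[] {
  const resolved = new Set<LazyTreeSitterLanguageId>();
  for (const token of raw.split(",")) {
    const id = PREFETCH_ALIASES.get(token.trim().toLowerCase());
    if (id !== undefined) {
      resolved.add(id);
    }
  }
  return Array.from(resolved);
}
```

Normalising every alias to one canonical id before touching the cache means `c++` and `cxx` cannot trigger two downloads of the same grammar.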

New tree-sitter extractors (each in src/lib/parsers/default/__tree_sitter__/)

  • javaTreeSitterParser.ts: class_declaration, interface_declaration, enum_declaration, record_declaration, method_declaration, constructor_declaration; public|protected modifiers → exported
  • cCppTreeSitterParser.ts: combined C+C++ parser; prefers tree-sitter-c (smaller, 613 KB) for .c/.h files and falls back to tree-sitter-cpp (superset, 3.4 MB) for C++ extensions; handles nested declarator chains (pointer_declarator → function_declarator → identifier), qualified names (Widget::draw), template declarations
  • csTreeSitterParser.ts: class/interface/struct/record/enum/method/constructor_declaration; public|protected|internal modifier → exported
  • rubyTreeSitterParser.ts: method, singleton_method (def self.name), class, module; scope-resolution names unwrapped; all exported (Ruby has no declaration-site visibility gate)
  • phpTreeSitterParser.ts: uses tree-sitter-php_only WASM so bare PHP code snippets parse without a leading <?php tag; function_definition, method_declaration, class/interface/trait/enum_declaration; private visibility → not exported
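
The nested declarator handling mentioned for cCppTreeSitterParser.ts can be illustrated with a small walk over the `declarator` field. This is a sketch under assumptions, not the merged implementation: `NodeLike` is a minimal structural stand-in for a tree-sitter syntax node, `declaratorName` is a hypothetical helper, and wrapper nodes that do not expose a `declarator` field are not covered.

```typescript
// Minimal structural stand-in for a tree-sitter syntax node.
// Assumption: only these three members are needed for name extraction.
interface NodeLike {
  type: string;
  text: string;
  childForFieldName(name: string): NodeLike | null;
}

// Node types treated as the terminal name of a declarator chain.
const NAME_NODE_TYPES = new Set<string>([
  "identifier",
  "field_identifier",
  "qualified_identifier", // e.g. Widget::draw in C++
]);

// Hypothetical helper: follows pointer_declarator -> function_declarator -> identifier
// by repeatedly descending into the "declarator" field until a name node is reached.
// The depth limit guards against malformed or unexpectedly deep trees.
function declaratorName(node: NodeLike): string | undefined {
  let current: NodeLike | null = node;
  for (let depth = 0; current !== null && depth < 32; depth++) {
    if (NAME_NODE_TYPES.has(current.type)) {
      return current.text;
    }
    current = current.childForFieldName("declarator");
  }
  return undefined; // no name found: the caller can skip the declaration
}
```

For a definition such as `int *draw(int x)`, the function's declarator is a pointer_declarator wrapping a function_declarator wrapping the identifier, so the loop descends twice before returning `draw`.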

Registry update (structuralParserRegistry.ts)

  • Prepends each new tree-sitter parser before the existing regex fallback for java, cpp, cs, rb, php
  • All parsers surrender gracefully (return undefined) when the .wasm isn't cached — zero behaviour change for users who haven't run COCO_PREFETCH
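
The "tree-sitter first, regex fallback" ordering can be sketched as a chain in which the first parser that returns a result wins. Everything named here (`StructuralSymbol`, `StructuralParser`, `runParserChain`, the two demo parsers) is hypothetical, and the parsers are synchronous only for brevity; the real registry may be asynchronous and shaped differently.

```typescript
// Hypothetical result shape; the real registry types may differ.
interface StructuralSymbol {
  name: string;
  kind: string;
  exported: boolean;
}

// A parser returns undefined to "surrender" and let the next parser try.
type StructuralParser = (source: string) => StructuralSymbol[] | undefined;

// Runs parsers in registration order and returns the first non-undefined result.
function runParserChain(
  parsers: StructuralParser[],
  source: string,
): StructuralSymbol[] | undefined {
  for (const parser of parsers) {
    const result = parser(source);
    if (result !== undefined) {
      return result;
    }
  }
  return undefined;
}

// Stand-in for a tree-sitter parser: surrenders when its grammar is not cached.
function makeTreeSitterStub(isGrammarCached: () => boolean): StructuralParser {
  return (_source: string) => {
    if (!isGrammarCached()) {
      return undefined;
    }
    // A real extractor would walk the AST here; this stub returns a fixed symbol.
    return [{ name: "FromAst", kind: "class", exported: true }];
  };
}

// Stand-in for the pre-existing regex fast path (deliberately simplistic).
const regexClassParser: StructuralParser = (source: string) => {
  const symbols: StructuralSymbol[] = [];
  const pattern = /\b(public\s+)?class\s+(\w+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    symbols.push({ name: match[2], kind: "class", exported: match[1] !== undefined });
  }
  return symbols;
};
```

Because an uncached grammar yields `undefined` rather than an empty list, the chain reaches the regex parser and produces the same output as before the tree-sitter parser was prepended.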

WASM SHA-256 manifest pins

| Language | Package | Version | SHA-256 |
| --- | --- | --- | --- |
| Java | tree-sitter-java | 0.23.5 | 4fdeac4c… |
| C | tree-sitter-c | 0.24.1 | c852c2a8… |
| C++ | tree-sitter-cpp | 0.23.4 | 174eb0de… |
| C# | tree-sitter-c-sharp | 0.23.5 | 6f69e1ca… |
| Ruby | tree-sitter-ruby | 0.23.1 | 09a96427… |
| PHP | tree-sitter-php (php_only) | 0.24.2 | fd1bcff3… |
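
Checking a downloaded grammar against a pinned digest might look like the sketch below, which uses Node's built-in `node:crypto` module. The `GrammarManifestEntry` shape and `verifyGrammarBytes` are assumptions rather than the actual manifest.ts API. The digests above are abbreviated, so the usage example relies on the standard SHA-256 test vector for the bytes `abc` instead of a real grammar hash.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest entry shape; the real manifest.ts may store more fields.
interface GrammarManifestEntry {
  packageName: string;
  version: string;
  sha256: string; // full lowercase or uppercase hex digest
}

// Computes the SHA-256 digest of the given bytes as lowercase hex.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Returns true only when the bytes hash to exactly the pinned digest.
function verifyGrammarBytes(entry: GrammarManifestEntry, bytes: Uint8Array): boolean {
  return sha256Hex(bytes) === entry.sha256.toLowerCase();
}

// Usage with a placeholder entry pinned to the digest of the ASCII bytes "abc".
const demoEntry: GrammarManifestEntry = {
  packageName: "demo-grammar", // placeholder, not a real package
  version: "0.0.0",
  sha256: "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
};
const demoBytes = new Uint8Array([0x61, 0x62, 0x63]); // "abc"
const demoOk = verifyGrammarBytes(demoEntry, demoBytes);
```

A caller would typically refuse to cache or load a .wasm file when the check fails, so that a tampered or truncated download is never handed to the runtime.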

Testing

  • TypeScript type-check passes (tsc --noEmit --skipLibCheck)
  • ESLint passes on all modified/new files
  • All new parsers follow established registry pattern (tree-sitter first, regex fallback)
  • Graceful surrender when .wasm not cached (no behaviour change for existing users)
  • CI: pending

🤖 Generated by the harbor agent loop. Reviewed by a human before merge.

…y, PHP

Upgrades the six regex-first structural parsers to real AST extraction
via lazy-loaded tree-sitter grammars (COCO-1239 / #1239).

Infrastructure:
- Extends LazyTreeSitterLanguageId with java, c, cpp, csharp, ruby, php
- Adds SHA-256-pinned manifest entries for all six grammars
- Wires lazy cache paths into the runtime language resolver
- Adds COCO_PREFETCH aliases: java, c, cpp/c++/cxx, cs/csharp/c#, rb/ruby, php

New extractors:
- javaTreeSitterParser: class/interface/enum/record/method/constructor,
  public|protected visibility → exported
- cCppTreeSitterParser: combined C+C++ parser that tries tree-sitter-c
  for .c/.h files (smaller grammar) and tree-sitter-cpp for .cpp/.cc etc.
  (superset); handles function_definition with nested declarator chains,
  struct/class/enum/namespace/preproc_def, template declarations
- csTreeSitterParser: class/interface/struct/record/enum/method/constructor,
  public|protected|internal modifier → exported
- rubyTreeSitterParser: method/singleton_method/class/module, scope_resolution
  names unwrapped, all exported: true (Ruby has no static visibility gate)
- phpTreeSitterParser: uses tree-sitter-php_only grammar so bare PHP code
  snippets parse without a <?php tag; function/method/class/interface/trait/
  enum, private modifier → exported: false

All new parsers follow the established registry pattern: prepended before the
regex fallback; surrender gracefully (return undefined) when the .wasm isn't
cached so no behaviour change occurs for users who haven't prefetched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gfargo gfargo merged commit 7816d18 into main Jun 22, 2026
16 checks passed
@gfargo gfargo deleted the agent/coco-15-coco-1239-tree-sitter-ast-grammars-for-j branch June 22, 2026 12:10