Simplify Extended_ast vs Std_ast by hhugo · Pull Request #2791 · ocaml-ppx/ocamlformat

hhugo · 2026-03-26T03:09:13Z

While working on adding support for mll and mly files. I got very confused with the distinction between Extented_ast and Std_ast.

I started following what was done for Repl_file and Documentation in Std_ast, that is, "no support". But then realized that this mean that we don't check the ast is preserved. (#2557 (comment))

I figured it was more natural to keep the use of Std_ast an implementation detail of the equivalence check for some of the file format in Extended_ast only.

This is an attempt at improving the current situation.

Commits

Remove unused functions from Extended_ast and Normalize_extended_ast — drop equal_core_type, equal, Normalize_extended_ast.equal/ast
Break Source -> Extended_ast dependency cycle — open Parsetree instead of open Extended_ast
Break Normalize_std_ast -> Normalize_extended_ast dependency — make normalize_code a parameter instead of importing it
Add paired GADT to Extended_ast — ('a, 'b) paired record, Parse.ast produces both ASTs, update Fmt_ast/tests
Add equivalent, dump, dump_normalized, get_std — AST preservation check using embedded std ASTs
Absorb Parse_with_comments + simplify Translation_unit — move parsing pipeline into Extended_ast, delete Parse_with_comments, remove std_fg/std_parsed threading
Clean up Std_ast — remove Repl_file, Documentation, of_syntax, any_t
Document why Std_ast exists — add docstring explaining its role in semantic preservation

Summary

Make Extended_ast the single entry point for parsing, validation, and AST operations
Embed both extended and standard ASTs in a ('a, 'b) paired record type, so the standard AST is produced alongside the extended one during parsing
Absorb Parse_with_comments into Extended_ast — Documentation bypasses the OCaml pipeline entirely (no hash-bang, w50, or token collection)
Add Extended_ast.equivalent to check AST preservation, dump/dump_normalized for debug output
Simplify Translation_unit: remove std_fg/std_parsed threading, no longer references Std_ast directly
Clean up Std_ast: remove unused constructors (Repl_file, Documentation), of_syntax, comment collection
Break dependency cycles (Source -> Extended_ast, Normalize_std_ast -> Normalize_extended_ast)
Remove dead code (Extended_ast.equal, equal_core_type, Normalize_extended_ast.equal/ast)

Test plan

`dune runtest` passes (204/204 tests)
No behavior change for formatted output
`printast` tool updated to use new API

hhugo · 2026-03-26T03:28:20Z

lib/Extended_ast.ml

+    (new_v : a) : ast_check_result =
+  match get_std_pair fg old_v new_v with
+  | None ->
+      (* TODO: Repl_file and Documentation have no std AST, so we skip the


I believe this is a preexisting bug

hhugo · 2026-03-26T03:29:38Z

lib/Extended_ast.ml

+let dump (type a) (fg : a t) fmt (v : a) =
+  match get_std fg v with
+  | Some (Std_value (std_fg, std_v)) -> Std_ast.Printast.ast std_fg fmt std_v
+  | None -> Printast.ast fg fmt v


We now properly dump ast for Documentation and Repl_file

hhugo · 2026-03-26T03:30:57Z

lib/Parse_with_comments.ml

-          List.map ~f:mk_cmt (Lexer.comments ())
-        in
-        let tokens =
-          (* mld files can not always be lexed using the ocaml lexer *)


This is no longer true, parsing documentation no longer goes through this help function.

EmileTrotignon · 2026-03-26T09:56:54Z

I know this part of the code was a bit convoluted, thanks for trying to simplify it. Looks a bit hard to review though

Remove `equal_core_type` and `equal` from Extended_ast, and `ast` and `equal` from Normalize_extended_ast. These functions are not used anywhere in the codebase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Julow · 2026-03-26T11:14:23Z

lib/Extended_ast.mli

+(** A parsed fragment carrying both the extended AST (used for formatting)
+    and the standard AST (used to verify formatting preserves semantics,
+    see {!Std_ast}). *)
+type ('a, 'b) paired = {extended: 'a; std: 'b}


I'm not sure to understand what this type provides. Do we want to keep the two AST paired anywhere outside of Translation_unit ?
What are the benefits from moving the code from Parse_with_comments into Extended_ast ?

Parse_with_comments.parse should only be used for "ocaml code", code parsed by the ocaml parser/lexer. It should not be used for Documentation for instance (it's wrong in the main branch).

We need to dispatch on the syntax to call it only when appropriate, the dispatch is already done in Extended_ast, it feels natural to move it there.

I'm not sure to understand what this type provides. Do we want to keep the two AST paired anywhere outside of Translation_unit ?

I've been trying to minimize behavior changes in the PR.
We could probably parse the Std_ast only when needed in equivalent, saving the parsing for files already properly formatted. If we go that direction, we would need to have access to the source as string so that we can parse the std_ast.

We do have the source as string in Translation_unit, both before and after formatting. Perhaps Parse_with_comments could be extended with the dispatch ? (and possibly renamed)

I can possible move back some parsing logic into Parse_with_comments. But I'm not sure it gives much benefit.

I've added back Parse_with_comments.. Let me know what you think

I'm not sure to see the point of keeping both Parse_with_comments and the paired type. I think I prefer to keep Parse_with_comments because otherwise, we just add a lot of code into Extended_ast that in fact work with something else than the extended ast.

I don't think the equivalent function make much sense. It should be 'a Std_ast.t -> ... -> 'a -> 'a -> ast_check_result and not work with the extended AST at all.
It also can reasonably be string -> string -> .. and we could unify Std_ast.t and Extended_ast.t:

val ext_parse : ('ext, 'std) t -> string -> 'ext val std_parse : ('ext, 'std) t -> string -> 'std val equivalent : ('ext, 'std) t -> ... -> string -> string (* or 'std -> 'std -> *) -> ast_check_result

The equivalent signature you propose forces to parse the same string multiple time when iterating to find a fix point. This is something Ive optimized long time ago in the early days of ocamlformat.

equivalent can't take a 'std because not all format have one.

Extracting comment is very much linked to the implementation of the parser. It seems unnatural to have the logic spread across to files.

Parse_with_comment and paired are orthogonal.

because otherwise, we just add a lot of code into Extended_ast that in fact work with something else than the extended ast.

The Std_ast is pretty much an implementation detail of the check of some of the file format

@Julow, think more about this. I wonder if we should inline fields from the "with_comment" record into Extended_ast.t.

the prefix is only relevent for use_file (and structure maybe)

comments is not relevent for documentation I think

Source is a bit weird because it deal with "ocaml" token. which we don't exactly have/want for mll mly files for instance.

The change would mean that an Extended_ast.t contains every thing needed for printing and checking.

Source.ml only needs Parsetree types (locations, tokens), not the full Extended_ast module. Use `open Parsetree` directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Make `normalize_code` a parameter of Normalize_std_ast functions instead of importing it from Normalize_extended_ast. This decouples the two normalization modules, with callers responsible for passing the normalize_code function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extended_ast.t GADT now uses paired wrapper types that embed both the extended and standard ASTs: `('a, 'b) paired = {extended: 'a; std: 'b}`. For the 7 OCaml fragment types (Structure, Signature, Use_file, Core_type, Module_type, Expression, Pattern), parsing produces both ASTs in one operation. Repl_file and Documentation have no std counterpart. Parse.ast now internally calls Std_ast.Parse.ast. All code that consumes the AST (map, Printast.ast, Fmt_ast, Cmts) works through the paired type, accessing .extended for formatting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add functions that use the embedded std ASTs from the paired GADT: - `get_std`: extract the std AST with its Std_ast.t witness - `dump` / `dump_normalized`: print std AST for debug output - `equivalent`: check AST preservation using the std ASTs, returning Ast_preserved | Docstrings_moved _ | Ast_changed These will replace the manual Normalize_std_ast calls in Translation_unit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move the parsing pipeline (warnings, hash-bang, tokens, Source.t creation) from Parse_with_comments into Extended_ast: - Extended_ast.parse is the full parsing pipeline - Extended_ast.parse_toplevel dispatches Use_file vs Repl_file - Extended_ast.Parsed.t replaces Parse_with_comments.with_comments - Extended_ast.Warning50 replaces Parse_with_comments.Warning50 - Delete Parse_with_comments.ml/.mli Simplify Translation_unit: remove std_fg/std_parsed parameters from format and parse_and_format, use Extended_ast.equivalent for AST validation. Translation_unit no longer references Std_ast directly. Rewrite printast tool to use the unified Extended_ast.parse API. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Std_ast becomes a purely internal implementation detail. Remove constructors and functions that are no longer needed now that Translation_unit no longer references Std_ast directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add docstring explaining that the standard parser is used to verify formatting preserves semantics: two programs are equivalent if the compiler parses them identically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hhugo · 2026-03-26T11:39:13Z

I know this part of the code was a bit convoluted, thanks for trying to simplify it. Looks a bit hard to review though

The PR is now split into more digest commits

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Re-create Parse_with_comments with the parsing pipeline (Parsed type, Warning50 exception, parse, parse_toplevel) that was absorbed into Extended_ast. Re-export Extended_ast.Parse in the .mli. Update all callers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hhugo commented Mar 26, 2026

View reviewed changes

Julow reviewed Mar 26, 2026

View reviewed changes

hhugo and others added 7 commits March 26, 2026 12:14

Break Source -> Extended_ast dependency cycle

aa17389

Source.ml only needs Parsetree types (locations, tokens), not the full Extended_ast module. Use `open Parsetree` directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document why Std_ast exists

9c3fbe0

Add docstring explaining that the standard parser is used to verify formatting preserves semantics: two programs are equivalent if the compiler parses them identically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hhugo force-pushed the simplify-extended-std-ast branch from e6c6c94 to 9c3fbe0 Compare March 26, 2026 11:48

hhugo and others added 2 commits March 26, 2026 12:51

Add doc comment to the paired type in Extended_ast

b2aaca5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Misc: fmt

6c9173a

hhugo mentioned this pull request Mar 26, 2026

Add ocamllex (.mll) file formatting support #2792

Draft

4 tasks

Conversation

hhugo commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Commits

Summary

Test plan

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EmileTrotignon commented Mar 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hhugo commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hhugo commented Mar 26, 2026 •

edited

Loading