Skip to content

Simplify Extended_ast vs Std_ast#2791

Open
hhugo wants to merge 11 commits intomainfrom
simplify-extended-std-ast
Open

Simplify Extended_ast vs Std_ast#2791
hhugo wants to merge 11 commits intomainfrom
simplify-extended-std-ast

Conversation

@hhugo
Copy link
Copy Markdown
Collaborator

@hhugo hhugo commented Mar 26, 2026

While working on adding support for mll and mly files. I got very confused with the distinction between Extented_ast and Std_ast.

I started following what was done for Repl_file and Documentation in Std_ast, that is, "no support". But then realized that this mean that we don't check the ast is preserved. (#2557 (comment))

I figured it was more natural to keep the use of Std_ast an implementation detail of the equivalence check for some of the file format in Extended_ast only.

This is an attempt at improving the current situation.

Commits

  1. Remove unused functions from Extended_ast and Normalize_extended_ast — drop equal_core_type, equal, Normalize_extended_ast.equal/ast
  2. Break Source -> Extended_ast dependency cycleopen Parsetree instead of open Extended_ast
  3. Break Normalize_std_ast -> Normalize_extended_ast dependency — make normalize_code a parameter instead of importing it
  4. Add paired GADT to Extended_ast('a, 'b) paired record, Parse.ast produces both ASTs, update Fmt_ast/tests
  5. Add equivalent, dump, dump_normalized, get_std — AST preservation check using embedded std ASTs
  6. Absorb Parse_with_comments + simplify Translation_unit — move parsing pipeline into Extended_ast, delete Parse_with_comments, remove std_fg/std_parsed threading
  7. Clean up Std_ast — remove Repl_file, Documentation, of_syntax, any_t
  8. Document why Std_ast exists — add docstring explaining its role in semantic preservation

Summary

  • Make Extended_ast the single entry point for parsing, validation, and AST operations
  • Embed both extended and standard ASTs in a ('a, 'b) paired record type, so the standard AST is produced alongside the extended one during parsing
  • Absorb Parse_with_comments into Extended_astDocumentation bypasses the OCaml pipeline entirely (no hash-bang, w50, or token collection)
  • Add Extended_ast.equivalent to check AST preservation, dump/dump_normalized for debug output
  • Simplify Translation_unit: remove std_fg/std_parsed threading, no longer references Std_ast directly
  • Clean up Std_ast: remove unused constructors (Repl_file, Documentation), of_syntax, comment collection
  • Break dependency cycles (Source -> Extended_ast, Normalize_std_ast -> Normalize_extended_ast)
  • Remove dead code (Extended_ast.equal, equal_core_type, Normalize_extended_ast.equal/ast)

Test plan

  • `dune runtest` passes (204/204 tests)
  • No behavior change for formatted output
  • `printast` tool updated to use new API

(new_v : a) : ast_check_result =
match get_std_pair fg old_v new_v with
| None ->
(* TODO: Repl_file and Documentation have no std AST, so we skip the
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is a preexisting bug

let dump (type a) (fg : a t) fmt (v : a) =
match get_std fg v with
| Some (Std_value (std_fg, std_v)) -> Std_ast.Printast.ast std_fg fmt std_v
| None -> Printast.ast fg fmt v
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now properly dump ast for Documentation and Repl_file

List.map ~f:mk_cmt (Lexer.comments ())
in
let tokens =
(* mld files can not always be lexed using the ocaml lexer *)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer true, parsing documentation no longer goes through this help function.

@EmileTrotignon
Copy link
Copy Markdown
Collaborator

I know this part of the code was a bit convoluted, thanks for trying to simplify it. Looks a bit hard to review though

Remove `equal_core_type` and `equal` from Extended_ast, and `ast` and
`equal` from Normalize_extended_ast. These functions are not used
anywhere in the codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
(** A parsed fragment carrying both the extended AST (used for formatting)
and the standard AST (used to verify formatting preserves semantics,
see {!Std_ast}). *)
type ('a, 'b) paired = {extended: 'a; std: 'b}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand what this type provides. Do we want to keep the two AST paired anywhere outside of Translation_unit ?
What are the benefits from moving the code from Parse_with_comments into Extended_ast ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parse_with_comments.parse should only be used for "ocaml code", code parsed by the ocaml parser/lexer. It should not be used for Documentation for instance (it's wrong in the main branch).

We need to dispatch on the syntax to call it only when appropriate, the dispatch is already done in Extended_ast, it feels natural to move it there.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand what this type provides. Do we want to keep the two AST paired anywhere outside of Translation_unit ?

I've been trying to minimize behavior changes in the PR.
We could probably parse the Std_ast only when needed in equivalent, saving the parsing for files already properly formatted. If we go that direction, we would need to have access to the source as string so that we can parse the std_ast.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have the source as string in Translation_unit, both before and after formatting. Perhaps Parse_with_comments could be extended with the dispatch ? (and possibly renamed)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can possible move back some parsing logic into Parse_with_comments. But I'm not sure it gives much benefit.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added back Parse_with_comments.. Let me know what you think

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to see the point of keeping both Parse_with_comments and the paired type. I think I prefer to keep Parse_with_comments because otherwise, we just add a lot of code into Extended_ast that in fact work with something else than the extended ast.

I don't think the equivalent function make much sense. It should be 'a Std_ast.t -> ... -> 'a -> 'a -> ast_check_result and not work with the extended AST at all.
It also can reasonably be string -> string -> .. and we could unify Std_ast.t and Extended_ast.t:

val ext_parse : ('ext, 'std) t -> string -> 'ext
val std_parse : ('ext, 'std) t -> string -> 'std

val equivalent :
     ('ext, 'std) t
  -> ...
  -> string
  -> string
  (* or 'std -> 'std -> *)
  -> ast_check_result

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The equivalent signature you propose forces to parse the same string multiple time when iterating to find a fix point. This is something Ive optimized long time ago in the early days of ocamlformat.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • equivalent can't take a 'std because not all format have one.
  • Extracting comment is very much linked to the implementation of the parser. It seems unnatural to have the logic spread across to files.
  • Parse_with_comment and paired are orthogonal.

because otherwise, we just add a lot of code into Extended_ast that in fact work with something else than the extended ast.

The Std_ast is pretty much an implementation detail of the check of some of the file format

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Julow, think more about this. I wonder if we should inline fields from the "with_comment" record into Extended_ast.t.

  • the prefix is only relevent for use_file (and structure maybe)
  • comments is not relevent for documentation I think
  • Source is a bit weird because it deal with "ocaml" token. which we don't exactly have/want for mll mly files for instance.

The change would mean that an Extended_ast.t contains every thing needed for printing and checking.

hhugo and others added 7 commits March 26, 2026 12:14
Source.ml only needs Parsetree types (locations, tokens), not the full
Extended_ast module. Use `open Parsetree` directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make `normalize_code` a parameter of Normalize_std_ast functions instead
of importing it from Normalize_extended_ast. This decouples the two
normalization modules, with callers responsible for passing the
normalize_code function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extended_ast.t GADT now uses paired wrapper types that embed both the
extended and standard ASTs: `('a, 'b) paired = {extended: 'a; std: 'b}`.
For the 7 OCaml fragment types (Structure, Signature, Use_file,
Core_type, Module_type, Expression, Pattern), parsing produces both
ASTs in one operation. Repl_file and Documentation have no std
counterpart.

Parse.ast now internally calls Std_ast.Parse.ast. All code that
consumes the AST (map, Printast.ast, Fmt_ast, Cmts) works through the
paired type, accessing .extended for formatting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add functions that use the embedded std ASTs from the paired GADT:

- `get_std`: extract the std AST with its Std_ast.t witness
- `dump` / `dump_normalized`: print std AST for debug output
- `equivalent`: check AST preservation using the std ASTs, returning
  Ast_preserved | Docstrings_moved _ | Ast_changed

These will replace the manual Normalize_std_ast calls in
Translation_unit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the parsing pipeline (warnings, hash-bang, tokens, Source.t
creation) from Parse_with_comments into Extended_ast:

- Extended_ast.parse is the full parsing pipeline
- Extended_ast.parse_toplevel dispatches Use_file vs Repl_file
- Extended_ast.Parsed.t replaces Parse_with_comments.with_comments
- Extended_ast.Warning50 replaces Parse_with_comments.Warning50
- Delete Parse_with_comments.ml/.mli

Simplify Translation_unit: remove std_fg/std_parsed parameters from
format and parse_and_format, use Extended_ast.equivalent for AST
validation. Translation_unit no longer references Std_ast directly.

Rewrite printast tool to use the unified Extended_ast.parse API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Std_ast becomes a purely internal implementation detail. Remove
constructors and functions that are no longer needed now that
Translation_unit no longer references Std_ast directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add docstring explaining that the standard parser is used to verify
formatting preserves semantics: two programs are equivalent if the
compiler parses them identically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhugo
Copy link
Copy Markdown
Collaborator Author

hhugo commented Mar 26, 2026

I know this part of the code was a bit convoluted, thanks for trying to simplify it. Looks a bit hard to review though

The PR is now split into more digest commits

@hhugo hhugo force-pushed the simplify-extended-std-ast branch from e6c6c94 to 9c3fbe0 Compare March 26, 2026 11:48
hhugo and others added 2 commits March 26, 2026 12:51
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-create Parse_with_comments with the parsing pipeline (Parsed type,
Warning50 exception, parse, parse_toplevel) that was absorbed into
Extended_ast. Re-export Extended_ast.Parse in the .mli. Update all
callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants