Formatter / schema.org / Add croissant spec 🥐 #8939

Open
fxprunayre wants to merge 10 commits into main from 44-croissant

Member

@fxprunayre fxprunayre commented Jul 16, 2025

"Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools. Croissant builds on schema.org, and its Dataset vocabulary, a widely used format to represent datasets on the Web, and make them searchable." https://docs.mlcommons.org/croissant/

Croissant extends schema.org. This pull request revises the current schema.org formatter to support the additional 🥐 metadata that is available in ISO records. This mainly means adding:

  • croissant fileObject based on online resources with a download protocol
  • croissant recordSet based on the feature catalogue
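
To illustrate the two additions above, here is a minimal Python sketch of the kind of JSON-LD they could produce. It is not the formatter from this pull request (which is XSLT-based). The helper names `file_object` and `record_set`, the example URL, and the attribute-to-type mapping are assumptions made for illustration, and the `@context` is simplified compared with the one defined by the Croissant specification, which also expects further properties such as checksums.

```python
import json


def file_object(resource_id, name, url, mime_type):
    """Build a Croissant FileObject from one downloadable online resource."""
    return {
        "@type": "cr:FileObject",
        "@id": resource_id,
        "name": name,
        "contentUrl": url,
        "encodingFormat": mime_type,
    }


def record_set(set_id, source_file_id, columns):
    """Build a Croissant RecordSet from feature catalogue attributes.

    `columns` maps attribute names to data types (hypothetical mapping).
    """
    return {
        "@type": "cr:RecordSet",
        "@id": set_id,
        "field": [
            {
                "@type": "cr:Field",
                "@id": f"{set_id}/{column}",
                "name": column,
                "dataType": data_type,
                # Each field points back to the file and column it is read from.
                "source": {
                    "fileObject": {"@id": source_file_id},
                    "extract": {"column": column},
                },
            }
            for column, data_type in columns.items()
        ],
    }


dataset = {
    # Simplified context, for illustration only.
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    "name": "Example dataset",
    "distribution": [
        file_object("data.csv", "data.csv", "https://example.org/data.csv", "text/csv")
    ],
    "recordSet": [
        record_set("observations", "data.csv", {"station": "sc:Text", "depth": "sc:Float"})
    ],
}

print(json.dumps(dataset, indent=2))
```

The record set refers to the file object by `@id`, which is how a column described in the feature catalogue can be tied to the downloadable file that contains it.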

Refactor the JSON-LD formatter so that ISO19139 and ISO19115-3 share the same base formatter, which makes maintenance easier (similar to the citation and DCAT formatters).

Improve formatters that produce JSON output: ensure the output is valid JSON, format it, and log any error so that failures can be tracked and poorly handled encoding can be fixed.
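
The validate, format and log steps can be sketched in Python as follows. This only illustrates the idea and is not the code in this pull request; the function name `validate_and_format`, the logger name, and the `record_id` parameter are hypothetical.

```python
import json
import logging

logger = logging.getLogger("formatter.json")


def validate_and_format(raw_text, record_id):
    """Return pretty-printed JSON, or None if raw_text is not valid JSON."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as error:
        # Log the position of the failure so the faulty encoding can be traced.
        logger.error(
            "Invalid JSON for record %s at line %d, column %d: %s",
            record_id,
            error.lineno,
            error.colno,
            error.msg,
        )
        return None
    # Re-serialize so the response is consistently indented.
    return json.dumps(parsed, indent=2, ensure_ascii=False)
```

An unescaped double quote inside a string value, a typical symptom of text written out without JSON escaping, makes `json.loads` fail, so the function logs the error and returns `None` instead of passing a broken document on.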

schema.org improvements:

  • inLanguage corresponds to the resource language, not the metadata language.
  • dispatch parties by role instead of only using producer (e.g. provider, producer, copyrightHolder, publisher, author, funder)
  • do not generate an element (e.g. temporalCoverage) if the input document has no corresponding element
  • add Dublin Core schema plugin support
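
A rough Python sketch of the first three points above. The mapping from ISO role codes to schema.org properties in `ROLE_TO_PROPERTY` is an assumption for illustration and may differ from the mapping used by the formatter; the function names and the input structure are hypothetical as well.

```python
# Hypothetical mapping from ISO role codes to schema.org properties.
ROLE_TO_PROPERTY = {
    "author": "author",
    "publisher": "publisher",
    "funder": "funder",
    "originator": "producer",
    "resourceProvider": "provider",
    "owner": "copyrightHolder",
}


def dispatch_parties(parties):
    """Group parties under the schema.org property matching their role."""
    grouped = {}
    for party in parties:
        prop = ROLE_TO_PROPERTY.get(party["role"])
        if prop is None:
            continue  # roles without a mapping are skipped in this sketch
        grouped.setdefault(prop, []).append(
            {"@type": "Organization", "name": party["name"]}
        )
    return grouped


def build_dataset(title, parties, resource_language=None, temporal_extent=None):
    """Build a schema.org Dataset, emitting optional elements only when present."""
    dataset = {"@context": "https://schema.org", "@type": "Dataset", "name": title}
    dataset.update(dispatch_parties(parties))
    if resource_language:
        # Language of the resource itself, not of the metadata record.
        dataset["inLanguage"] = resource_language
    if temporal_extent:
        # temporal_extent is a (start, end) pair of ISO 8601 strings.
        dataset["temporalCoverage"] = "/".join(temporal_extent)
    return dataset
```

With this shape, a record without a temporal extent simply has no `temporalCoverage` key, rather than an empty or placeholder value.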

Similar initiatives:

Checklist

  • I have read the contribution guidelines
  • Pull request provided for main branch, backports managed with label
  • Good housekeeping of code, cleaning up comments, tests, and documentation
  • Clean commit history broken into understandable chunks, avoiding big commits with hundreds of files, cautious of reformatting and whitespace changes
  • Clean commit messages, longer verbose messages are encouraged
  • API Changes are identified in commit messages
  • Testing provided for features or enhancements using automatic tests
  • User documentation provided for new features or enhancements in manual
  • Build documentation provided for development instructions in README.md files
  • Library management using pom.xml dependency management. Update build documentation with intended library use and library tutorials or documentation

Funded by BRGM & Ifremer.

@fxprunayre fxprunayre added this to the 4.4.9 milestone Jul 16, 2025
A formatter producing JSON may produce an invalid document, because the XSLT process outputs text that is written directly to the response. Ensure the JSON is valid and format it. Log any error so that errors can be monitored and the formatter improved where encoding is not handled well.

In a future version, consider using XSLT 3.0, which supports JSON output (https://www.w3.org/TR/xslt-30/#json).
@sonarqubecloud sonarqubecloud bot commented Aug 8, 2025

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@fxprunayre fxprunayre marked this pull request as ready for review September 10, 2025 08:56
Contributor

@jodygarnett jodygarnett left a comment

Please include docs in PR.

Updated the search engines section with details on robots.txt and sitemap usage, including examples for better indexing.
@fxprunayre
Member Author

> Please include docs in PR.

Additional details about SEO & schema.org added. See https://github.com/geonetwork/core-geonetwork/blob/44-croissant/docs/manual/docs/tutorials/introduction/extra/index.md#search-engines

In GN5, maybe at some point we should create a dedicated section for each formatter to explain its context and usage.

@jmckenna

@fxprunayre thanks. One comment:

  • I was thinking that "GeoNetwork includes on any html representation of a metadata record a representation of that record in schema.org encoded as json-ld." should be changed to "GeoNetwork includes a representation of that record in schema.org encoded as JSON-LD, on any HTML representation of a published metadata record."

Note my mention of published (and added link for JSON-LD).

Of course I could be wrong, but, this is what I have found. Feel free to correct me here :)

@jmckenna

I might even expand that (as not everyone knows what "schema.org" is) to: "GeoNetwork includes a representation of that record in the schema.org (Structured Data for the Web) framework, encoded as JSON-LD, on any HTML representation of a published metadata record."

@jmckenna

And further expanded (as the JSON-LD is only embedded on API-rendered HTML pages, not "any" record's HTML page): "GeoNetwork includes a representation of that record in the schema.org (Structured Data for the Web) framework, encoded as JSON-LD, on any HTML representation (through the GeoNetwork API / sitemap) of a published metadata record."
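
One way to check which HTML pages actually embed JSON-LD is to look for `script` elements of type `application/ld+json` in the page source. Below is a minimal Python sketch using only the standard library. The class and function names are hypothetical and the sample HTML is invented; it is not output from GeoNetwork.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


def extract_json_ld(html):
    """Return the list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Invented sample page, for illustration only.
sample_html = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Dataset", "name": "Example"}'
    "</script></head><body></body></html>"
)

for document in extract_json_ld(sample_html):
    print(document["@type"], document["name"])
```

Running the same function against the HTML of a record page and against the HTML reached through the API or sitemap would show where the JSON-LD is present and where it is not.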

@jahow jahow modified the milestones: 4.4.9, 4.4.10 Oct 7, 2025
@KoalaGeo
Contributor

Out of interest, is this still on track, @fxprunayre?
