Proposal: published extension schema should be self-contained #5

@jisantuc

Currently the extension schemata are a mix of self-contained files (like file and label) and schemata that require arbitrary URI resolution (like tiled assets and card4l). If we use remote references in the published schemata, we expose ourselves to two kinds of risk:

  1. Reading a URI can fail for many more reasons than reading a self-contained file. The URI could be behind an authenticated endpoint. The server could be down. Someone could have replaced the content at the URI with something else, by accident or maliciously. These risks multiply with each URI we have to read. In the authenticated case, open source servers like Franklin and stac-fastapi would need some way to authenticate, and figuring out how to provide that is hard. (This is still a problem with self-contained extensions, but less so, because there are fewer links.)
  2. Deep or wide trees of refs increase the latency of validating an item. For example, tiled-assets references the item schema, which references remote schemata for GeoJSON features (by URL), basics, datetime, instrument, licensing, and provider (by relative path), and the catalog schema, which references the catalog-core schema. So to take one JSON item and validate it against the tiled-assets extension (the first time; obviously these things can be cached), I have to make ten HTTP requests.
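
To make the request count concrete, here is a rough Python sketch that walks a tree of `$ref`s and collects every distinct document that has to be read before validation can start. The `example.com` URLs and the `FAKE_WEB` mapping are made up for illustration and are not the real STAC schemata; `fetch` stands in for whatever HTTP client a server would use. The sketch assumes every ref resolves against the URL of the document containing it and ignores `$id`.

```python
from urllib.parse import urljoin, urldefrag


def iter_refs(node):
    """Yield every "$ref" string found anywhere in a parsed JSON schema."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def documents_needed(root_url, fetch):
    """Return the set of document URLs read while resolving root_url.

    `fetch` is a caller-supplied function mapping a URL to a parsed schema.
    """
    seen = set()
    pending = [root_url]
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        for ref in iter_refs(fetch(url)):
            # Resolve relative refs against the containing document and
            # drop the fragment, since one request covers the whole document.
            target, _fragment = urldefrag(urljoin(url, ref))
            if target not in seen:
                pending.append(target)
    return seen


# Hypothetical stand-in for the network (not the real STAC schemata).
FAKE_WEB = {
    "https://example.com/ext/schema.json": {
        "allOf": [{"$ref": "../item/item.json"}],
    },
    "https://example.com/item/item.json": {
        "allOf": [
            {"$ref": "basics.json#/definitions/core"},
            {"$ref": "https://example.com/geojson/Feature.json"},
            {"$ref": "#/definitions/local"},
        ],
        "definitions": {"local": {"type": "object"}},
    },
    "https://example.com/item/basics.json": {
        "definitions": {"core": {"type": "object"}},
    },
    "https://example.com/geojson/Feature.json": {"type": "object"},
}

needed = documents_needed("https://example.com/ext/schema.json", FAKE_WEB.__getitem__)
```

Each entry in `needed` is one read that can fail or add latency on a cold cache; a self-contained schema would always give a set of size one.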

Additionally, there are varying degrees of JSON schema remote $ref support in common languages used for STAC:

  • Everit (Java), which backs circe-json-schema (Scala), expects to read refs as file paths
  • Ajv (JS) allows providing an arbitrary loading function, but the link explaining the option 404s. This shifts the complexity onto the user, who is responsible for correctly interpreting each ref.
  • JSON Schema in Python makes some guesses about what kind of ref you have and attempts to resolve it
  • I don't know anything about C#, PHP, or R support, notes welcome.

The cost of doing away with remote refs everywhere is duplication and no more inheritance. That's a pretty hefty cost, which is why I'm only proposing that published schemata be self-contained. In particular:

  • the repository versions of the schema can still refer to whatever they want, but
  • the template should have Node scripts for inlining every referenced schema
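
As a rough idea of what such an inlining step might do, here is a minimal sketch. It is written in Python rather than as the Node script proposed above, purely to show the logic. `inline_refs`, `resolve_pointer`, and the `example.com` schemata are hypothetical names and data, not part of any existing template. It assumes draft-07 semantics (keywords next to a `$ref` are ignored), resolves refs against the URL of the containing document without honouring `$id`, keeps refs that point into the published document as local refs, and refuses to inline cyclic remote refs.

```python
from urllib.parse import urljoin, urldefrag, unquote


def resolve_pointer(document, pointer):
    """Follow a JSON Pointer such as "/definitions/id" into a document."""
    node = document
    if not pointer:
        return node
    for part in pointer[1:].split("/"):
        part = part.replace("~1", "/").replace("~0", "~")
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def inline_refs(node, base_url, root_url, fetch, active=()):
    """Return a copy of `node` with every remote $ref replaced by its target.

    `fetch` is a caller-supplied function mapping a URL to a parsed schema.
    """
    if isinstance(node, list):
        return [inline_refs(item, base_url, root_url, fetch, active) for item in node]
    if not isinstance(node, dict):
        return node
    ref = node.get("$ref")
    if isinstance(ref, str):
        absolute = urljoin(base_url, ref)
        target_url, fragment = urldefrag(absolute)
        if target_url == root_url:
            # Already inside the published document: keep it as a local ref.
            return {"$ref": "#" + fragment}
        if absolute in active:
            raise ValueError(f"cyclic $ref cannot be inlined: {absolute}")
        target = resolve_pointer(fetch(target_url), unquote(fragment))
        # Refs inside the target are relative to the target's own URL.
        return inline_refs(target, target_url, root_url, fetch, active + (absolute,))
    return {
        key: inline_refs(value, base_url, root_url, fetch, active)
        for key, value in node.items()
    }


# Hypothetical repository-style schemata that still use remote refs.
SCHEMAS = {
    "https://example.com/ext/schema.json": {
        "allOf": [{"$ref": "../item/item.json"}, {"$ref": "#/definitions/ext"}],
        "definitions": {"ext": {"required": ["tiles"]}},
    },
    "https://example.com/item/item.json": {
        "type": "object",
        "properties": {"id": {"$ref": "basics.json#/definitions/id"}},
    },
    "https://example.com/item/basics.json": {
        "definitions": {"id": {"type": "string"}},
    },
}

ROOT = "https://example.com/ext/schema.json"
published = inline_refs(SCHEMAS[ROOT], ROOT, ROOT, SCHEMAS.__getitem__)
```

The repository version keeps its refs and inheritance; only `published` would be uploaded, and the duplication cost is paid by the script instead of by maintainers.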

The benefits of inlining will be that any language with a tool that can load a JSON schema from JSON will be equally supported for STAC tooling work, and servers won't have to do as much work the first time they see a schema URL.
