Summary
A persistent ref.toml (on a volume shared between workers, CLI, and API) should be durable across image upgrades. Today two separate problems can take every process down after an otherwise routine upgrade:
- Fields that reference packaged data files (currently: dimensions_cv) get serialized as absolute venv paths that break when the Python version or venv layout changes.
- Fields that reference named modules (currently: [[diagnostic_providers]]) hard-crash the whole process if any entry no longer resolves.
Both were hit during an upgrade from climate-ref-frontend:v0.2.x (py3.11) to v0.3.0 (py3.13). The full trail is in Climate-REF/ref-app#29.
1. Packaged data should behave like ignore_datasets_file, not a venv path
ignore_datasets_file already solves this cleanly (packages/climate-ref/src/climate_ref/config.py::_get_default_ignore_datasets_file): the default materializes a cached copy under REF_CONFIGURATION/cache/climate_ref/, i.e. on the persistent volume, not in the venv. The path written to ref.toml stays valid across image changes, and the loader re-materializes if the cached file is missing.
dimensions_cv uses a different pattern (DimensionsCV._dimensions_cv_factory): it resolves importlib.resources.files("climate_ref_core.pycmec") / "cv_cmip7_aft.yaml", gets the resolved absolute path (e.g. /app/.venv/lib/python3.11/site-packages/...), and that path gets persisted into ref.toml. When the image moves to py3.13 the file is still present at the new venv path, but the stale py3.11 path in ref.toml wins and the service fails at startup with FileNotFoundError.
Generalize the ignore_datasets_file approach to any packaged data file referenced from ref.toml. Options:
- Introduce a helper (e.g. _get_default_packaged_file(pkg, filename) -> Path) that materializes the package resource into REF_CONFIGURATION/cache/<pkg>/<filename> on first read, the way _get_default_ignore_datasets_file does for the grey list. dimensions_cv (and any future sibling) uses this instead of importlib.resources directly.
- Or don't serialize packaged-data fields to ref.toml at all when the value equals the packaged default; let the loader re-resolve from importlib.resources on every start.
- Or store a sentinel / packaged:<pkg>/<file> URI rather than an absolute filesystem path; the loader expands it at load time.
Any of these fixes the symptom. The first matches the existing grey-list pattern and keeps a stable on-disk artefact under REF_CONFIGURATION, which is the shape the deployment already expects.
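A minimal sketch of the first option, under stated assumptions: get_default_packaged_file and its config_dir parameter are illustrative names, not existing API, and the real helper would presumably read REF_CONFIGURATION itself rather than take the directory as an argument. It copies the package resource to cache/<pkg>/<filename> only when the cached copy is missing:

```python
import os
from importlib import resources
from pathlib import Path


def get_default_packaged_file(pkg: str, filename: str, config_dir: Path) -> Path:
    """Return a stable on-disk copy of a packaged resource.

    Hypothetical helper: the copy lives under <config_dir>/cache/<pkg>/<filename>,
    so the returned path does not depend on the venv or Python version.
    """
    target = Path(config_dir) / "cache" / pkg / filename
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Read the resource through importlib.resources so it is found
        # wherever the currently running interpreter installed the package.
        data = resources.files(pkg).joinpath(filename).read_bytes()
        # Write to a per-process temp file, then rename atomically, so
        # concurrent workers never observe a half-written cache file.
        tmp = target.with_name(f".{target.name}.{os.getpid()}.tmp")
        tmp.write_bytes(data)
        os.replace(tmp, target)
    return target
```

Because the returned path sits on the persistent volume, it stays valid when the venv moves, and a deleted cache file is re-created on the next call. One design point this sketch leaves open: the cached copy is never refreshed when a newer package ships an updated file.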
2. Unknown providers should degrade, not crash
climate_ref_core.providers.import_provider raises InvalidProviderException on ModuleNotFoundError, and ProviderRegistry.build_from_config propagates it, so one stale entry in [[diagnostic_providers]] kills every process (API, workers, CLI) at boot.
Trigger: ref.toml still carried provider = "climate_ref_example:provider" after the image stopped shipping that package. Nothing worked until the entry was manually sed'd out.
Fix options (pick at least one):
- In build_from_config (or the config loader), wrap each provider import in its own try/except for ModuleNotFoundError (surfaced today as InvalidProviderException): log a warning, skip the entry, continue. A missing provider should not take down the whole stack.
- Add a ref config providers prune command (and run it implicitly from ref providers setup) that removes entries whose module doesn't resolve, so ref.toml self-heals after upgrades that drop a provider.
- Discover providers by entry-point / distribution name rather than dotted module path, so ref.toml doesn't have to redeclare what's installed.
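The first option could look roughly like the sketch below. It is self-contained and does not use the real import_provider or ProviderRegistry signatures: load_providers is a hypothetical stand-in, and it catches ModuleNotFoundError directly where the real code would likely catch InvalidProviderException.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_providers(entries: list[str]) -> dict[str, object]:
    """Import each "module:attribute" entry, skipping modules that are not installed.

    Hypothetical stand-in for the per-provider loop in build_from_config.
    """
    loaded: dict[str, object] = {}
    for entry in entries:
        module_name, _, attr = entry.partition(":")
        try:
            module = importlib.import_module(module_name)
        except ModuleNotFoundError as exc:
            # Degrade instead of crashing: log the stale entry and move on.
            logger.warning("Skipping provider %r: %s", entry, exc)
            continue
        loaded[entry] = getattr(module, attr)
    return loaded


# Stand-in entries: two stdlib callables play the role of valid providers.
providers = load_providers(
    ["json:loads", "no_such_provider_pkg_xyz:provider", "math:sqrt"]
)
```

With the three entries above, the two resolvable ones are returned and the unknown one only produces a warning. Note that a provider module that exists but fails on a missing dependency of its own would also be skipped here; comparing exc.name with module_name would separate the two cases if that distinction matters.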
Acceptance
- Path durability: a ref.toml written by one Python / image loads cleanly under a different Python / image without operator intervention for any field that currently references package data. Covered by a round-trip test that simulates the py3.11 → py3.13 move for dimensions_cv.
- Provider resilience: a ref.toml with one unknown provider and two valid ones starts successfully; the two valid providers register, the unknown one is logged and skipped. Covered by a unit test against build_from_config.
- Existing behavior unchanged when everything listed is installed.