perf(ingest): drop redundant per-dataset validate_data_catalog call (#675)
Merged
`register_dataset` re-validated every per-dataset slice produced by `_ingest_catalog`'s groupby, even though the parent catalog (or streamed chunk) had already been validated by `ingest_datasets`/the CLI. The re-check (required columns + per-dataset metadata uniqueness) is mathematically a subset of what runs at the chunk/whole-catalog level, so the inner call only ever duplicates work in production. Move the responsibility onto the caller, document the precondition, and add an explicit `validate_data_catalog` step to the seeded-DB test fixture, which was the only path that previously relied on the inner call. Measured impact on a synthetic 50k-file / 500-dataset CMIP6 ingest: 13.88s -> 10.88s (-21.6%); `validate_data_catalog` call count 505 -> 5.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Add back a cheap defensive guard in `register_dataset` that catches callers which bypass the upstream `validate_data_catalog` contract and hand in a single-slug slice with inconsistent dataset-specific metadata. Uses a direct `nunique` on the slice (no groupby), so the work is O(rows x cols) for a single dataset rather than the full-catalog groupby that was just removed - cost stays in the microseconds and the 21.6% wall-time win on the synthetic 50k/500 benchmark is preserved. Adds a unit test that constructs an inconsistent slice and asserts the guard raises with a precise error message.
Switch the per-slice metadata-consistency guard from `ValueError` to `RefException` so it matches the multi-slug guard already in this function and the rest of the ingest error surface. Docstring now documents the precondition and the exception explicitly. Test updated to match.
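Putting the two comments above together, the guard might look roughly like this sketch. Only `RefException` and the direct-`nunique`-on-the-slice approach come from the PR; the column names, error messages, and surrounding function body are illustrative assumptions.

```python
import pandas as pd


class RefException(Exception):
    """Same error type as the existing multi-slug guard."""


def register_dataset(slice_: pd.DataFrame) -> dict:
    """Register a single-dataset slice.

    Precondition: the parent catalog was validated upstream by
    validate_data_catalog. Raises RefException for a multi-slug
    slice or for inconsistent dataset-specific metadata.
    """
    if slice_["slug"].nunique() != 1:
        raise RefException("register_dataset expects a single-slug slice")
    # Defensive guard: direct nunique on this one slice, no groupby,
    # so cost is O(rows x cols) for a single dataset.
    meta_cols = [c for c in slice_.columns if c not in ("slug", "path")]
    bad = [c for c in meta_cols if slice_[c].nunique(dropna=False) > 1]
    if bad:
        raise RefException(
            f"inconsistent dataset-specific metadata in columns: {bad}"
        )
    return slice_.iloc[0].to_dict()
```

A unit test along the lines the PR describes would build a slice with one slug but, say, two different `source_id` values, and assert that `register_dataset` raises `RefException` with the expected message.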
Summary
`register_dataset` re-ran `validate_data_catalog` on every per-dataset slice produced by `_ingest_catalog`'s `groupby(slug)`. The parent catalog (or each streamed chunk) is already validated up-front by the CLI / `ingest_datasets`, so the inner re-check (required columns + per-dataset metadata uniqueness) was mathematically a subset of work already done. The only path that previously relied on the inner call was the seeded-DB fixture in `packages/climate-ref/conftest.py`, which built a catalog with `find_local_datasets` and skipped explicit validation. That fixture now validates the catalog once before iterating.

Measured impact
Synthetic 50k-file / 500-dataset CMIP6 DRS archive on NVMe, streaming with `chunk_size=10_000`:

| Metric | Before | After |
| --- | --- | --- |
| `register_dataset` cumtime | 13.88s | 10.88s (-21.6%) |
| `validate_data_catalog` calls | 505 | 5 |

Behavioral notes
- `--skip-invalid=False` still raises on bad data — the exception now originates from the chunk-level / whole-catalog validate (one call up the stack) instead of the redundant per-slice one.
- Passing a multi-slug slice still raises `RefException` at the top of `register_dataset`.
- The one behavioral gap: a caller that bypasses `validate_data_catalog` and feeds inconsistent dataset-specific metadata for that slug. In that case the function now silently picks `iloc[0]` instead of raising.

Test plan
- `uv run pytest packages/climate-ref/tests/unit/datasets/` — 199 passed, 1 xfailed
- `uv run pytest packages/climate-ref/tests/unit/cli/test_datasets.py packages/climate-ref/tests/integration` — 95 passed
- `make pre-commit` on touched files — clean