Skip to content

perf(ingest): drop redundant per-dataset validate_data_catalog call#675

Merged
lewisjared merged 5 commits into
mainfrom
perf/skip-redundant-validate
May 14, 2026
Merged

perf(ingest): drop redundant per-dataset validate_data_catalog call#675
lewisjared merged 5 commits into
mainfrom
perf/skip-redundant-validate

Conversation

@lewisjared
Copy link
Copy Markdown
Contributor

@lewisjared lewisjared commented May 14, 2026

Summary

  • register_dataset re-ran validate_data_catalog on every per-dataset slice produced by _ingest_catalog's groupby(slug). The parent catalog (or each streamed chunk) is already validated up-front by the CLI / ingest_datasets, so the inner re-check (required columns + per-dataset metadata uniqueness) was mathematically a subset of work already done.
  • The only path that previously relied on the inner call was the seeded-DB test fixture in packages/climate-ref/conftest.py, which built a catalog with find_local_datasets and skipped explicit validation. That fixture now validates the catalog once before iterating.

Measured impact

Synthetic 50k-file / 500-dataset CMIP6 DRS archive on NVMe, streaming with chunk_size=10_000:

Before After
Total elapsed 13.88 s 10.88 s (-21.6%)
register_dataset cumtime 8.03 s 5.09 s
validate_data_catalog calls 505 5 (chunk-level only)

Behavioral notes

  • CLI users see no change. --skip-invalid=False still raises on bad data — the exception now originates from the chunk-level / whole-catalog validate (one call up the stack) instead of the redundant per-slice one.
  • Multi-slug-per-slice misuse is still caught immediately by the existing RefException at the top of register_dataset.
  • The only behavioral regression is for external callers that build a single-slug slice by hand, skip validate_data_catalog, and feed inconsistent dataset-specific metadata for that slug. In that case the function now silently picks iloc[0] instead of raising.

Test plan

  • uv run pytest packages/climate-ref/tests/unit/datasets/ — 199 passed, 1 xfailed
  • uv run pytest packages/climate-ref/tests/unit/cli/test_datasets.py packages/climate-ref/tests/integration — 95 passed
  • make pre-commit on touched files — clean

register_dataset re-validated every per-dataset slice produced by
_ingest_catalog's groupby, even though the parent catalog (or streamed
chunk) had already been validated by ingest_datasets/the CLI. The
re-check (required columns + per-dataset metadata uniqueness) is
mathematically a subset of what runs at chunk/whole-catalog level, so
the inner call only ever duplicates work in production.

Move the responsibility onto the caller, document the precondition, and
add an explicit validate_data_catalog step to the seeded-DB test fixture
which was the only path that previously relied on the inner call.

Measured impact on a synthetic 50k-file / 500-dataset CMIP6 ingest:
13.88s -> 10.88s (-21.6%); validate_data_catalog call count 505 -> 5.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
core 93.09% <100.00%> (+<0.01%) ⬆️
providers 91.85% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...kages/climate-ref/src/climate_ref/datasets/base.py 98.15% <100.00%> (+0.02%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Add back a cheap defensive guard in register_dataset that catches
callers which bypass the upstream validate_data_catalog contract and
hand in a single-slug slice with inconsistent dataset-specific metadata.
Uses a direct nunique on the slice (no groupby) so the work is O(rows x
cols) for a single dataset rather than the full-catalog groupby that
was just removed - cost stays in the microseconds and the 21.6%
wall-time win on the synthetic 50k/500 benchmark is preserved.

Adds a unit test that constructs an inconsistent slice and asserts the
guard raises with a precise error message.
Switch the per-slice metadata-consistency guard from ValueError to
RefException so it matches the multi-slug guard already in this
function and the rest of the ingest error surface. Docstring now
documents the precondition and the exception explicitly. Test updated
to match.
@lewisjared lewisjared merged commit 8390552 into main May 14, 2026
26 checks passed
@lewisjared lewisjared deleted the perf/skip-redundant-validate branch May 14, 2026 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant