feat(waterdata): add get_waterdata for generalized CQL2 queries #284

Draft
thodson-usgs wants to merge 2 commits into DOI-USGS:main from thodson-usgs:worktree-get-waterdata-cql

@thodson-usgs
Collaborator

Summary

Adds get_waterdata(service, cql, ...) — Python analogue of R dataRetrieval::read_waterdata. The typed wrappers (get_daily, get_continuous, get_peaks, …) only support exact-equality predicates on whitelisted parameters. Some users need more expressive queries:

  • top-level or instead of just and
  • like with % wildcards (e.g. HUC prefix match)
  • comparison operators (<, >, between)
  • nested boolean trees
  • geometry predicates beyond a bbox

This function gives them a single entry point that accepts a raw CQL2 query (either a Python dict or a pre-serialized JSON string), POSTs it against any recognized OGC collection, walks pages, and runs the same post-processing pipeline (_deal_with_empty → _type_cols → _arrange_cols → _sort_rows) the typed wrappers use.

CQL2 grammar reference: https://api.waterdata.usgs.gov/docs/ogcapi/complex-queries/

Examples

from dataretrieval import waterdata

# 1. Daily values for two parameter codes at two sites — compound AND-of-INs.
cql = {
    "op": "and",
    "args": [
        {"op": "in", "args": [
            {"property": "parameter_code"},
            ["00060", "00065"],
        ]},
        {"op": "in", "args": [
            {"property": "monitoring_location_id"},
            ["USGS-07367300", "USGS-03277200"],
        ]},
    ],
}
df, md = waterdata.get_waterdata(service="daily", cql=cql)

# 2. Monitoring locations whose HUC starts with "02070010" — LIKE with %.
df, md = waterdata.get_waterdata(
    service="monitoring-locations",
    cql='{"op": "like", "args": ['
        '{"property": "hydrologic_unit_code"}, "02070010%"]}',
)

Both examples mirror the R reference.
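Beyond those two, the other query shapes listed in the Summary (top-level or, between, nested trees) can be sketched as plain dicts. This is an illustrative sketch only: the operator spellings follow the CQL2 JSON encoding, hydrologic_unit_code comes from the example above, and drainage_area is a hypothetical numeric property that may not exist under that name on the collection.

```python
import json

# Top-level OR of two LIKE predicates (HUC prefixes are illustrative).
or_query = {
    "op": "or",
    "args": [
        {"op": "like", "args": [{"property": "hydrologic_unit_code"}, "02070010%"]},
        {"op": "like", "args": [{"property": "hydrologic_unit_code"}, "02070008%"]},
    ],
}

# Nested tree: (HUC prefix match) AND (numeric range).
# "drainage_area" is a HYPOTHETICAL property name used for illustration.
nested_query = {
    "op": "and",
    "args": [
        or_query,
        {"op": "between", "args": [{"property": "drainage_area"}, 10, 500]},
    ],
}

# Either form the PR accepts: the dict itself, or its JSON serialization.
nested_json = json.dumps(nested_query)

# Not executed here because it needs network access:
# df, md = waterdata.get_waterdata(service="monitoring-locations", cql=nested_query)
```

Building the tree as a dict and serializing once avoids hand-concatenating JSON strings as in example 2.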

API

def get_waterdata(
    service: str,
    cql: str | dict,
    *,
    properties: str | Iterable[str] | None = None,
    bbox: list[float] | None = None,
    limit: int | None = None,
    skip_geometry: bool | None = None,
    convert_type: bool = True,
    client: requests.Session | None = None,
) -> tuple[pd.DataFrame, BaseMetadata]:
  • service validated against the recognized OGC collections (daily, continuous, latest-*, peaks, field-measurements*, channel-measurements, monitoring-locations, time-series-metadata, combined-metadata).
  • cql accepts dict (JSON-serialized internally) or str (passed through verbatim).
  • properties honors the same "id" → output_id rewrite the typed wrappers do.
  • client lets callers reuse an HTTP session.
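The service and cql rules above can be illustrated with a small, self-contained sketch. It is not the PR's code: normalize_cql and RECOGNIZED_SERVICES are hypothetical names, and the set holds only a subset of the collections listed above.

```python
import json

# HYPOTHETICAL subset of the recognized collections, for illustration only.
RECOGNIZED_SERVICES = {"daily", "continuous", "peaks", "monitoring-locations"}


def normalize_cql(service, cql):
    """Validate the service name and return the CQL2 body as a JSON string."""
    if service not in RECOGNIZED_SERVICES:
        raise ValueError(f"Unrecognized service: {service!r}")
    if isinstance(cql, dict):
        # dict input is JSON-serialized internally
        return json.dumps(cql)
    if isinstance(cql, str):
        # str input is passed through verbatim (no parsing or re-encoding)
        return cql
    raise TypeError("cql must be a dict or a JSON string")


body = normalize_cql(
    "daily",
    {"op": "in", "args": [{"property": "parameter_code"}, ["00060"]]},
)
```

Passing strings through untouched means a malformed JSON string only surfaces as a server-side error. That is a trade-off of the verbatim rule as described, not something this sketch tries to resolve.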

What's reused vs. new

Reused unchanged: _walk_pages, _deal_with_empty, _arrange_cols, _type_cols, _sort_rows, _switch_properties_id, _default_headers, BaseMetadata. The whole post-processing pipeline drops in.

New:

  • _OUTPUT_ID_BY_SERVICE (utils.py): single mapping from service to the renamed-id column. Hoisted from the typed wrappers so the generalized entry point picks the right one.
  • _construct_cql_request (utils.py): focused POST/CQL2 request builder. Kept separate from _construct_api_requests because that function derives the CQL body from typed kwargs; here the body comes in verbatim.
  • get_waterdata (api.py): the public entry point.
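To make the split between the two builders concrete, here is a rough standard-library sketch of a POST/CQL2 request builder that takes the body verbatim. It is not the PR's _construct_cql_request: build_cql_request is a hypothetical name, the base URL is inferred from the md.url shown in the smoke test below, and the Content-Type value is an assumption about the media type the server expects.

```python
from urllib.parse import urlencode

# Inferred from the md.url in the smoke test; treat as an assumption.
BASE_URL = "https://api.waterdata.usgs.gov/ogcapi/v0/collections"


def build_cql_request(service, cql_body, *, limit=None, skip_geometry=False):
    """Describe a POST request whose body is the caller's CQL2 JSON, unchanged."""
    params = {"skipGeometry": skip_geometry}
    if limit is not None:
        params["limit"] = limit
    url = f"{BASE_URL}/{service}/items?{urlencode(params)}"
    # ASSUMED media type for CQL2 JSON bodies; verify against the API docs.
    headers = {"Content-Type": "application/query-cql-json"}
    return {"method": "POST", "url": url, "headers": headers, "data": cql_body}


req = build_cql_request(
    "daily",
    '{"op":"in","args":[{"property":"monitoring_location_id"},["USGS-02238500"]]}',
    limit=5,
)
```

Query parameters travel in the URL and the filter travels in the body, so the builder never has to inspect the CQL itself.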

Smoke test

>>> df, md = get_waterdata(service="daily",
...     cql='{"op":"in","args":[{"property":"monitoring_location_id"},["USGS-02238500"]]}',
...     limit=5)
>>> md.url
'https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?skipGeometry=False&limit=5'
>>> df.shape
(140, 12)

The POST goes out, the OGC server filters by the CQL body, pagination handles the multi-page response, and post-processing renames id → daily_id, types columns, and sorts rows. (140 rows came back from a single site with a small limit=5 page size; pagination cycled through ~28 pages.)
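The page arithmetic (140 rows at 5 per page is 28 pages) can be reproduced offline with a toy pager. This is a sketch, not the package's _walk_pages: it assumes each response carries a rel="next" link until the last page, and walk_pages, make_fake_fetch, and the fake:// URLs are all hypothetical.

```python
def walk_pages(first_url, fetch):
    """Collect features across pages by following rel="next" links."""
    features = []
    pages = 0
    url = first_url
    while url is not None:
        page = fetch(url)
        features.extend(page["features"])
        pages += 1
        # Stop when the response has no "next" link.
        url = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )
    return features, pages


def make_fake_fetch(total, page_size):
    """Build an offline stand-in for the HTTP call that serves `total` fake rows."""
    def fetch(url):
        offset = int(url.rsplit("=", 1)[1])
        chunk = list(range(offset, min(offset + page_size, total)))
        nxt = offset + page_size
        links = [{"rel": "next", "href": f"fake://items?offset={nxt}"}] if nxt < total else []
        return {"features": chunk, "links": links}
    return fetch


features, pages = walk_pages("fake://items?offset=0", make_fake_fetch(140, 5))
print(len(features), pages)  # 140 28
```

Here limit acts as a page size rather than a cap on total rows, which is why df.shape reports 140 rows despite limit=5.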

Out of scope

  • Unit tests for the new function (would mirror existing tests/waterdata_test.py patterns; can follow up).
  • The hash-ID drop default behavior in feat(waterdata): drop hash-valued ID columns by default #281: that PR's include_hash_ids parameter would apply uniformly to get_waterdata once it lands (no extra plumbing needed).

Test plan

  • Module import + signature
  • _construct_cql_request builds the right URL, headers, and body offline
  • Live smoke test against the USGS OGC API with a simple CQL in predicate
  • Live test with the wildcard like example (rate-limited at time of submission)
  • Unit tests via requests_mock

🤖 Generated with Claude Code

thodson-usgs and others added 2 commits May 18, 2026 19:23
Python analogue of R ``dataRetrieval::read_waterdata``. The typed
``get_*`` wrappers (``get_daily``, ``get_continuous``, …) only support
exact-equality predicates on whitelisted parameters. Some users need
more — top-level ``or``, ``like`` with ``%`` wildcards, comparison
operators, nested boolean trees — and today have no surface for it.
``get_waterdata(service, cql, ...)`` accepts a raw CQL2 query
(``dict`` or pre-serialized JSON string) and POSTs it against any
recognized collection, then walks pages and post-processes the
result with the same pipeline the typed wrappers use.

Reuses existing infrastructure: ``_walk_pages``, ``_deal_with_empty``,
``_arrange_cols``, ``_type_cols``, ``_sort_rows``, and
``_switch_properties_id``. The new pieces are:

  - ``_OUTPUT_ID_BY_SERVICE`` (utils.py) — a single mapping from
    service name to the renamed-``id`` column the rest of the package
    exposes, hoisted from the typed wrappers so the generalized entry
    point can pick the right one.
  - ``_construct_cql_request`` (utils.py) — focused POST/CQL2 request
    builder; distinct from ``_construct_api_requests`` because the
    body comes in verbatim rather than being derived from typed
    kwargs.
  - ``get_waterdata`` (api.py) — public entry point.

CQL2 grammar reference:
https://api.waterdata.usgs.gov/docs/ogcapi/complex-queries/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Code-review pass on PR DOI-USGS#284.

- Lift ``WATERDATA_SERVICES`` Literal into ``types.py``. Use it as
  the ``service`` arg type of ``get_waterdata`` so editors offer
  completion and type-checkers catch typos. The runtime source of
  truth (``_OUTPUT_ID_BY_SERVICE`` in utils.py) is unchanged; the
  Literal is kept in sync by hand and a comment notes that.

- Extract ``_ogc_query_params(properties, bbox, limit, skip_geometry)``
  in utils.py. The same ``skipGeometry``/``limit``/``bbox``/``properties``
  block previously appeared twice — once in ``_construct_api_requests``
  and once in the new ``_construct_cql_request`` — and is now built
  in one place.

- Extract ``_finalize_ogc_frame(df, response, properties, service,
  output_id, convert_type)`` for the post-processing tail
  (``_deal_with_empty`` -> ``_type_cols`` -> ``_arrange_cols`` ->
  ``_sort_rows`` -> ``BaseMetadata``). Both ``get_ogc_data`` and
  ``get_waterdata`` route through it now, so the typed-kwargs and
  raw-CQL2 paths produce identically-shaped DataFrames by
  construction rather than by parallel maintenance.

- Drop the ``client`` kwarg from ``get_waterdata``. None of the
  other public ``get_*`` getters expose it, and the rationale (HTTP
  session reuse) applies to all of them or none. If we want to
  expose session reuse, that's a separate PR that touches the whole
  family.

- Collapse the ``properties`` normalization block to None-first
  ordering so the common case (no properties) reads first.

- Drop the docstring breadcrumb to ``utils._OUTPUT_ID_BY_SERVICE``;
  point readers at ``types.WATERDATA_SERVICES`` (the user-facing
  Literal) instead.

All 148 unit tests pass; ``_construct_api_requests`` and
``_construct_cql_request`` produce byte-identical requests to before.
@thodson-usgs
Collaborator Author

do we need to shield this against string comparisons as we do in filters.py?
