Skip to content

refactor(waterdata): Unify list and filter chunkers into one joint planner#283

Draft
thodson-usgs wants to merge 17 commits into
DOI-USGS:mainfrom
thodson-usgs:chunker-unified
Draft

refactor(waterdata): Unify list and filter chunkers into one joint planner#283
thodson-usgs wants to merge 17 commits into
DOI-USGS:mainfrom
thodson-usgs:chunker-unified

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

Stacked on top of #280. The diff includes #280's chunker commits; rebase after #280 merges to see only the unification delta (~460 added / 470 removed = net −10 lines).

Replaces the two-decorator stack (@multi_value_chunked outside @filters.chunked) with a single planner that allocates URL byte budget across list dims and filter clauses together.

Why

The old two-layer design had a real suboptimality: the outer planner sized list chunks against the inner chunker's bail-floor (one clause of the longest size). When both dims were simultaneously long, the outer would over-chunk the list, forcing the inner to over-chunk the filter into many sub-requests. Up to ~10× sub-request inflation in pathological cases.

It also carried accidental complexity from the cross-decorator coordination — _filter_aware_probe_args, _max_per_clause_encoding_ratio, _effective_filter_budget, and the bail-floor probe machinery existed only because the two layers couldn't see each other. The hardest-to-explain parts of the codebase.

Algorithm

  1. Enumerate filter chunk counts k = 1, 2, 4, ..., n_clauses.
  2. For each k, partition clauses into k balanced groups joined by OR and identify the worst (longest URL-encoded) group.
  3. Substitute the worst group as the filter and plan list dims with greedy halving against the remaining budget.
  4. Pick the candidate whose list_count × k is smallest.

Net code change (delta vs PR #280's state)

  • filters.py: −110 lines (retired chunked decorator + _effective_filter_budget + _max_per_clause_encoding_ratio + _NON_FILTER_URL_HEADROOM)
  • chunking.py: roughly flat (joint planner adds the budget-search loop; bail-floor coordination machinery and _filter_aware_probe_args removed)
  • utils.py: −1 line (unstacked decorators on _fetch_once)
  • Tests: rewrites of cross-decorator coordination tests collapse into joint-planner tests; new URL-construction stress test added.

Total: −12 lines net, with the structurally hardest-to-explain coordination layer gone.

Regression test for URL construction

test_joint_planner_url_construction_long_filter_and_long_sites exercises the joint planner with 500 USGS site IDs + 20 datetime OR-clauses using the real _construct_api_requests builder (not a fake). The test asserts:

  • Every sub-request URL stays under the 8000-byte limit.
  • Filter partitions cover every original clause exactly once.
  • List partitions cover every original site exactly once.
  • Total sub-request count beats the bail-floor-style worst case (500 × 20 = 10,000 → joint planner reduces to <500, and in this case finds an optimal 2 sub-requests with no filter chunking needed).

Live API verified

The canonical doc example (Ohio Stream sites → daily discharge for P7D) runs end-to-end against the live USGS API. 2,888 sites chunk into 12 sub-requests, 1,455 rows of daily discharge returned, canonical md.url preserved (58,138 bytes), cumulative md.query_time accurate.

What was preserved

  • RequestTooLarge and QuotaExhausted exception shapes — same caller-facing contract.
  • Helper functions (_split_top_level_or, _chunk_cql_or, _is_chunkable, _check_numeric_filter_pitfall, _combine_chunk_frames, _combine_chunk_responses) stay in filters.py as primitives the joint planner uses.
  • Lexicographic-comparison pitfall guard still fires on chunkable filters (moved into _plan_joint).
  • Quota-floor safety check between sub-requests.
  • Canonical URL restoration so md.url always reflects the user's original query.

Coordination with PR #282 (resume)

PR #282's ChunkManifest.completed indexes into a list-only cartesian product. With the joint planner, the index space grows by × len(filter_chunks) — a mechanical update to the manifest plan representation (add a filter_chunks dim), not a design change. To be addressed in a follow-up rebase of #282.

thodson-usgs and others added 9 commits May 17, 2026 11:44
For multi-value waterdata queries (e.g. monitoring_location_id with
~300+ sites), the GET URL produced by PR DOI-USGS#233 blows past the server's
~8 KB nginx buffer and the API returns HTTP 414. This PR adds a
chunker that transparently splits long list params across sub-requests
so each URL fits the byte budget.

The chunker is a decorator applied to ``_fetch_once`` outside the
existing ``@filters.chunked`` (CQL chunker), so list-chunking is the
outer loop and filter-chunking is the inner loop:

  @chunking.multi_value_chunked(build_request=_construct_api_requests)
  @filters.chunked(build_request=_construct_api_requests)
  def _fetch_once(args): ...

Key design points:

- ``_plan_chunks`` greedy-halves the largest chunk across all
  dimensions until the worst-case sub-request fits ``url_limit``
  (URL + body, via ``_request_bytes``, so POST routes are sized
  correctly). Cartesian product of per-dim partitions becomes the
  sub-request set; capped at ``max_chunks=1000``.

- ``_filter_aware_probe_args`` coordinates with ``filters.chunked``:
  the planner probes URL length using a synthetic clause that matches
  the inner filter chunker's bail-floor size (longest single clause,
  scaled by worst-case URL encoding ratio). Without this coordination,
  the outer planner would raise ``RequestTooLarge`` on combinations
  the stacked chunkers can actually handle.

- ``QuotaExhausted`` mid-call guard reads ``x-ratelimit-remaining``
  after each sub-request; if it drops below ``quota_safety_floor=50``,
  the wrapper raises with the partial frame, completed-chunk offset,
  and last observed remaining quota — letting callers salvage or
  resume after the rate-limit window resets, rather than crash into a
  silent mid-pagination 429.

- ``RequestTooLarge`` is raised when the smallest reducible plan
  still exceeds ``url_limit`` (every multi-value param at a singleton
  chunk and any chunkable filter at the inner chunker's bail floor)
  or when the cartesian product exceeds ``max_chunks``.

- All defaults (``url_limit``, ``max_chunks``, ``quota_safety_floor``)
  resolve at call time, so monkey-patching ``filters._WATERDATA_URL_
  BYTE_LIMIT`` for tests / non-default quotas affects the decorator
  uniformly.

Public additions:

- ``dataretrieval.waterdata.chunking.multi_value_chunked``
- ``dataretrieval.waterdata.chunking.RequestTooLarge``
- ``dataretrieval.waterdata.chunking.QuotaExhausted`` (carries
  ``partial_frame``, ``partial_response``, ``completed_chunks``,
  ``total_chunks``, ``remaining``)

Tests (30 new):

- ``_filter_aware_probe_args`` worst-case-clause modelling
- ``_plan_chunks`` greedy halving, RequestTooLarge floor, filter-
  chunker coordination, ``max_chunks`` cap, lazy-default reads
- ``multi_value_chunked`` pass-through, cartesian-product shape,
  end-to-end with stacked filter chunker
- ``QuotaExhausted`` header parsing, mid-call abort, last-chunk no-
  abort, zero-floor disable
- ``RequestTooLarge`` message contents and triggering conditions

End-to-end correctness verified against the live API: identical
per-site cell-for-cell output between unchunked (single call) and
chunked (forced fan-out via patched limit) paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two correctness gaps surfaced in review:

1. ``limit`` and ``skip_geometry`` are scalars by contract (``int | None``
   and ``bool | None``) but a list smuggled through type erasure
   (e.g. ``limit=["100","200"]`` slipping past _normalize_str_iterable
   when elements happen to be strings) would be picked up by
   ``_chunkable_params`` and fanned into multiple sub-requests with
   conflicting per-request caps. Add both to ``_NEVER_CHUNK`` so the
   chunker leaves scalar-by-contract params alone.

2. ``quota_safety_floor=0`` is the documented "disable the guard"
   sentinel, but negative values were accepted silently and also
   disabled the guard — obscuring caller intent. Reject at decoration
   time, parallel to ``_plan_chunks``'s ``max_chunks < 1`` check.
…anner

Replaces the two-decorator stack (@multi_value_chunked outside
@filters.chunked) with a single planner that allocates URL byte
budget across list dims and filter clauses together. Same correctness
guarantees, fewer sub-requests when the previous design forced the
inner filter chunker to bail at its singleton-clause floor while the
outer list chunker held the bulk of the URL budget.

Algorithm:

- Enumerate filter chunk counts k = 1, 2, 4, ..., n_clauses.
- For each k, partition clauses into k balanced groups joined by OR
  and identify the worst (longest URL-encoded) group.
- Substitute the worst group as the filter and plan the list dims
  with greedy halving against the remaining byte budget.
- Pick the candidate whose list_count × k is smallest.

Net code shrinks: -50 lines in filters.py (retired the chunked
decorator and _effective_filter_budget), +30 in chunking.py for the
joint planner (offset by removing _filter_aware_probe_args and the
bail-floor coordination machinery), unstack the decorator pair on
_fetch_once. Two existing cross-decorator coordination tests collapse
into joint-planner tests (mismatched-clause-length probing was the
hardest-to-explain artefact of the old design — gone now).

New regression test: ``test_joint_planner_url_construction_long_filter_and_long_sites``
exercises the planner with 500 USGS site IDs + 20 datetime OR-clauses
using the real ``_construct_api_requests`` builder. Confirms every
sub-request URL stays under 8000 bytes, filter partitions cover every
clause exactly once, list partitions cover every site exactly once,
and the total sub-request count beats the naive bail-floor-style
worst case by ≥4×.

Live API verified: Ohio Stream sites (2888) → daily discharge (P7D)
chunks into 12 sub-requests with canonical URL preserved and
cumulative query_time accurate.
…both chunking dims

The two chunking dimensions (list values and CQL OR-clauses) shared an
obvious primitive: "URL-encoded byte length of atoms joined by a
separator." Extract _joined_url_bytes(atoms, sep); list dims call it
with "," and filter dims call it with " OR ". _chunk_bytes collapses to
a one-liner using the helper, and the inline len(quote_plus(c or "")) in
the joint planner becomes _joined_url_bytes(group, " OR ").

Partition shape also unifies: _partition_clauses now returns
list[list[str]] (raw atom groups) instead of pre-joined strings. The
joint planner sizes candidates by _joined_url_bytes on the raw groups
and joins only the winning groups for the wrapper to iterate, so
discarded partition candidates never pay the join cost.

Side cleanups motivated by the /simplify review:

- Add "filter" to _NEVER_CHUNK so _chunkable_params doesn't need a
  k != "filter" special case alongside the frozenset check.
- Drop the redundant filter_chunkable variable in _plan_joint; derive
  from len(clauses) >= 2.
- Bug fix in _plan_joint: when there are no list dims to shrink and the
  filter alone overflows the URL limit, the planner used to pick k=1
  and emit one over-limit sub-request. Now it verifies the request
  fits with the chosen filter chunking before accepting that k.

Dead code removal:

- _chunk_cql_or and _CQL_FILTER_CHUNK_LEN in filters.py had zero
  production callers after the joint planner subsumed their role.
  Deleted, with their 4 unit tests.
- 4 _effective_filter_budget tests (function already deleted in the
  unification commit) and their _build_request / _WATERDATA_URL_BYTE_LIMIT
  test scaffolding.

Test rewrites: the three end-to-end tests that previously mocked
_effective_filter_budget (long_filter fan-out, dedup, empty-chunk
GeoDataFrame preservation) now exercise the joint planner directly via
a filter-size-aware fake URL builder. Same invariants, no mock of
removed internals.

Net diff: -180 lines across 4 files (-72 production, -108 tests).
Three small extractions and one minor optimization. No behavior change;
130 chunker/filter tests stay green.

_iter_sub_args generator yields per-sub-request args dicts; the wrapper's
nested-loop-with-manual-counter collapses to ``for i, sub_args in
enumerate(...)``. The "is this the last sub-request" branch in the
quota-floor check flips to ``if i == total - 1: continue`` so the gate
is a guard clause rather than the body of an inverse condition.

_finalize_response folds the ``_combine_chunk_responses(responses);
response.url = canonical_url`` pattern (used in both the success path
and the QuotaExhausted partial-state payload) into one helper.

_filter_candidates generator emits ``(filter_chunks, worst_filter)``
pairs for each candidate filter chunk count; ``_plan_joint`` then
iterates candidates uniformly without the ``if filter_chunkable: ...
else: ...`` fork. The redundant ``filter_chunkable`` flag is gone —
``len(clauses) >= 2`` is the single truth.

Per-iteration optimization: ``{**args, **list_overrides}`` was being
recomputed for every filter chunk; now built once per outer combo and
reused (or shallow-overridden when a filter substitution applies).

Module constants ``_LIST_SEP = ","`` and ``_OR_SEP = " OR "`` replace
the scattered string literals — the two chunking dimensions are now
self-documenting at every call site that sizes them.
…sub_args

Three micro-refinements after the previous pass settles. No behavior
change; 130 tests stay green.

- Extract _resolve_max_chunks() so the default + validation rule for
  ``max_chunks`` lives in one place, called from both _plan_list_chunks
  and _plan_joint. The 5-line if-None/if-<1 block was duplicated verbatim.
- _iter_sub_args drops its explicit ``list_keys = list(list_plan)``
  cache; iterating ``list_plan`` directly gives the same insertion-order
  sequence (Python 3.7+ dict guarantee), and ``zip(list_plan, combo)``
  reads as "pair each list-dim name with its chunk for this combo".
- Tighten the wrapper's option resolution to the "default if None else
  passed" form so each line reads in argument order.
- Categorize the _NEVER_CHUNK comment so future additions land in the
  right category instead of a flat narrative.
…rule

After investigating: the OGC getters expose ~94 list-shaped params, all
chunkable. The current 13-entry denylist captures every exception. An
allowlist would be ~7x longer and would need updating every time USGS
adds a column.

Reframe the comment to state the default rule first ("any list-shaped
kwarg gets chunked"), then enumerate the exceptions by reason
(response-shaping, structured, intervals, handled-elsewhere, scalar-by-
contract). Reads as "here are the few cases the default-chunk rule
doesn't apply" rather than "here is an arbitrary exclusion set."
Standalone runner (``python3 tests/stress_chunker.py``) that exercises
the chunker across eight scenarios with the URL byte limit lowered well
below the live API's. No live HTTP — mocks fetch_once and uses the real
_construct_api_requests for URL sizing.

Per-scenario invariants verified:

  1. Every sub-request URL ≤ url_limit (primary correctness).
  2. List-dim coverage: the union of distinct chunks issued for each
     list dim equals the input with no overlap (no data dropped, no
     duplicate fetches of the same atom within its dim).
  3. Filter-clause coverage: the distinct filter chunks split back into
     clauses, concatenated in iteration order, equal the original
     clauses (lossless OR-disjunction).
  4. Speedup vs the bail-floor-singleton baseline that the old two-
     decorator design would have produced in pathological cases.

Plus a greedy-search adaptation check: scanning ``url_limit`` across
1200 → 10000 confirms sub-request count is monotonically non-increasing
as the budget grows (the planner adapts to the limit).

Scenarios:
  A. Long sites only (pure list chunking)
  B. Long filter only (pure filter chunking)
  C. Long sites + long filter (joint trade-off — 1000× vs baseline)
  D. 3-D list cartesian product (3000× vs baseline)
  E. Lopsided clause sizes (worst-case sizing)
  F. URL-encoding-heavy clauses (quote_plus inflation)
  G. Very tight URL limit (singleton chunks)
  H. Generous URL limit (no chunking needed)
  I. url_limit sweep proving greedy adaptation

All 15 chunked calls pass every invariant.
…, trim sweep

Profile showed `_construct_api_requests` (PreparedRequest building)
dominated the stress test's runtime: 421 calls / ~152ms of the ~290ms
profile time. ~75 of those calls came from ``assert_urls_fit``
re-walking every captured sub-request to rebuild its URL after the
chunker had already built it during planning.

Two simple changes:

- ``run_chunked`` now returns a parallel ``url_bytes_seen`` list; the
  mock ``fetch_once`` captures the built URL's byte count once during
  execution. ``assert_urls_fit`` just compares ints instead of
  rebuilding PreparedRequests.
- The url_limit sweep dropped from 7 points × (150 sites, 30 clauses)
  to 5 points × (100 sites, 20 clauses). Monotonicity reads just as
  clearly with the smaller grid — the curve (8 → 2 → 2 → 1 → 1) is
  unambiguous.

Result: 118ms → 53ms per run. 13 chunked calls, every invariant still
holds.
Copy link
Copy Markdown
Collaborator Author

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pay close attention to the layout: are all variables and functions placed logically into their modules? Or has the logic been mixed up.

Comment thread dataretrieval/waterdata/chunking.py Outdated
- Chunkable dims include multi-value list params (sites, parameter
codes, ...) and the cql-text ``filter`` (split at top-level ``OR``
to keep each chunk valid CQL).
- The planner enumerates candidate filter chunk counts
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is k here?

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +125 to +130
class QuotaExhausted(RuntimeError):
"""Raised mid-chunked-call when the API's reported remaining quota
(``x-ratelimit-remaining`` header) drops below the configured safety
floor. The chunker stops before issuing the next sub-request to
avoid a mid-call HTTP 429 that would silently truncate paginated
results.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bug. A mid-call HTTP 429 should not silently truncate. If it does, fix it, then we won't need to defend against this case.

Comment thread dataretrieval/waterdata/filters.py Outdated
# per-request budget from ``_WATERDATA_URL_BYTE_LIMIT``.
_CQL_FILTER_CHUNK_LEN = 5000

# Empirically the API replies HTTP 414 above ~8200 bytes of full URL —
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved to a different module now?

thodson-usgs and others added 2 commits May 18, 2026 14:17
…r helpers, clarify docs

Three review responses bundled together:

- chunking.py module docstring: define ``k`` as the candidate filter
  chunk count before using it in the planner description.
- ``QuotaExhausted`` docstring: drop the stale "silently truncate"
  framing. PR DOI-USGS#273 / DOI-USGS#279 already raise on a mid-pagination 429, so
  this exception is the structured-recovery alternative (partial
  frames in hand) rather than a defense against silent truncation.
- Move chunker-only orphans from filters.py to chunking.py:
  ``_WATERDATA_URL_BYTE_LIMIT`` (the URL byte ceiling),
  ``_FetchOnce`` TypeVar, ``_combine_chunk_frames``, and
  ``_combine_chunk_responses``. filters.py was a leftover home from
  the pre-unification two-decorator stack; these helpers have no
  callers outside the chunker. Test ``test_multi_value_chunked_lazy_url_limit``
  now monkeypatches the constant on its new module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three test docstrings/comments still framed their reasoning against the
removed two-decorator stack (PR DOI-USGS#283 unified them). Reword to describe
the current joint-planner behavior on its own terms:

- ``test_plan_joint_fans_out_filter_when_list_alone_cannot_fit``: drop
  the "previous two-decorator design" aside.
- ``test_chunkable_params_skips_filter_passed_as_list``: rewrite the
  "inner filters.chunked is the only place that may shrink filter"
  line to point at ``_plan_joint``.
- ``stress_chunker._bail_floor_baseline``: reframe the baseline as
  "degenerate singleton plan" rather than "worst case the old
  two-decorator design produced."

No behavioral changes; prose only. Chunker tests + offline stress
test still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator Author

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix my comments

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +3 to +8
PR 233 routes most services through GET with comma-separated values
(e.g. ``monitoring_location_id=USGS-A,USGS-B,...``). Long lists and
long top-level-``OR`` CQL filters can independently blow the server's
~8 KB URL byte limit. This module adds a single decorator that plans
both chunking dimensions together and iterates the joint cartesian
product so each sub-request URL fits.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tighten this to summarize the correct design. Don't mention PR 233 or the old design.

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +90 to +95
# Default cap on the number of sub-requests a single chunked call may
# emit. The USGS Water Data API rate-limits each HTTP request
# (including pagination), so the true budget is
# ``hourly_quota / avg_pages_per_chunk``. 1000 matches the default
# hourly quota.
_DEFAULT_MAX_CHUNKS = 1000
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just set this as x-ratelimit-remaining by default

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +97 to +101
# When ``x-ratelimit-remaining`` drops below this between sub-requests,
# the chunker bails with ``QuotaExhausted`` rather than risk a mid-call
# HTTP 429. Carries the partial result so callers can resume from a
# known offset instead of retrying the whole chunked call from scratch.
_DEFAULT_QUOTA_SAFETY_FLOOR = 50
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after the retrieving the first chunk, just check x-ratelimit-remaining, and if the plan will not fit within our current rait limit, bail and return an Error message that the query would exceed our rate limit and report by how much.

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +134 to +144
class QuotaExhausted(RuntimeError):
"""Raised mid-chunked-call when the API's reported remaining quota
(``x-ratelimit-remaining`` header) drops below the configured safety
floor. The chunker stops before issuing the next sub-request and
surfaces the partial result so callers can resume after the hourly
window resets.

A bare 429 raised by ``_walk_pages`` would also abort the call but
discard the chunks completed so far; this exception is the
structured-recovery alternative, triggered pre-emptively while the
accumulated frames are still in hand.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revise this. This error should raise if we prematurely exhaust our quota because another processes is using it faster than predicted. A single process should never incounter this, because it raises a RequestExceedsQuota value error after checking the limit in the first chunk

Comment thread dataretrieval/waterdata/chunking.py Outdated
"""Decorator that splits multi-value list params and cql-text
filters across sub-requests so each sub-request URL fits
``url_limit`` bytes (defaults to ``_WATERDATA_URL_BYTE_LIMIT``)
and the joint cartesian-product plan stays ≤ ``max_chunks``
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again max_chunks should be replaced by the our current rate limit (which we know after receiving the first chunk)

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +560 to +564
Between sub-requests the wrapper reads ``x-ratelimit-remaining`` from
each response. If it drops below ``quota_safety_floor`` (default
``_DEFAULT_QUOTA_SAFETY_FLOOR``), the wrapper raises
``QuotaExhausted`` carrying the combined partial result and the
chunk offset so callers can resume after the hourly window resets.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also change. There is no quota_saftey_floor anymore. But raise the QuotaExhausted if we receive an HTTP error telling us the quota was exhuasted.

…mic rate-limit gate

Addresses PR DOI-USGS#283 review feedback. The static caps
(``_DEFAULT_MAX_CHUNKS=1000``, ``_DEFAULT_QUOTA_SAFETY_FLOOR=50``) and
the matching ``max_chunks`` / ``quota_safety_floor`` decorator
parameters are replaced by a quota check that runs after the first
sub-request, using the real ``x-ratelimit-remaining`` value rather
than a guessed cap.

Behavior:

- After the first sub-request the wrapper reads
  ``x-ratelimit-remaining``. If the rest of the plan won't fit in
  the current rate-limit window, it raises a new
  ``RequestExceedsQuota(ValueError)`` carrying ``planned_chunks``,
  ``available``, and ``deficit`` so the message reports exactly how
  far over budget the call is. The first chunk has already been
  issued; the wrapper stops there rather than burn the rest of the
  quota on a call that will fail mid-way.

- ``QuotaExhausted`` is now triggered only when an actual HTTP 429
  propagates from a sub-request (detected by walking ``__cause__``
  for ``RuntimeError("429: ...")``, the shape ``_raise_for_non_200``
  produces and ``_walk_pages`` wraps). A single-process caller
  should not normally see this — ``RequestExceedsQuota``
  short-circuits in chunk 1; arrival here implies a concurrent
  consumer drained the bucket faster than predicted. Carries the
  partial frame for resume. ``partial_response`` becomes ``None``
  when the 429 hits chunk 0 (no banked responses).

- A non-429 ``RuntimeError`` (e.g. 500) propagates unchanged so the
  real cause surfaces to the caller.

- When the server doesn't echo ``x-ratelimit-remaining``,
  ``_read_remaining`` returns ``_QUOTA_UNKNOWN``; the wrapper skips
  the post-first-chunk quota check (no signal → don't synthesize a
  block).

Planner: ``_plan_list_chunks`` / ``_plan_joint`` no longer carry a
``max_chunks`` cap. ``RequestTooLarge`` fires only when nothing more
can be split (the genuine URL-byte floor). The rate-limit gate
replaces the static cap.

Module docstring rewritten to summarize the current design (joint
planning + dynamic quota gate); historical PR 233 / two-decorator
references dropped.

Tests: ten obsolete cap/floor tests removed; eight new tests added
covering ``RequestExceedsQuota`` after chunk 0, deficit reporting,
the no-header skip path, mid-call 429 → ``QuotaExhausted`` with
partial frame, the first-chunk 429 (partial_response=None) edge
case, and non-429 ``RuntimeError`` pass-through.

``_fetch_once`` in ``utils.py`` calls the decorator with defaults
only, so no call-site changes are needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs
Copy link
Copy Markdown
Collaborator Author

Pushed 01e579e reworking the quota machinery per all the inline comments. Summary:

Module docstring (line 8 comment): tightened. No more PR 233 / two-decorator references — it just describes the current design (joint planner + dynamic quota gate).

_DEFAULT_MAX_CHUNKS / _DEFAULT_QUOTA_SAFETY_FLOOR (line 95, 101 comments): both deleted. After the first sub-request the wrapper reads x-ratelimit-remaining directly; if the remaining plan won't fit in the current window, it raises a new RequestExceedsQuota(ValueError) carrying planned_chunks, available, and deficit so the message reports exactly how far over budget the call is.

QuotaExhausted (line 144 comment): rewritten. It now fires only when an actual HTTP 429 propagates from a sub-request (detected by walking __cause__ for RuntimeError("429: ..."), the shape _raise_for_non_200 produces). The docstring states explicitly that a single-process caller should not normally see this — RequestExceedsQuota short-circuits in chunk 1; arrival here implies a concurrent consumer drained the bucket faster than predicted. Carries the partial frame for resume. partial_response is None when the 429 hits chunk 0.

max_chunks / quota_safety_floor decorator params (line 552, 564 comments): removed. _plan_list_chunks and _plan_joint no longer carry a max_chunks cap; RequestTooLarge fires only on the genuine "nothing left to split" floor. The rate-limit gate replaces the static cap.

Side effects: _fetch_once in utils.py already called the decorator with defaults only, so no call-site changes were needed. Ten obsolete cap/floor tests were removed and eight new ones added covering RequestExceedsQuota after chunk 0, deficit reporting, the no-header skip path, mid-call 429 → QuotaExhausted with partial frame, the first-chunk 429 (partial_response=None) edge case, and non-429 RuntimeError pass-through. All chunker tests + offline stress test pass.

thodson-usgs and others added 2 commits May 18, 2026 19:53
…st.py

The chunker section had grown to 30 tests + a fake-builder harness
and shared ``_quota_response`` helper — a self-contained subsystem
sharing a file with live-API getter tests that have nothing in
common with it. The live-API file carries a ``pytest.mark.flaky``
rerun marker tuned for transient upstream errors; the chunker tests
are pure unit tests against a fake builder and shouldn't be subject
to that retry logic at all.

- Move the 30 chunker tests, ``_FakeReq`` / ``_fake_build`` /
  ``_quota_response`` helpers, and the section comment that
  introduces the fake-builder model into the new file.
- The new file imports only what it needs (``_chunking``, the
  chunking exports, and ``_construct_api_requests`` for the one
  real-builder URL-construction test).
- Drop now-unused imports from ``waterdata_test.py``
  (``itertools``, ``math``, ``quote_plus``).
- No content changes to the tests themselves; this is a relocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d by unit tests

The stress test exercised the joint planner across eight scenarios
against a fake URL builder, checking URL-bytes-within-limit,
exhaustive cartesian-product coverage, and a beats-the-bail-floor
quality bar. Every one of those properties is now asserted by
``tests/waterdata_chunking_test.py``, including
``test_joint_planner_url_construction_long_filter_and_long_sites``
which runs the real ``_construct_api_requests`` builder against a
500-site × 20-clause query and verifies each invariant explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs and others added 3 commits May 18, 2026 20:14
…op sentinels

Pulls cleanup items out of /simplify's review:

- **Replace string-sniffing 429 detection with a typed exception.**
  ``_raise_for_non_200`` now raises ``utils.RateLimited`` (a
  ``RuntimeError`` subclass) for HTTP 429 specifically, plain
  ``RuntimeError`` for everything else. The chunker's ``_is_429``
  walks ``__cause__`` looking for ``isinstance(_, RateLimited)``
  instead of ``str(cur).startswith("429")``; ``_paginated_failure_message``
  similarly switches to ``isinstance``. Any future reformatting of
  the 429 message can no longer silently break quota handling.

- **Drop the ``_QUOTA_UNKNOWN = -1`` sentinel.** ``_read_remaining``
  now returns ``int | None``; the wrapper branches on
  ``remaining is not None`` instead of comparing against a magic
  number.

- **Collapse ``_finalize_response`` into ``_combine_chunk_responses``.**
  The two were a 3-line wrapper around a 6-line helper; merging
  them removes one call and the indirection. The combiner now takes
  ``canonical_url`` and sets ``.url`` directly.

- **Simplify ``_FetchOnce``** from a constrained ``TypeVar`` to a
  plain ``Callable`` alias. The TypeVar required ``# type: ignore``
  at the return site anyway and bought no callsite type safety.

- Update the ``waterdata_test.py`` flaky-rerun regex to match the
  new ``RateLimited:`` prefix as well as ``RuntimeError:``.

Items considered and skipped:

- Planning-phase efficiency findings (redundant ``build_request``
  probes, ``_filter_candidates`` joining discarded groups, dict-copy
  cost in ``_iter_sub_args``) — all cold-path next to the actual HTTP
  round-trips that follow. Premature optimization.
- Unifying ``_combine_chunk_responses`` with
  ``_finalize_paginated_response`` — they take different inputs
  (one accumulates wall-clock externally, the other builds it from
  the response list); unification would be churn.
- Test docstring trim — separate pass.

All 142 chunker / filter / utils unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hrough into the loop

The decorator wrapper was carrying three concerns: bookkeeping for
the planning result tuple, the mid-loop 429 → ``QuotaExhausted``
translation, and the post-first-chunk ``RequestExceedsQuota`` check.
That spilled out as four locals (``list_plan``, ``filter_chunks``,
``canonical_url``, ``total``) and ~30 lines of orchestration, with
an early-return branch that handled "no chunking needed" differently
from "chunking needed."

Extract two classes:

- ``ChunkPlan`` (frozen dataclass) — the precomputed strategy.
  ``ChunkPlan.from_args(args, build_request, url_limit)`` *always*
  returns a plan. Passthrough requests (no chunking needed) are
  represented as a trivial plan with ``list_chunks={}``,
  ``filter_chunks=[None]``, ``total=1``; ``iter_sub_args`` yields the
  original args unchanged. Owns ``.total`` (property),
  ``.iter_sub_args()``, ``.execute(fetch_once)``, and a ``from_args``
  classmethod that subsumes the old ``_plan_joint``.

- ``_ChunkExecution`` — in-flight execution state. Holds the plan,
  accumulates frames and responses, owns the 429 translation
  (``issue`` catches and re-raises as ``QuotaExhausted``) and the
  post-first-chunk quota check. Exposes ``run(fetch_once)`` which
  drives the whole loop.

The wrapper collapses to two lines of meat:

    limit = _WATERDATA_URL_BYTE_LIMIT if url_limit is None else url_limit
    return ChunkPlan.from_args(args, build_request, limit).execute(fetch_once)

No more early-return branch — the loop is the same shape whether
chunking was needed or not. The trade-off: passthrough requests that
hit 429 now surface as ``QuotaExhausted(completed_chunks=0,
total_chunks=1, partial_response=None)`` rather than the bare
``RateLimited``. The original ``RateLimited`` is on ``__cause__``;
the ``QuotaExhausted`` docstring and message are updated to call out
both the multi-chunk and single-shot cases honestly.

Removed: ``_plan_joint``, ``_plan_total``, ``_iter_sub_args`` (their
roles are now methods on ``ChunkPlan``).

Tests: ``_plan_joint_*`` tests renamed to ``chunk_plan_*`` and
updated to call ``ChunkPlan.from_args`` and check ``.list_chunks`` /
``.filter_chunks`` / ``.total`` attributes; three new passthrough
tests added covering the trivial-plan shape and ``iter_sub_args``
yielding original args. All 145 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…assthrough hot path

Aggregated and applied the meaningful items from the review:

- **Trivial-passthrough skips ``build_request`` entirely.** Previously
  ``ChunkPlan.from_args`` called ``build_request(**args)`` up front to
  capture ``canonical_url`` and to size the request, even when there
  was nothing to chunk. Reorder so the "no multi-value lists, no
  top-level-OR filter" check runs first; on that path the plan is
  built with ``canonical_url=None`` and no request preparation. The
  ~20-80 µs ``Request.prepare()`` overhead is removed from the
  dominant Water Data call shape. ``_combine_chunk_responses`` now
  treats ``canonical_url=None`` as "skip the override" — fine because
  ``_walk_pages`` already pinned the response's ``.url`` to the
  canonical request URL.

- **``iter_sub_args`` short-circuits the trivial-passthrough case** —
  yields ``self.args`` directly instead of allocating a dict copy and
  spinning through an empty cartesian product.

- **``_ChunkExecution`` now owns ``fetch_once``** instead of receiving
  it per-call on ``issue()``. ``fetch_once`` is constant across the
  loop, so threading it through every call was needless. ``issue(sub_args)``
  and ``run()`` are now zero- and one-arg respectively. Converted
  from ``@dataclass`` to a plain class (the auto-generated repr/eq
  weren't earning their keep). The ``completed`` property was inlined
  to its one remaining caller as ``len(self.responses)``.

- **Hoist ``_FILTER_KEY = "filter"``** so the planner and
  ``iter_sub_args`` substitute on the same constant, matching the
  existing ``_LIST_SEP``/``_OR_SEP``/``_QUOTA_HEADER`` convention.

- **``utils._next_req_url``** now references ``chunking._QUOTA_HEADER``
  instead of repeating the ``"x-ratelimit-remaining"`` literal.

- Stale ``_NEVER_CHUNK`` comment that pointed at the removed
  ``_plan_joint`` now points at ``ChunkPlan.from_args``.

Items considered and skipped:

- ``ChunkPlan.canonical_url`` derivable from ``args`` — keeping it
  avoids the extra ``build_request`` call on every ``finalize``.
- ``_plan_list_chunks`` dual-meaning ``None`` return — fixing it
  would touch unrelated callers; the current ``continue`` guard is
  clearly commented.
- ``args: dict`` mutability on the frozen dataclass — internal use
  only; ``MappingProxyType`` adds churn without value.
- ``ChunkPlan.from_args`` length / search-loop extraction — the
  search loop reads well in place; pulling it out would only push
  state through a helper signature.
- ``_count_subrequests`` helper to DRY the ``list_count * len(...)``
  math — used in two adjacent places; not worth a helper.

All 145 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator Author

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix these things

Comment on lines +12 to +18
Planning: for a filter with ``n_clauses`` top-level OR clauses, try
candidate filter chunk counts ``k = 1, 2, 4, ..., n_clauses``. For
each, partition clauses into ``k`` count-balanced groups joined by
``OR``, take the longest (URL-encoded) group as the worst-case filter,
then plan list-dim chunking by greedy halving against the remaining
budget. Keep the candidate with the smallest ``list_count × k``.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds like filter chunking is given priority over list-dim chunking. Shouldn't the filter be treated as any other dim?

Comment on lines +439 to +440
@classmethod
def from_args(
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be cleaner to refactor this to a class.init

return head


class _ChunkExecution:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we call this QueryExecutor? Really a class should be a noun. So QueryOrchestrator or ChunkManager or ChunkExecutor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant