
feat(responseobs): add threshold-gated large-response counter #243

Open
adamyeats wants to merge 1 commit into feat/responseobs from feat/responseobs-counter

@adamyeats

Summary

Adds a threshold-gated counter, plugins_sql_large_responses_total, inside the responseobs subpackage introduced by #242. The counter increments once per Observation that crosses a configured threshold — i.e. at the same decision point that fires the structured warn log in #242.

Stacked on #242 — base branch is feat/responseobs, not main. Review #242 first; the diff here shows only the counter additions.

Shape

plugins_sql_large_responses_total counter{datasource_type, app_url, datasource_uid}

Cardinality note for reviewers

This is the part most likely to draw a reflexive reject, so calling it out:

  • Increments happen only when a threshold is crossed (default 50 MiB OR 1M rows, per #242). Normal-sized responses produce no new series.
  • Steady-state series count ≈ (stacks with abusive queries) × (avg large-datasource-instances per stack). Order of magnitude: tens to hundreds, not tens of thousands.
  • Contrast with the histograms in #241, which are {datasource_type} only. Putting app_url/datasource_uid on the histograms would have blown Mimir limits; putting them on this threshold-gated counter is the intended trade: per-stack identification for alerting, with the gate preventing unbounded growth.
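To make the cardinality argument concrete, here is a minimal stdlib-only sketch of the gate. It is a model, not the code in this PR: the `observation` struct, `gatedCounter`, and `crosses` are hypothetical stand-ins, and whether the real comparison is strict (`>`) or inclusive (`>=`) is an assumption. Only the default thresholds (50 MiB, 1M rows) and the three label names come from the description above.

```go
package main

import "fmt"

// observation is a hypothetical stand-in for the responseobs Observation
// type from #242; the real field names are not shown in this description.
type observation struct {
	DatasourceType string
	AppURL         string
	DatasourceUID  string
	Bytes          int64
	Rows           int64
}

// Default thresholds quoted in the PR description: 50 MiB OR 1M rows.
const (
	maxBytes = 50 << 20
	maxRows  = 1_000_000
)

// labelKey is one label combination, i.e. one time series.
type labelKey struct{ dsType, appURL, uid string }

// gatedCounter models a labelled counter: a series only exists
// once it has been incremented at least once.
type gatedCounter struct {
	series map[labelKey]uint64
}

func newGatedCounter() *gatedCounter {
	return &gatedCounter{series: map[labelKey]uint64{}}
}

// crosses reports whether either threshold is exceeded.
// Assumption: strict comparison; the real gate may use >=.
func crosses(o observation) bool {
	return o.Bytes > maxBytes || o.Rows > maxRows
}

// observe increments the counter only for observations that cross a threshold.
func (c *gatedCounter) observe(o observation) {
	if !crosses(o) {
		return // below both thresholds: no increment, so no new series
	}
	c.series[labelKey{o.DatasourceType, o.AppURL, o.DatasourceUID}]++
}

func main() {
	c := newGatedCounter()

	// 1000 normal-sized responses from 1000 distinct datasources.
	for i := 0; i < 1000; i++ {
		c.observe(observation{
			DatasourceType: "mysql",
			AppURL:         "https://a.example.com/",
			DatasourceUID:  fmt.Sprintf("uid-%d", i),
			Bytes:          1 << 20,
			Rows:           100,
		})
	}

	// Two oversized responses from a single datasource.
	big := observation{
		DatasourceType: "mysql",
		AppURL:         "https://a.example.com/",
		DatasourceUID:  "uid-big",
		Bytes:          64 << 20,
		Rows:           10,
	}
	c.observe(big)
	c.observe(big)

	fmt.Println(len(c.series)) // prints 1
}
```

In this model the series count tracks the number of distinct datasource instances that have ever crossed a threshold, not the number of datasources being queried.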

If cardinality does prove higher than estimated in production, a Prometheus relabel drop on app_url is the immediate mitigation — documented here so oncall doesn't have to rediscover it.

Label choices

  • app_url replaces slug. backend.GrafanaConfig exposes AppURL() but no dedicated slug accessor. The #242 log field uses app_url for the same reason, which keeps labels consistent between log and counter. If anyone knows a reliable slug source I missed, happy to switch.
  • datasource_uid included (not on histograms) — the counter is where operators drill into a specific abusive datasource instance, so the UID is load-bearing. The threshold gate makes the cardinality cost acceptable.
  • No datasource_name — would require label sanitization (names can have spaces/special chars). UID is sufficient for identification.
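For consumers who still want a slug-like value, one option is to derive it from app_url downstream. The sketch below is an illustration under a loud assumption: that app_url has the shape `https://<slug>.<domain>/`, which this PR does not confirm. `slugFromAppURL` is a hypothetical helper, not part of responseobs.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// slugFromAppURL returns the first DNS label of the URL's host.
// Assumption: the slug is the leftmost host label, as in
// https://<slug>.<domain>/. IP-address hosts and custom domains
// will not yield a meaningful slug.
func slugFromAppURL(appURL string) (string, error) {
	u, err := url.Parse(appURL)
	if err != nil {
		return "", fmt.Errorf("parse app_url: %w", err)
	}
	host := u.Hostname() // strips any port
	if host == "" {
		return "", fmt.Errorf("app_url %q has no host", appURL)
	}
	label, _, _ := strings.Cut(host, ".")
	return label, nil
}

func main() {
	slug, err := slugFromAppURL("https://mystack.example.com/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(slug) // prints mystack
}
```

Doing this on the consumer side keeps the emitted label set unchanged while still allowing per-stack grouping where the URL shape is known.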

Integration point

One line in Observe:

largeResponsesCounter.WithLabelValues(obs.Datasource.Type, appURL, obs.Datasource.UID).Inc()

Placed right after the backend.Logger.Warn call. No caller-side changes needed — every consumer of responseobs.Observe picks up the counter automatically.

Suggested alert (for downstream consumers)

Not shipping alert rules — the consuming team owns that. Suggested shape:

sum by (datasource_type, app_url, datasource_uid) (
  rate(plugins_sql_large_responses_total[15m])
) > 0

i.e. "any SQL datasource producing large responses in the last 15m". Tune as needed.

Introduces plugins_sql_large_responses_total counter, incremented once
per Observation that crosses a configured threshold. Cardinality is
self-limiting because increments only happen on crossings.

Labels: datasource_type, app_url, datasource_uid. app_url replaces the
earlier "slug" label because backend.GrafanaConfig exposes no dedicated
slug accessor; operators can derive a slug by parsing the URL.
@adamyeats requested a review from a team as a code owner on April 22, 2026 12:24