feat(responseobs): add threshold-gated large-response counter #243
Open
adamyeats wants to merge 1 commit into feat/responseobs from
Introduces the plugins_sql_large_responses_total counter, incremented once per Observation that crosses a configured threshold. Cardinality is self-limiting because increments only happen on threshold crossings. Labels: datasource_type, app_url, datasource_uid. app_url replaces the earlier "slug" label because backend.GrafanaConfig exposes no dedicated slug accessor; operators can derive a slug by parsing the URL.
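For operators who want the slug back, here is a minimal Go sketch of deriving it from an app_url label value. It assumes a hosted-stack URL shape like https://<slug>.grafana.net/, where the slug is the first DNS label of the host. That assumption and the helper name slugFromAppURL are hypothetical, and the heuristic won't hold for self-hosted or path-prefixed URLs.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// slugFromAppURL is a hypothetical helper. It ASSUMES the stack slug is the
// first DNS label of the app URL host (e.g. https://mystack.grafana.net/).
// That is a naming convention, not something backend.GrafanaConfig guarantees.
func slugFromAppURL(appURL string) (string, error) {
	u, err := url.Parse(appURL)
	if err != nil {
		return "", fmt.Errorf("parse app_url %q: %w", appURL, err)
	}
	host := u.Hostname() // drops any :port suffix
	if host == "" {
		return "", fmt.Errorf("app_url %q has no host", appURL)
	}
	// Everything before the first dot; a dotless host such as "localhost" comes back unchanged.
	slug, _, _ := strings.Cut(host, ".")
	return slug, nil
}

func main() {
	for _, raw := range []string{
		"https://mystack.grafana.net/",
		"http://localhost:3000/",
		"/no/host",
	} {
		slug, err := slugFromAppURL(raw)
		fmt.Printf("%q -> %q (err: %v)\n", raw, slug, err)
	}
}
```

Treat the result as best-effort: for an IP-address host, the first octet would come back as the "slug".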
Summary
Adds a threshold-gated counter, plugins_sql_large_responses_total, inside the responseobs subpackage introduced by #242. The counter increments once per Observation that crosses a configured threshold, i.e. at the same decision point that fires the structured warn log in #242.
Stacked on #242: the base branch is feat/responseobs, not main. Review #242 first; the diff here shows only the counter additions.
Shape
Cardinality note for reviewers
This is the part most likely to draw a reflexive reject, so calling it out:
The histograms carry {datasource_type} only. Putting app_url/uid on the histograms would have blown Mimir limits; putting them on this threshold-gated counter is the intended trade: per-stack identification for alerting, while the gate prevents unbounded growth.
If cardinality does prove higher than estimated in production, a Prometheus relabel drop on app_url is the immediate mitigation. Documenting it here so oncall doesn't have to rediscover it.
Label choices
app_url replaces slug. backend.GrafanaConfig exposes AppURL() but no dedicated slug accessor. The #242 log field uses app_url for the same reason, keeping labels consistent between log and counter. If anyone knows a reliable slug source I missed, happy to switch.
datasource_uid is included (it is not on the histograms). The counter is where operators drill into a specific abusive datasource instance, so the UID is load-bearing; the threshold gate makes the cardinality cost acceptable.
datasource_name is not included: it would require label sanitization (names can contain spaces and special characters), and the UID is sufficient for identification.
Integration point
The change is one line in Observe, placed right after the backend.Logger.Warn call. No caller-side changes are needed: every consumer of responseobs.Observe picks up the counter automatically.
Suggested alert (for downstream consumers)
Not shipping alert rules — the consuming team owns that. Suggested shape:
That is, fire when any SQL datasource has produced large responses in the last 15m. Tune as needed.