fix(metrics): use RecordHistogramValue for task attempt count histograms#7781
fix(metrics): use RecordHistogramValue for task attempt count histograms#7781zawadzkidiana wants to merge 3 commits intocadence-workflow:masterfrom
Conversation
Switch ExponentialTaskAttemptCounts and ExponentialTaskAttemptCountsPerDomain from intExponentialBuckets + IntExponentialHistogram to ResponseRowSizeBuckets + RecordHistogramValue, aligning with the pattern used by ReplicationTasksFetchedSize. Signed-off-by: Diana Zawadzki <dzawa@live.de> Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>
Signed-off-by: Diana Zawadzki <dzawa@live.de> Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>
8ea0126 to
bf00743
Compare
common/metrics/defs.go
Outdated
| ExponentialTaskLatency: {metricName: "task_latency_ns", metricType: Histogram, exponentialBuckets: Low1ms100s}, | ||
| TaskAttemptTimer: {metricName: "task_attempt", metricType: Timer}, | ||
| ExponentialTaskAttemptCounts: {metricName: "task_attempt_counts", metricType: Histogram, intExponentialBuckets: Mid1To16k}, | ||
| ExponentialTaskAttemptCounts: {metricName: "task_attempt_counts", metricType: Histogram, buckets: ResponseRowSizeBuckets}, |
There was a problem hiding this comment.
Do you mind renaming it to TaskAttemptCounts?
Drop the "Exponential" prefix from TaskAttemptCounts and TaskAttemptCountsPerDomain per reviewer feedback, since these now use standard buckets instead of exponential buckets. Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>
|
The PR description needs a couple of improvements: Details[How did you test it?] — Please replace the vague statement with specific, copy-pasteable test commands so reviewers can reproduce the test steps. For example: [Why?] — Please explain what was functionally wrong or incorrect with the previous |
| ExponentialTaskLatency: {metricName: "task_latency_ns", metricType: Histogram, exponentialBuckets: Low1ms100s}, | ||
| TaskAttemptTimer: {metricName: "task_attempt", metricType: Timer}, | ||
| ExponentialTaskAttemptCounts: {metricName: "task_attempt_counts", metricType: Histogram, intExponentialBuckets: Mid1To16k}, | ||
| TaskAttemptCounts: {metricName: "task_attempt_counts", metricType: Histogram, buckets: ResponseRowSizeBuckets}, |
There was a problem hiding this comment.
⚠️ Bug: ResponseRowSizeBuckets is a poor fit for task attempt counts
ResponseRowSizeBuckets has boundaries at {0, 1, 2, 4, 8, 16, 32, 64, ..., 65536} — power-of-2 steps designed for response payload sizes. For task attempt counts, which are typically small integers (1–20 for most tasks), these buckets lose significant resolution: you can't distinguish 2 vs 3 attempts, or 5 vs 7 attempts. The old Mid1To16k had 4 sub-buckets per octave giving much finer granularity.
Consider defining dedicated buckets for attempt counts with better low-end resolution, e.g. linear buckets from 1–10 then exponential beyond that. Alternatively, use tally.MustMakeLinearValueBuckets(1, 1, 20) (buckets at 1,2,3,...,20) appended with a few exponential buckets for the tail, which would capture the full distribution of attempt counts where it matters most.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
Code Review
|
| Auto-apply | Compact |
|
|
Was this helpful? React with 👍 / 👎 | Gitar
What changed
Switch
ExponentialTaskAttemptCountsandExponentialTaskAttemptCountsPerDomainhistogram definitions fromintExponentialBuckets: Mid1To16k+IntExponentialHistogramtobuckets: ResponseRowSizeBuckets+RecordHistogramValue, aligning with the pattern used byReplicationTasksFetchedSize.Why
The previous implementation used
IntExponentialHistogramwithintExponentialBuckets, but the correct pattern in this codebase for recording histogram values is to useRecordHistogramValuewith standardbuckets. This aligns the task attempt count metrics with the established pattern used by other histogram metrics.How did you test it
Existing unit tests cover the task attempt recording paths. The change is a straightforward API alignment with no behavioral change in the recorded values.
Potential risks
Low. The bucket boundaries change from
Mid1To16k(integer exponential buckets) toResponseRowSizeBuckets. Existing dashboards or alerts that rely on the old bucket boundaries may need updating.Release notes
No user-facing changes. Internal metrics instrumentation improvement.
Documentation changes
None required.