Skip to content

fix(metrics): use RecordHistogramValue for task attempt count histograms#7781

Open
zawadzkidiana wants to merge 3 commits intocadence-workflow:masterfrom
zawadzkidiana:diana/histogram-task-attempt-counts
Open

fix(metrics): use RecordHistogramValue for task attempt count histograms#7781
zawadzkidiana wants to merge 3 commits intocadence-workflow:masterfrom
zawadzkidiana:diana/histogram-task-attempt-counts

Conversation

@zawadzkidiana
Copy link
Contributor

@zawadzkidiana zawadzkidiana commented Mar 3, 2026

What changed

Switch ExponentialTaskAttemptCounts and ExponentialTaskAttemptCountsPerDomain histogram definitions from intExponentialBuckets: Mid1To16k + IntExponentialHistogram to buckets: ResponseRowSizeBuckets + RecordHistogramValue, aligning with the pattern used by ReplicationTasksFetchedSize.

Why

The previous implementation used IntExponentialHistogram with intExponentialBuckets, but the correct pattern in this codebase for recording histogram values is to use RecordHistogramValue with standard buckets. This aligns the task attempt count metrics with the established pattern used by other histogram metrics.

How did you test it

Existing unit tests cover the task attempt recording paths. The change is a straightforward API alignment with no behavioral change in the recorded values.

Potential risks

Low. The bucket boundaries change from Mid1To16k (integer exponential buckets) to ResponseRowSizeBuckets. Existing dashboards or alerts that rely on the old bucket boundaries may need updating.

Release notes

No user-facing changes. Internal metrics instrumentation improvement.

Documentation changes

None required.

Switch ExponentialTaskAttemptCounts and ExponentialTaskAttemptCountsPerDomain
from intExponentialBuckets + IntExponentialHistogram to ResponseRowSizeBuckets
+ RecordHistogramValue, aligning with the pattern used by ReplicationTasksFetchedSize.

Signed-off-by: Diana Zawadzki <dzawa@live.de>
Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>
Signed-off-by: Diana Zawadzki <dzawa@live.de>
Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>
@zawadzkidiana zawadzkidiana force-pushed the diana/histogram-task-attempt-counts branch from 8ea0126 to bf00743 Compare March 3, 2026 23:38
ExponentialTaskLatency: {metricName: "task_latency_ns", metricType: Histogram, exponentialBuckets: Low1ms100s},
TaskAttemptTimer: {metricName: "task_attempt", metricType: Timer},
ExponentialTaskAttemptCounts: {metricName: "task_attempt_counts", metricType: Histogram, intExponentialBuckets: Mid1To16k},
ExponentialTaskAttemptCounts: {metricName: "task_attempt_counts", metricType: Histogram, buckets: ResponseRowSizeBuckets},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind renaming it to TaskAttemptCounts?

Drop the "Exponential" prefix from TaskAttemptCounts and
TaskAttemptCountsPerDomain per reviewer feedback, since these
now use standard buckets instead of exponential buckets.

Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>
@gitar-bot
Copy link

gitar-bot bot commented Mar 16, 2026

The PR description needs a couple of improvements:

Details

[How did you test it?] — Please replace the vague statement with specific, copy-pasteable test commands so reviewers can reproduce the test steps. For example:

go test -v ./service/history/task/... -run TestTaskAttempt

[Why?] — Please explain what was functionally wrong or incorrect with the previous IntExponentialHistogram + intExponentialBuckets: Mid1To16k approach, and why ResponseRowSizeBuckets is an appropriate bucket set for task attempt counts (or acknowledge it as a known trade-off if that's the case).

ExponentialTaskLatency: {metricName: "task_latency_ns", metricType: Histogram, exponentialBuckets: Low1ms100s},
TaskAttemptTimer: {metricName: "task_attempt", metricType: Timer},
ExponentialTaskAttemptCounts: {metricName: "task_attempt_counts", metricType: Histogram, intExponentialBuckets: Mid1To16k},
TaskAttemptCounts: {metricName: "task_attempt_counts", metricType: Histogram, buckets: ResponseRowSizeBuckets},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: ResponseRowSizeBuckets is a poor fit for task attempt counts

ResponseRowSizeBuckets has boundaries at {0, 1, 2, 4, 8, 16, 32, 64, ..., 65536} — power-of-2 steps designed for response payload sizes. For task attempt counts, which are typically small integers (1–20 for most tasks), these buckets lose significant resolution: you can't distinguish 2 vs 3 attempts, or 5 vs 7 attempts. The old Mid1To16k had 4 sub-buckets per octave giving much finer granularity.

Consider defining dedicated buckets for attempt counts with better low-end resolution, e.g. linear buckets from 1–10 then exponential beyond that. Alternatively, use tally.MustMakeLinearValueBuckets(1, 1, 20) (buckets at 1,2,3,...,20) appended with a few exponential buckets for the tail, which would capture the full distribution of attempt counts where it matters most.

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link

gitar-bot bot commented Mar 16, 2026

Code Review ⚠️ Changes requested 0 resolved / 1 findings

Switches task attempt count histograms to use RecordHistogramValue, but ResponseRowSizeBuckets is a poor fit for this metric—its boundaries {0, 1, 2, 4, 8, 16, 32, 64...} don't align with typical task attempt distributions.

⚠️ Bug: ResponseRowSizeBuckets is a poor fit for task attempt counts

📄 common/metrics/defs.go:3336 📄 common/metrics/defs.go:3359

ResponseRowSizeBuckets has boundaries at {0, 1, 2, 4, 8, 16, 32, 64, ..., 65536} — power-of-2 steps designed for response payload sizes. For task attempt counts, which are typically small integers (1–20 for most tasks), these buckets lose significant resolution: you can't distinguish 2 vs 3 attempts, or 5 vs 7 attempts. The old Mid1To16k had 4 sub-buckets per octave giving much finer granularity.

Consider defining dedicated buckets for attempt counts with better low-end resolution, e.g. linear buckets from 1–10 then exponential beyond that. Alternatively, use tally.MustMakeLinearValueBuckets(1, 1, 20) (buckets at 1,2,3,...,20) appended with a few exponential buckets for the tail, which would capture the full distribution of attempt counts where it matters most.

🤖 Prompt for agents
Code Review: Switches task attempt count histograms to use RecordHistogramValue, but ResponseRowSizeBuckets is a poor fit for this metric—its boundaries {0, 1, 2, 4, 8, 16, 32, 64...} don't align with typical task attempt distributions.

1. ⚠️ Bug: ResponseRowSizeBuckets is a poor fit for task attempt counts
   Files: common/metrics/defs.go:3336, common/metrics/defs.go:3359

   `ResponseRowSizeBuckets` has boundaries at `{0, 1, 2, 4, 8, 16, 32, 64, ..., 65536}` — power-of-2 steps designed for response payload sizes. For task attempt counts, which are typically small integers (1–20 for most tasks), these buckets lose significant resolution: you can't distinguish 2 vs 3 attempts, or 5 vs 7 attempts. The old `Mid1To16k` had 4 sub-buckets per octave giving much finer granularity.
   
   Consider defining dedicated buckets for attempt counts with better low-end resolution, e.g. linear buckets from 1–10 then exponential beyond that. Alternatively, use `tally.MustMakeLinearValueBuckets(1, 1, 20)` (buckets at 1,2,3,...,20) appended with a few exponential buckets for the tail, which would capture the full distribution of attempt counts where it matters most.

Rules ❌ No requirements met

Repository Rules

PR Description Quality Standards: The [How did you test it?] section lacks specific, copy-pasteable test commands, and the [Why?] section does not explain what was functionally wrong with the previous IntExponentialHistogram approach or why ResponseRowSizeBuckets is appropriate for task attempt counts.

1 rule not applicable. Show all rules by commenting gitar display:verbose.

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants