fix(metrics): use RecordHistogramValue for task attempt count histograms by zawadzkidiana · Pull Request #7781 · cadence-workflow/cadence

zawadzkidiana · 2026-03-03T23:00:31Z

What changed

Switch ExponentialTaskAttemptCounts and ExponentialTaskAttemptCountsPerDomain histogram definitions from intExponentialBuckets: Mid1To16k + IntExponentialHistogram to buckets: ResponseRowSizeBuckets + RecordHistogramValue, aligning with the pattern used by ReplicationTasksFetchedSize.

Why

The previous implementation used IntExponentialHistogram with intExponentialBuckets, but the correct pattern in this codebase for recording histogram values is to use RecordHistogramValue with standard buckets. This aligns the task attempt count metrics with the established pattern used by other histogram metrics.

How did you test it

Existing unit tests cover the task attempt recording paths. The change is a straightforward API alignment with no behavioral change in the recorded values.

Potential risks

Low. The bucket boundaries change from Mid1To16k (integer exponential buckets) to ResponseRowSizeBuckets. Existing dashboards or alerts that rely on the old bucket boundaries may need updating.

Release notes

No user-facing changes. Internal metrics instrumentation improvement.

Documentation changes

None required.

Switch ExponentialTaskAttemptCounts and ExponentialTaskAttemptCountsPerDomain from intExponentialBuckets + IntExponentialHistogram to ResponseRowSizeBuckets + RecordHistogramValue, aligning with the pattern used by ReplicationTasksFetchedSize. Signed-off-by: Diana Zawadzki <dzawa@live.de> Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>

Signed-off-by: Diana Zawadzki <dzawa@live.de> Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>

Shaddoll · 2026-03-14T00:04:28Z

common/metrics/defs.go

 		ExponentialTaskLatency:                        {metricName: "task_latency_ns", metricType: Histogram, exponentialBuckets: Low1ms100s},
 		TaskAttemptTimer:                              {metricName: "task_attempt", metricType: Timer},
-		ExponentialTaskAttemptCounts:                  {metricName: "task_attempt_counts", metricType: Histogram, intExponentialBuckets: Mid1To16k},
+		ExponentialTaskAttemptCounts:                  {metricName: "task_attempt_counts", metricType: Histogram, buckets: ResponseRowSizeBuckets},


Do you mind renaming it to TaskAttemptCounts?

Drop the "Exponential" prefix from TaskAttemptCounts and TaskAttemptCountsPerDomain per reviewer feedback, since these now use standard buckets instead of exponential buckets. Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>

gitar-bot · 2026-03-16T16:58:43Z

The PR description needs a couple of improvements:

Details

[How did you test it?] — Please replace the vague statement with specific, copy-pasteable test commands so reviewers can reproduce the test steps. For example:

go test -v ./service/history/task/... -run TestTaskAttempt

[Why?] — Please explain what was functionally wrong or incorrect with the previous IntExponentialHistogram + intExponentialBuckets: Mid1To16k approach, and why ResponseRowSizeBuckets is an appropriate bucket set for task attempt counts (or acknowledge it as a known trade-off if that's the case).

gitar-bot · 2026-03-16T16:59:42Z

common/metrics/defs.go

 		ExponentialTaskLatency:                        {metricName: "task_latency_ns", metricType: Histogram, exponentialBuckets: Low1ms100s},
 		TaskAttemptTimer:                              {metricName: "task_attempt", metricType: Timer},
-		ExponentialTaskAttemptCounts:                  {metricName: "task_attempt_counts", metricType: Histogram, intExponentialBuckets: Mid1To16k},
+		TaskAttemptCounts:                            {metricName: "task_attempt_counts", metricType: Histogram, buckets: ResponseRowSizeBuckets},


⚠️ Bug: ResponseRowSizeBuckets is a poor fit for task attempt counts

ResponseRowSizeBuckets has boundaries at {0, 1, 2, 4, 8, 16, 32, 64, ..., 65536} — power-of-2 steps designed for response payload sizes. For task attempt counts, which are typically small integers (1–20 for most tasks), these buckets lose significant resolution: you can't distinguish 2 vs 3 attempts, or 5 vs 7 attempts. The old Mid1To16k had 4 sub-buckets per octave giving much finer granularity.

Consider defining dedicated buckets for attempt counts with better low-end resolution, e.g. linear buckets from 1–10 then exponential beyond that. Alternatively, use tally.MustMakeLinearValueBuckets(1, 1, 20) (buckets at 1,2,3,...,20) appended with a few exponential buckets for the tail, which would capture the full distribution of attempt counts where it matters most.

_{Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion}

gitar-bot · 2026-03-16T16:59:52Z

Code Review ⚠️ Changes requested 0 resolved / 1 findings

Switches task attempt count histograms to use RecordHistogramValue, but ResponseRowSizeBuckets is a poor fit for this metric—its boundaries {0, 1, 2, 4, 8, 16, 32, 64...} don't align with typical task attempt distributions.

⚠️

Bug: ResponseRowSizeBuckets is a poor fit for task attempt counts

📄 common/metrics/defs.go:3336 📄 common/metrics/defs.go:3359

ResponseRowSizeBuckets has boundaries at {0, 1, 2, 4, 8, 16, 32, 64, ..., 65536} — power-of-2 steps designed for response payload sizes. For task attempt counts, which are typically small integers (1–20 for most tasks), these buckets lose significant resolution: you can't distinguish 2 vs 3 attempts, or 5 vs 7 attempts. The old Mid1To16k had 4 sub-buckets per octave giving much finer granularity.

Consider defining dedicated buckets for attempt counts with better low-end resolution, e.g. linear buckets from 1–10 then exponential beyond that. Alternatively, use tally.MustMakeLinearValueBuckets(1, 1, 20) (buckets at 1,2,3,...,20) appended with a few exponential buckets for the tail, which would capture the full distribution of attempt counts where it matters most.

🤖 Prompt for agents

Code Review: Switches task attempt count histograms to use RecordHistogramValue, but ResponseRowSizeBuckets is a poor fit for this metric—its boundaries {0, 1, 2, 4, 8, 16, 32, 64...} don't align with typical task attempt distributions.

1. ⚠️ Bug: ResponseRowSizeBuckets is a poor fit for task attempt counts
   Files: common/metrics/defs.go:3336, common/metrics/defs.go:3359

   `ResponseRowSizeBuckets` has boundaries at `{0, 1, 2, 4, 8, 16, 32, 64, ..., 65536}` — power-of-2 steps designed for response payload sizes. For task attempt counts, which are typically small integers (1–20 for most tasks), these buckets lose significant resolution: you can't distinguish 2 vs 3 attempts, or 5 vs 7 attempts. The old `Mid1To16k` had 4 sub-buckets per octave giving much finer granularity.
   
   Consider defining dedicated buckets for attempt counts with better low-end resolution, e.g. linear buckets from 1–10 then exponential beyond that. Alternatively, use `tally.MustMakeLinearValueBuckets(1, 1, 20)` (buckets at 1,2,3,...,20) appended with a few exponential buckets for the tail, which would capture the full distribution of attempt counts where it matters most.

Rules ❌ No requirements met

Repository Rules

❌ PR Description Quality Standards: The [How did you test it?] section lacks specific, copy-pasteable test commands, and the [Why?] section does not explain what was functionally wrong with the previous IntExponentialHistogram approach or why ResponseRowSizeBuckets is appropriate for task attempt counts.

_{1 rule not applicable. Show all rules by commenting gitar display:verbose.}

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

`Auto-apply`	`Compact`
`gitar auto-apply:on`	`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

zawadzkidiana added 2 commits March 3, 2026 15:37

Fix indentation in HandleErr critical retry block

bf00743

Signed-off-by: Diana Zawadzki <dzawa@live.de> Signed-off-by: Diana Zawadzki <40473085+zawadzkidiana@users.noreply.github.com>

zawadzkidiana force-pushed the diana/histogram-task-attempt-counts branch from 8ea0126 to bf00743 Compare March 3, 2026 23:38

Shaddoll reviewed Mar 14, 2026

View reviewed changes

gitar-bot bot reviewed Mar 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(metrics): use RecordHistogramValue for task attempt count histograms#7781

fix(metrics): use RecordHistogramValue for task attempt count histograms#7781
zawadzkidiana wants to merge 3 commits intocadence-workflow:masterfrom
zawadzkidiana:diana/histogram-task-attempt-counts

zawadzkidiana commented Mar 3, 2026 •

edited by gitar-bot bot

Loading

Uh oh!

Shaddoll Mar 14, 2026

Uh oh!

gitar-bot bot commented Mar 16, 2026

Uh oh!

gitar-bot bot Mar 16, 2026

Uh oh!

gitar-bot bot commented Mar 16, 2026

Repository Rules

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

zawadzkidiana commented Mar 3, 2026 • edited by gitar-bot bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Why

How did you test it

Potential risks

Release notes

Documentation changes

Uh oh!

Shaddoll Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot bot commented Mar 16, 2026

Uh oh!

gitar-bot bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot bot commented Mar 16, 2026

Repository Rules

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zawadzkidiana commented Mar 3, 2026 •

edited by gitar-bot bot

Loading