
chore(shard-distributor): Improve observability of rebalancing loop #7837

Open
eleonoradgr wants to merge 4 commits into cadence-workflow:master from eleonoradgr:debug-rebalance-loop

Conversation

Contributor

@eleonoradgr eleonoradgr commented Mar 20, 2026

What changed?

  • Add more logs for shard movements based on load.
  • Move the executor metric update before the new rebalancing step.

Part of Cadence Shard Manager #6862

Why?
We are onboarding, but we currently lack context on which shard is being moved, and we want more insight into the coldest and hottest executors. This can help us understand whether many movements affect a single shard and whether that becomes a problem for shard availability. We also want metrics on how many executors are in a namespace, even when no shard rebalancing happens.

How did you test it?
unit tests
go test -v ./service/sharddistributor/leader/process

Potential risks
The logging could be too verbose, but during this onboarding phase we can observe the output and then adjust.

Release notes
N/A

Documentation Changes
N/A

Signed-off-by: edigregorio <edigregorio@uber.com>

gitar-bot bot commented Mar 20, 2026

Please link a GitHub issue to this PR using one of the accepted formats:

  • Fixes #123
  • cadence-workflow/repo#123
  • https://github.com/cadence-workflow/repo/issues/123


gitar-bot bot commented Mar 20, 2026

🔍 CI failure analysis for cee3718: 1 test failure in fuzzy coverage testing (TestRegisterDomainRequestFuzz) unrelated to this PR's observability changes; the fuzz test failed to generate the 'empty' case after 10k attempts, indicating an existing flaky or incomplete fuzz test that should be triaged separately.

Overview

Analyzed 1 CI log from job 67889279868. Found 1 test failure in fuzzy coverage generation that is unrelated to the PR's shard-distributor observability changes. The failure occurs in domain API testing and pre-dates this changeset.

Failures

TestRegisterDomainRequestFuzz - Incomplete Fuzzy Coverage (confidence: medium)

  • Type: test
  • Affected jobs: 67889279868
  • Related to change: no
  • Root cause: The fuzz test expected to generate three cases (nil, empty, filled) but only produced two (nil, filled) after 10,000 fuzzing attempts. The 'empty' case was never hit by the fuzzing engine, suggesting either inadequate input generation strategy or an unreachable code path for empty RegisterDomainRequest values.
  • Suggested fix: Review api_test.go:449 and fuzz_test_utils.go:45-46. Increase fuzzing attempts, adjust input generation to better cover the empty case, or verify that the empty code path is reachable. This test should be investigated separately as it is unrelated to the current PR's metrics and observability improvements.
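The coverage mechanism described above can be sketched generically (the names below are hypothetical, not the actual Cadence fuzz harness): classify each generated value into one of the expected shapes and check that every shape was actually produced within the attempt budget.

```go
package main

import (
	"fmt"
	"math/rand"
)

// caseShape classifies a fuzzed value the way the test expects:
// nil, empty, or filled.
func caseShape(v *string) string {
	switch {
	case v == nil:
		return "nil"
	case *v == "":
		return "empty"
	default:
		return "filled"
	}
}

func main() {
	rng := rand.New(rand.NewSource(1))
	seen := map[string]bool{}

	// A generator that can reach all three shapes; the failing test's
	// generator apparently never produced the "empty" shape.
	for i := 0; i < 10000; i++ {
		var v *string
		switch rng.Intn(3) {
		case 0:
			v = nil
		case 1:
			s := ""
			v = &s
		default:
			s := "x"
			v = &s
		}
		seen[caseShape(v)] = true
	}

	for _, want := range []string{"nil", "empty", "filled"} {
		fmt.Printf("%s covered: %v\n", want, seen[want])
	}
}
```

If "empty covered: false" after the full budget, the generation strategy (or an unreachable code path) is the likely culprit, which matches the triage suggestion above.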

Summary

  • Change-related failures: 0 — PR changes to metrics definitions and shard-distributor processor are unaffected
  • Infrastructure/flaky failures: 1 — Existing flaky fuzz test in domain API that should be triaged as a pre-existing issue
  • Recommended action: This failure is unrelated to the observability changes in this PR. Triage the fuzzy coverage test separately; it may be a known flaky test or require adjustment to the fuzzing strategy. The PR changes are safe to proceed with.
Code Review 👍 Approved with suggestions (0 resolved / 1 finding)

Improves observability of the shard-distributor rebalancing loop with additional metrics and executor telemetry. Consider handling the division by zero case in load_ratio logging when the coldest executor has no shards to avoid +Inf values.

💡 Edge Case: Division by zero in load_ratio log when coldest load is 0

📄 service/sharddistributor/leader/process/processor.go:670

At line 670, hottestExecutorLoad/coldestExecutorLoad will produce +Inf when the coldest executor has no shards (load = 0.0). This is a pre-existing issue (the same division already exists at line 652 in the MaxDeviation check), but the new log line will emit a meaningless +Inf value for the load_ratio dynamic tag. Consider guarding against zero to emit a more useful metric/log value.

Suggested fix
tag.Dynamic("load_ratio", func() float64 {
    if coldestExecutorLoad == 0 {
        return math.Inf(1)
    }
    return hottestExecutorLoad / coldestExecutorLoad
}()),

Rules ⚠️ 1/2 requirements met

Repository Rules

GitHub Issue Linking Requirement: Add a reference to a GitHub issue in the PR description using one of these formats: 'Fixes #123', 'cadence-workflow/repo#123', or 'https://github.com/cadence-workflow/repo/issues/123'.
PR Description Quality Standards: All required sections are present and substantive including What changed, Why, How did you test it, Potential risks, and Release notes.

1 rule not applicable. Show all rules by commenting gitar display:verbose.


tag.Dynamic("coldest_executor_load", coldestExecutorLoad),
tag.Dynamic("load_ratio", hottestExecutorLoad/coldestExecutorLoad),
tag.Dynamic("hottest_executor_shard_count", len(currentAssignments[hottestExecutorID])),
tag.Dynamic("coldest_executor_shard_count", len(currentAssignments[coldestExecutorID])),
💡 Edge Case: Division by zero in load_ratio log when coldest load is 0

At line 670, hottestExecutorLoad/coldestExecutorLoad will produce +Inf when the coldest executor has no shards (load = 0.0). This is a pre-existing issue (the same division already exists at line 652 in the MaxDeviation check), but the new log line will emit a meaningless +Inf value for the load_ratio dynamic tag. Consider guarding against zero to emit a more useful metric/log value.

Suggested fix:

tag.Dynamic("load_ratio", func() float64 {
    if coldestExecutorLoad == 0 {
        return math.Inf(1)
    }
    return hottestExecutorLoad / coldestExecutorLoad
}()),


Contributor Author

This is a float in Go, so it will be +Inf.

Member

But how is it going to look in the logs? Are we able to search for +Inf? Is this useful?


// rebalanceByShardLoad does a rebalance if a difference between hottest and coldest executors' loads is more than maxDeviation
// in this case the hottest shard will be moved to the coldest executor
func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64, currentAssignments map[string][]string) (distributedChanged bool) {
func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64, currentAssignments map[string][]string, metricsScope metrics.Scope) (distributedChanged bool) {
Member
Hmm... why isn't metricsScope provided through the constructor, like p.logger?
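One reading of this suggestion, as a hypothetical sketch (the Scope interface, logScope, and newNamespaceProcessor below are illustrative stand-ins, not the actual Cadence types): inject the metrics scope once at construction time, the same way p.logger is, so the per-call parameter disappears.

```go
package main

import "fmt"

// Scope is a stand-in for a metrics scope interface (illustrative only).
type Scope interface {
	UpdateGauge(name string, value float64)
}

type logScope struct{}

func (logScope) UpdateGauge(name string, value float64) {
	fmt.Printf("gauge %s = %v\n", name, value)
}

// namespaceProcessor holds its dependencies, injected once at construction,
// mirroring how the logger is already provided.
type namespaceProcessor struct {
	metricsScope Scope
}

func newNamespaceProcessor(scope Scope) *namespaceProcessor {
	return &namespaceProcessor{metricsScope: scope}
}

// rebalanceByShardLoad no longer needs a metricsScope parameter.
func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64) bool {
	p.metricsScope.UpdateGauge("rebalance_shard_count", float64(len(shardLoad)))
	return false // rebalancing logic elided in this sketch
}

func main() {
	p := newNamespaceProcessor(logScope{})
	p.rebalanceByShardLoad(map[string]float64{"shard-1": 0.7})
}
```

The trade-off is standard dependency injection: construction-time wiring keeps method signatures stable, at the cost of the scope being fixed for the processor's lifetime.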

if len(deletedShards) > 0 {
p.logger.Info("Identified deleted shards", tag.ShardExecutors(slices.Collect(maps.Keys(deletedShards))))
}
metricsLoopScope.UpdateGauge(metrics.ShardDistributorAssignLoopDeletedShards, float64(len(deletedShards)))
Member

Why is this a Gauge and not a Counter?
It's not intuitive that the number of shards deleted in the current operation is "a state" rather than "stats".
Maybe I don't understand what a deleted shard means.
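The distinction being raised can be shown with a minimal sketch (fakeScope only mimics the shape of a typical metrics client; it is not Cadence's actual Scope): a gauge is overwritten each loop iteration, so a scrape between iterations only ever sees the latest sample, while a counter accumulates, preserving the total over time.

```go
package main

import "fmt"

// fakeScope mimics the shape of a metrics scope (illustrative only).
type fakeScope struct {
	gauges   map[string]float64
	counters map[string]int64
}

// UpdateGauge overwrites the value: "the state right now".
func (s *fakeScope) UpdateGauge(name string, v float64) { s.gauges[name] = v }

// AddCounter accumulates: "how many, in total, so far".
func (s *fakeScope) AddCounter(name string, delta int64) { s.counters[name] += delta }

func main() {
	s := &fakeScope{gauges: map[string]float64{}, counters: map[string]int64{}}

	// Two assign-loop iterations, deleting 3 shards then 2.
	for _, deleted := range []int{3, 2} {
		s.UpdateGauge("assign_loop_deleted_shards", float64(deleted))
		s.AddCounter("assign_loop_deleted_shards_total", int64(deleted))
	}

	fmt.Println(s.gauges["assign_loop_deleted_shards"])         // 2 (only the last sample)
	fmt.Println(s.counters["assign_loop_deleted_shards_total"]) // 5 (running total)
}
```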

