
chore(shard-distributor): Improve observability of rebalancing loop #7837

Open
eleonoradgr wants to merge 4 commits into cadence-workflow:master from eleonoradgr:debug-rebalance-loop

Conversation

Contributor

@eleonoradgr eleonoradgr commented Mar 20, 2026

What changed?

  • Add more logs for shard movements based on load.
  • Move the executor metric update before the new rebalancing step.

Part of Cadence Shard Manager #6862

Why?
We are onboarding, but we currently lack context on which shard is being moved, and we want more insight into the coldest and hottest executors. This can help us understand whether many movements affect a single shard and whether that becomes a problem for shard availability. We also want metrics on how many executors are in a namespace, even when no shard rebalancing happens.

How did you test it?
unit tests
go test -v ./service/sharddistributor/leader/process

Potential risks
The logging could be too verbose, but during this onboarding phase we can observe the output and then adjust.

Release notes
N/A

Documentation Changes
N/A

Signed-off-by: edigregorio <edigregorio@uber.com>

gitar-bot bot commented Mar 20, 2026

Please link a GitHub issue to this PR using one of the accepted formats:

  • Fixes #123
  • cadence-workflow/repo#123
  • https://github.com/cadence-workflow/repo/issues/123


gitar-bot bot commented Mar 20, 2026

🔍 CI failure analysis for cee3718: 1 test failure in fuzzy coverage testing (TestRegisterDomainRequestFuzz) unrelated to this PR's observability changes; the fuzz test failed to generate the 'empty' case after 10k attempts, indicating an existing flaky or incomplete fuzz test that should be triaged separately.

Overview

Analyzed 1 CI log from job 67889279868. Found 1 test failure in fuzzy coverage generation that is unrelated to the PR's shard-distributor observability changes. The failure occurs in domain API testing and pre-dates this changeset.

Failures

TestRegisterDomainRequestFuzz - Incomplete Fuzzy Coverage (confidence: medium)

  • Type: test
  • Affected jobs: 67889279868
  • Related to change: no
  • Root cause: The fuzz test expected to generate three cases (nil, empty, filled) but only produced two (nil, filled) after 10,000 fuzzing attempts. The 'empty' case was never hit by the fuzzing engine, suggesting either inadequate input generation strategy or an unreachable code path for empty RegisterDomainRequest values.
  • Suggested fix: Review api_test.go:449 and fuzz_test_utils.go:45-46. Increase fuzzing attempts, adjust input generation to better cover the empty case, or verify that the empty code path is reachable. This test should be investigated separately as it is unrelated to the current PR's metrics and observability improvements.
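The coverage mechanism described above can be sketched generically (the names below are hypothetical, not the actual Cadence fuzz harness): classify each generated value into one of the expected shapes and check that every shape was actually produced within the attempt budget.

```go
package main

import (
	"fmt"
	"math/rand"
)

// caseShape classifies a fuzzed value the way the test expects:
// nil, empty, or filled.
func caseShape(v *string) string {
	switch {
	case v == nil:
		return "nil"
	case *v == "":
		return "empty"
	default:
		return "filled"
	}
}

func main() {
	rng := rand.New(rand.NewSource(1))
	seen := map[string]bool{}

	// A generator that can reach all three shapes; the failing test's
	// generator apparently never produced the "empty" shape.
	for i := 0; i < 10000; i++ {
		var v *string
		switch rng.Intn(3) {
		case 0:
			v = nil
		case 1:
			s := ""
			v = &s
		default:
			s := "x"
			v = &s
		}
		seen[caseShape(v)] = true
	}

	for _, want := range []string{"nil", "empty", "filled"} {
		fmt.Printf("%s covered: %v\n", want, seen[want])
	}
}
```

If "empty covered: false" after the full budget, the generation strategy (or an unreachable code path) is the likely culprit, which matches the triage suggestion above.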

Summary

  • Change-related failures: 0 — PR changes to metrics definitions and shard-distributor processor are unaffected
  • Infrastructure/flaky failures: 1 — Existing flaky fuzz test in domain API that should be triaged as a pre-existing issue
  • Recommended action: This failure is unrelated to the observability changes in this PR. Triage the fuzzy coverage test separately; it may be a known flaky test or require adjustment to the fuzzing strategy. The PR changes are safe to proceed with.
Code Review 👍 Approved with suggestions (0 resolved / 1 finding)

Improves observability of the shard-distributor rebalancing loop with additional metrics and executor telemetry. Consider handling the division by zero case in load_ratio logging when the coldest executor has no shards to avoid +Inf values.

💡 Edge Case: Division by zero in load_ratio log when coldest load is 0

📄 service/sharddistributor/leader/process/processor.go:670

At line 670, hottestExecutorLoad/coldestExecutorLoad will produce +Inf when the coldest executor has no shards (load = 0.0). This is a pre-existing issue (the same division already exists at line 652 in the MaxDeviation check), but the new log line will emit a meaningless +Inf value for the load_ratio dynamic tag. Consider guarding against zero to emit a more useful metric/log value.

Suggested fix
tag.Dynamic("load_ratio", func() float64 {
    if coldestExecutorLoad == 0 {
        return math.Inf(1)
    }
    return hottestExecutorLoad / coldestExecutorLoad
}()),

Rules ⚠️ 1/2 requirements met

Repository Rules

GitHub Issue Linking Requirement: Add a reference to a GitHub issue in the PR description using one of these formats: 'Fixes #123', 'cadence-workflow/repo#123', or 'https://github.com/cadence-workflow/repo/issues/123'.
PR Description Quality Standards: All required sections are present and substantive including What changed, Why, How did you test it, Potential risks, and Release notes.

1 rule not applicable. Show all rules by commenting gitar display:verbose.


tag.Dynamic("coldest_executor_load", coldestExecutorLoad),
tag.Dynamic("load_ratio", hottestExecutorLoad/coldestExecutorLoad),
tag.Dynamic("hottest_executor_shard_count", len(currentAssignments[hottestExecutorID])),
tag.Dynamic("coldest_executor_shard_count", len(currentAssignments[coldestExecutorID])),
💡 Edge Case: Division by zero in load_ratio log when coldest load is 0

At line 670, hottestExecutorLoad/coldestExecutorLoad will produce +Inf when the coldest executor has no shards (load = 0.0). This is a pre-existing issue (the same division already exists at line 652 in the MaxDeviation check), but the new log line will emit a meaningless +Inf value for the load_ratio dynamic tag. Consider guarding against zero to emit a more useful metric/log value.

Suggested fix:

tag.Dynamic("load_ratio", func() float64 {
    if coldestExecutorLoad == 0 {
        return math.Inf(1)
    }
    return hottestExecutorLoad / coldestExecutorLoad
}()),


Contributor Author

This is a float in Go, so it will be +Inf.

Member

But how is it going to look in the logs? Are we able to search for +Inf? Is this useful?


// rebalanceByShardLoad does a rebalance if a difference between hottest and coldest executors' loads is more than maxDeviation
// in this case the hottest shard will be moved to the coldest executor
func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64, currentAssignments map[string][]string) (distributedChanged bool) {
func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64, currentAssignments map[string][]string, metricsScope metrics.Scope) (distributedChanged bool) {
Member
Hmm... why isn't metricsScope provided through the constructor, like p.logger?
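One reading of this suggestion, as a hypothetical sketch (the Scope interface, logScope, and newNamespaceProcessor below are illustrative stand-ins, not the actual Cadence types): inject the metrics scope once at construction time, the same way p.logger is, so the per-call parameter disappears.

```go
package main

import "fmt"

// Scope is a stand-in for a metrics scope interface (illustrative only).
type Scope interface {
	UpdateGauge(name string, value float64)
}

type logScope struct{}

func (logScope) UpdateGauge(name string, value float64) {
	fmt.Printf("gauge %s = %v\n", name, value)
}

// namespaceProcessor holds its dependencies, injected once at construction,
// mirroring how the logger is already provided.
type namespaceProcessor struct {
	metricsScope Scope
}

func newNamespaceProcessor(scope Scope) *namespaceProcessor {
	return &namespaceProcessor{metricsScope: scope}
}

// rebalanceByShardLoad no longer needs a metricsScope parameter.
func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64) bool {
	p.metricsScope.UpdateGauge("rebalance_shard_count", float64(len(shardLoad)))
	return false // rebalancing logic elided in this sketch
}

func main() {
	p := newNamespaceProcessor(logScope{})
	p.rebalanceByShardLoad(map[string]float64{"shard-1": 0.7})
}
```

The trade-off is standard dependency injection: construction-time wiring keeps method signatures stable, at the cost of the scope being fixed for the processor's lifetime.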

if len(deletedShards) > 0 {
p.logger.Info("Identified deleted shards", tag.ShardExecutors(slices.Collect(maps.Keys(deletedShards))))
}
metricsLoopScope.UpdateGauge(metrics.ShardDistributorAssignLoopDeletedShards, float64(len(deletedShards)))
Member

Why is this a Gauge and not a Counter?
It's not intuitive that the number of shards deleted in the current operation is "a state" rather than "stats".
Maybe I don't understand what a deleted shard means.
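The distinction being raised can be shown with a minimal sketch (fakeScope only mimics the shape of a typical metrics client; it is not Cadence's actual Scope): a gauge is overwritten each loop iteration, so a scrape between iterations only ever sees the latest sample, while a counter accumulates, preserving the total over time.

```go
package main

import "fmt"

// fakeScope mimics the shape of a metrics scope (illustrative only).
type fakeScope struct {
	gauges   map[string]float64
	counters map[string]int64
}

// UpdateGauge overwrites the value: "the state right now".
func (s *fakeScope) UpdateGauge(name string, v float64) { s.gauges[name] = v }

// AddCounter accumulates: "how many, in total, so far".
func (s *fakeScope) AddCounter(name string, delta int64) { s.counters[name] += delta }

func main() {
	s := &fakeScope{gauges: map[string]float64{}, counters: map[string]int64{}}

	// Two assign-loop iterations, deleting 3 shards then 2.
	for _, deleted := range []int{3, 2} {
		s.UpdateGauge("assign_loop_deleted_shards", float64(deleted))
		s.AddCounter("assign_loop_deleted_shards_total", int64(deleted))
	}

	fmt.Println(s.gauges["assign_loop_deleted_shards"])         // 2 (only the last sample)
	fmt.Println(s.counters["assign_loop_deleted_shards_total"]) // 5 (running total)
}
```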

