chore(shard-distributor): Improve observability of rebalancing loop #7837
eleonoradgr wants to merge 4 commits into cadence-workflow:master
Conversation
Signed-off-by: edigregorio <edigregorio@uber.com>
🔍 CI failure analysis for cee3718: 1 test failure in fuzzy coverage testing (TestRegisterDomainRequestFuzz), unrelated to this PR's observability changes; the fuzz test failed to generate the 'empty' case after 10k attempts, indicating an existing flaky or incomplete fuzz test that should be triaged separately.

Overview
Analyzed 1 CI log from job 67889279868. Found 1 test failure in fuzzy coverage generation that is unrelated to the PR's shard-distributor observability changes. The failure occurs in domain API testing and pre-dates this changeset.

Failures
TestRegisterDomainRequestFuzz - Incomplete Fuzzy Coverage (confidence: medium)
Summary
Code Review 👍 Approved with suggestions — 0 resolved / 1 finding

Improves observability of the shard-distributor rebalancing loop with additional metrics and executor telemetry. Consider handling the division-by-zero case in the load_ratio logging when the coldest executor has no shards, to avoid emitting a meaningless +Inf value.

💡 Edge Case: Division by zero in load_ratio log when coldest load is 0
📄 service/sharddistributor/leader/process/processor.go:670
| tag.Dynamic("coldest_executor_load", coldestExecutorLoad),
| tag.Dynamic("load_ratio", hottestExecutorLoad/coldestExecutorLoad),
| tag.Dynamic("hottest_executor_shard_count", len(currentAssignments[hottestExecutorID])),
| tag.Dynamic("coldest_executor_shard_count", len(currentAssignments[coldestExecutorID])),
💡 Edge Case: Division by zero in load_ratio log when coldest load is 0
At line 670, hottestExecutorLoad/coldestExecutorLoad will produce +Inf when the coldest executor has no shards (load = 0.0). This is a pre-existing issue (the same division already exists at line 652 in the MaxDeviation check), but the new log line will emit a meaningless +Inf value for the load_ratio dynamic tag. Consider guarding against zero to emit a more useful metric/log value.
Suggested fix:
tag.Dynamic("load_ratio", func() float64 {
if coldestExecutorLoad == 0 {
return math.Inf(1)
}
return hottestExecutorLoad / coldestExecutorLoad
}()),
This is a float in Go; it will be +Inf.
But how is it going to look in the logs? Are we able to search for +Inf? Is this useful?
| // rebalanceByShardLoad does a rebalance if the difference between the hottest and coldest executors' loads is more than maxDeviation
| // in this case the hottest shard will be moved to the coldest executor
| func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64, currentAssignments map[string][]string) (distributedChanged bool) {
| func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64, currentAssignments map[string][]string, metricsScope metrics.Scope) (distributedChanged bool) {
Hmm... why isn't metricsScope provided via the constructor, like p.logger?
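A minimal sketch of what this reviewer suggestion could look like: the scope is injected once at construction (mirroring how p.logger is held on the struct) so rebalanceByShardLoad keeps its original signature. The Scope interface, printScope, and constructor name here are illustrative assumptions, not the actual Cadence types.

```go
package main

import "fmt"

// Scope is a stand-in for the metrics scope interface (assumed for illustration).
type Scope interface {
	UpdateGauge(name string, value float64)
}

// printScope is a toy implementation that just prints the gauge update.
type printScope struct{}

func (printScope) UpdateGauge(name string, value float64) {
	fmt.Printf("gauge %s = %v\n", name, value)
}

type namespaceProcessor struct {
	metricsScope Scope // stored once at construction, like p.logger
}

func newNamespaceProcessor(scope Scope) *namespaceProcessor {
	return &namespaceProcessor{metricsScope: scope}
}

// rebalanceByShardLoad no longer needs metricsScope threaded in as a parameter.
func (p *namespaceProcessor) rebalanceByShardLoad(shardLoad map[string]float64) bool {
	p.metricsScope.UpdateGauge("rebalance_shard_count", float64(len(shardLoad)))
	return false
}

func main() {
	p := newNamespaceProcessor(printScope{})
	p.rebalanceByShardLoad(map[string]float64{"shard-1": 0.5})
}
```

The trade-off is the usual one for dependency injection: constructor injection keeps call sites clean, while passing the scope per call makes it easier to use a per-operation scope with different tags.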
| if len(deletedShards) > 0 {
| 	p.logger.Info("Identified deleted shards", tag.ShardExecutors(slices.Collect(maps.Keys(deletedShards))))
| }
| metricsLoopScope.UpdateGauge(metrics.ShardDistributorAssignLoopDeletedShards, float64(len(deletedShards)))
There was a problem hiding this comment.
why is this Gauge and not Counter?
It's not intuitive the number of deleted shards in the current operation is "a state" rather then "stats".
Maybe I don't understand what's the deleted shard means.
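To illustrate the semantic difference the comment is raising, here is a minimal in-memory sketch (assumed types, not the metrics library Cadence actually uses): a counter accumulates deletions across rebalance-loop iterations ("stats"), while a gauge only retains the most recent iteration's value ("state"), so earlier deletions are overwritten.

```go
package main

import "fmt"

// counter accumulates values over time, like a metrics Counter.
type counter struct{ total float64 }

func (c *counter) Inc(n float64) { c.total += n }

// gauge keeps only the last reported value, like a metrics Gauge.
type gauge struct{ last float64 }

func (g *gauge) Update(v float64) { g.last = v }

func main() {
	// Hypothetical deleted-shard counts seen in three loop iterations.
	deletedPerLoop := []float64{3, 0, 2}

	c, g := &counter{}, &gauge{}
	for _, n := range deletedPerLoop {
		c.Inc(n)    // counter: total deletions across all iterations
		g.Update(n) // gauge: only the latest iteration's count survives
	}
	fmt.Println(c.total, g.last) // prints: 5 2
}
```

With a gauge, a scrape between iterations can miss deletions entirely; a counter lets the backend compute a rate of deletions over time, which is usually what "how many shards are being deleted" means.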
What changed?
part of Cadence Shard Manager #6862
Why?
We are onboarding, but we are missing context on which shard is getting moved, as well as deeper insight into the coldest and hottest executors. These can help us understand whether many movements are affecting a single shard and whether that becomes a problem for shard availability.
We also want metrics on how many executors are in a namespace, even when there is no shard rebalancing.
How did you test it?
unit tests
go test -v ./service/sharddistributor/leader/process
Potential risks
The logging could be too verbose, but during this onboarding phase we can observe and then adjust.
Release notes
N/A
Documentation Changes
N/A