Fix: Search ranking - usage/tier boosts should act as tiebreakers, not override text relevance#26941
mohityadav766 wants to merge 1 commit into main
Code Review (Gitar) ✅ Approved: Search ranking fix ensures usage/tier boosts function as tiebreakers rather than overriding text relevance, improving result quality. No issues found.
Pull request overview
Adjusts the search scoring model for the dataAsset composite search so that text relevance remains the primary ranking signal, while tier/usage/votes only provide small tie-breaker influence.
Changes:
- Updated default `searchSettings.json` to drastically reduce tier/usage/vote boost magnitudes and switch `boostMode` from `sum` to `multiply` across asset configurations.
- Updated both OpenSearch and Elasticsearch source builder factories to use `BoostMode.MULTIPLY` for the composite "all assets" configuration.
- Added a baseline `weight=1.0` function to prevent multiplicative scoring from collapsing scores to zero when no boost functions match.
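The third change is easiest to see with numbers. The following is a minimal sketch, not the PR's code: it models a `function_score` query with `score_mode: sum` and `boost_mode: multiply` as plain arithmetic. The class name, method names, and the text score of 42.0 are hypothetical.

```java
import java.util.List;

/** Toy model of function_score arithmetic; all names here are illustrative. */
final class BoostCombination {

    /** score_mode = sum: add up the values of every function applied to the document. */
    static double sumFunctions(List<Double> functionValues) {
        double total = 0.0;
        for (double value : functionValues) {
            total += value;
        }
        return total;
    }

    /** boost_mode = multiply: the text score is scaled by the combined function score. */
    static double multiply(double textScore, List<Double> functionValues) {
        return textScore * sumFunctions(functionValues);
    }

    public static void main(String[] args) {
        double textScore = 42.0; // hypothetical text relevance score

        // Asset with no tier and no usage: every boost evaluates to 0.
        System.out.println(multiply(textScore, List.of(0.0, 0.0)));       // text score is wiped out
        // Same asset with the baseline weight of 1.0 added as the first function.
        System.out.println(multiply(textScore, List.of(1.0, 0.0, 0.0)));  // text score is preserved
        // Tier1 asset (boost 0.05) with the baseline: a small proportional lift.
        System.out.println(multiply(textScore, List.of(1.0, 0.05, 0.0)));
    }
}
```

With the baseline in place, the multiplier is always at least 1, so the boosts can only nudge the text score upward in proportion to its size.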
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/resources/json/data/settings/searchSettings.json | Changes boosts to “tiebreaker range” and switches per-asset boostMode to multiply. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSourceBuilderFactory.java | Sets composite boost mode to multiply and adds baseline function score weight. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSourceBuilderFactory.java | Sets composite boost mode to multiply and adds baseline function score weight. |

      private List<es.co.elastic.clients.elasticsearch._types.query_dsl.FunctionScore>
          collectBoostFunctionsV2(AssetTypeConfiguration assetConfig) {
        List<es.co.elastic.clients.elasticsearch._types.query_dsl.FunctionScore> functions =
            new ArrayList<>();
    +
    +   // Add baseline weight of 1.0 so that assets with no tier/usage retain their text score
    +   // when boostMode is multiply. Without this, function_score could be 0 and zero out the
    +   // text relevance score.
    +   functions.add(ElasticQueryBuilder.weightFunction(ElasticQueryBuilder.matchAllQuery(), 1.0));

collectBoostFunctionsV2 now always adds a match_all weight(1.0) function, which means applyFunctionScoringV2 will always wrap the base query in a function_score query even when there are no configured term/field boosts. This is an observable behavior change (different query shape) and adds overhead; it also changes scoring semantics if a non-multiply boostMode is configured in the future (e.g., sum would add +1 to all docs). Consider only adding the baseline function when boostMode is multiply (and only when there is at least one other boost function to apply), otherwise return the base query unchanged.

    +   // Add baseline weight of 1.0 so that assets with no tier/usage retain their text score
    +   // when boostMode is multiply. Without this, function_score could be 0 and zero out the
    +   // text relevance score.
    +   functions.add(
    +       OpenSearchQueryBuilder.weightFunction(OpenSearchQueryBuilder.matchAllQuery(), 1.0));

collectBoostFunctionsV2 now always adds a match_all weight(1.0) function, which means applyFunctionScoringV2 will always wrap the base query in a function_score query even when there are no configured term/field boosts. This is an observable behavior change (different query shape) and adds overhead; it also changes scoring semantics if a non-multiply boostMode is configured in the future (e.g., sum would add +1 to all docs). Consider only adding the baseline function when boostMode is multiply (and only when there is at least one other boost function to apply), otherwise return the base query unchanged.
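One way to read this suggestion is sketched below. It is not the project's code: the generic type `F` stands in for the search client's function-score type, and `BaselineSketch`, `withBaseline`, and the local `BoostMode` enum are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of conditionally prepending a baseline function; all names are hypothetical. */
final class BaselineSketch {

    /** Local stand-in for the configured boost mode. */
    enum BoostMode { SUM, MULTIPLY }

    /**
     * Returns the configured functions unchanged unless the boost mode is MULTIPLY
     * and at least one boost function exists; only then is the baseline prepended.
     */
    static <F> List<F> withBaseline(List<F> configured, F baseline, BoostMode mode) {
        if (configured.isEmpty() || mode != BoostMode.MULTIPLY) {
            // Nothing to protect: the caller can skip function_score wrapping
            // when the list is empty, and SUM mode must not gain an extra +1.
            return List.copyOf(configured);
        }
        List<F> result = new ArrayList<>(configured.size() + 1);
        result.add(baseline); // baseline goes first
        result.addAll(configured);
        return result;
    }
}
```

Under this sketch, an empty result signals the caller to return the base query unchanged, which keeps the query shape identical to the pre-change behavior when no boosts are configured.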

      {
        "field": "usageSummary.weeklyStats.count",
    -   "factor": 4.0,
    -   "modifier": "sqrt",
    +   "factor": 0.002,
    +   "modifier": "log1p",
        "missing": 1
      },

The weekly usage boost uses missing: 1 with modifier: log1p, which produces a non-zero boost for documents that don’t have usageSummary.weeklyStats.count (log1p(1) ≈ 0.301, as the log1p modifier takes the base-10 logarithm of 1 + value). That contradicts the stated intent that assets with no usage retain their original text score under multiplicative scoring. If the goal is “no usage == no boost”, set missing to 0 (the new baseline weight already prevents multiplying by 0).
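For reference, the effect of the `missing` value can be checked with a few lines of arithmetic. This sketch assumes the `field_value_factor` modifier definitions documented for Elasticsearch, where `log1p` is the common (base-10) logarithm of 1 + value and `ln1p` is the natural-log variant; the configured `factor` is deliberately left out, and the class and method names are hypothetical.

```java
/** Arithmetic for the log1p and ln1p modifiers; the factor is not modeled here. */
final class Log1pModifier {

    /** log1p modifier as documented: base-10 logarithm of (1 + value). */
    static double log1p(double value) {
        return Math.log10(1.0 + value);
    }

    /** ln1p modifier: natural logarithm of (1 + value). Java's Math.log1p is this variant. */
    static double ln1p(double value) {
        return Math.log1p(value);
    }

    public static void main(String[] args) {
        System.out.println(log1p(1.0)); // missing = 1: about 0.301, a non-zero boost
        System.out.println(log1p(0.0)); // missing = 0: 0.0, no boost
        System.out.println(ln1p(1.0));  // about 0.693, the natural-log value
    }
}
```

Either way the conclusion is the same: only `missing: 0` yields a zero contribution for assets without usage data.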

        compositeConfig.setSearchFields(new ArrayList<>(uniqueFields.values()));
        compositeConfig.setTermBoosts(allTermBoosts);
        compositeConfig.setFieldValueBoosts(allFieldValueBoosts);
        compositeConfig.setScoreMode(AssetTypeConfiguration.ScoreMode.SUM);
    -   compositeConfig.setBoostMode(AssetTypeConfiguration.BoostMode.SUM);
    +   compositeConfig.setBoostMode(AssetTypeConfiguration.BoostMode.MULTIPLY);

There are existing unit tests for the search source builder factories, but none assert the new scoring semantics (boostMode multiply) or the presence/placement of the baseline weight function. Adding a targeted test that inspects the generated function_score query (boost_mode and first function == match_all weight 1.0) would help prevent regressions in ranking behavior across Elasticsearch/OpenSearch implementations.

        compositeConfig.setTermBoosts(allTermBoosts);
        compositeConfig.setFieldValueBoosts(allFieldValueBoosts);
        compositeConfig.setScoreMode(AssetTypeConfiguration.ScoreMode.SUM);
    -   compositeConfig.setBoostMode(AssetTypeConfiguration.BoostMode.SUM);
    +   compositeConfig.setBoostMode(AssetTypeConfiguration.BoostMode.MULTIPLY);

There are existing unit tests for the search source builder factories, but none assert the new scoring semantics (boostMode multiply) or the presence/placement of the baseline weight function. Adding a targeted test that inspects the generated function_score query (boost_mode and first function == match_all weight 1.0) would help prevent regressions in ranking behavior across Elasticsearch/OpenSearch implementations.
🔴 Playwright Results: 2 failure(s), 25 flaky
✅ 3440 passed · ❌ 2 failed · 🟡 25 flaky · ⏭️ 223 skipped
Genuine Failures (failed on all attempts) ❌
Fixes https://github.com/open-metadata/openmetadata-collate/issues/3468
Problem:
When searching the dataAsset composite index, entities with high usage (e.g., dashboards with 213K weekly views) were ranked far above entities with strong name matches. The root cause was that usage boosts were additive (boostMode: sum), producing ~1847 raw points from sqrt(213214) × 4.0 alone, which completely overwhelmed text relevance scores of ~30-120 points. This caused dashboards to dominate search results even when tables had "location" directly in their name.
Solution:
Changed the scoring model from additive to multiplicative so that text relevance is the primary ranking signal and usage/tier only act as proportional tiebreakers: the text score is multiplied by (baseline + tier + usage).
Boost values were reduced to a tiebreaker range suited to multiplicative mode:
- Tier boosts: Tier1=0.05, Tier2=0.03, Tier3=0.01
- Usage count factor: 0.002 (log1p modifier)
- Percentile rank factor: 0.0005
- Votes factor: 0.005 (log1p modifier)
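The difference between the two models can be illustrated with the figures quoted in this description. The sketch below is an assumption-laden illustration, not the engine's scoring code: it follows the description's own arithmetic convention (factor multiplied by the modified value, base-10 log for log1p), uses 30 and 120 as hypothetical text scores taken from the quoted range, and assumes the table has no tier or usage. How the search engine actually composes factor and modifier is not modeled.

```java
/** Old vs. new ranking arithmetic using the figures from the description. Names are hypothetical. */
final class RankingExample {

    /** Old model (boostMode: sum): the boost is added to the text score. */
    static double additive(double textScore, double boost) {
        return textScore + boost;
    }

    /** New model (boostMode: multiply): text score times (baseline 1.0 + boosts). */
    static double multiplicative(double textScore, double... boosts) {
        double multiplier = 1.0; // baseline weight
        for (double boost : boosts) {
            multiplier += boost;
        }
        return textScore * multiplier;
    }

    /** Old usage boost as stated in the description: sqrt(count) x 4.0. */
    static double oldUsageBoost(double weeklyCount) {
        return Math.sqrt(weeklyCount) * 4.0;
    }

    /** New usage boost under the same convention (an assumption): 0.002 x log10(1 + count). */
    static double newUsageBoost(double weeklyCount) {
        return 0.002 * Math.log10(1.0 + weeklyCount);
    }

    public static void main(String[] args) {
        double dashboardText = 30.0;  // weak text match, 213214 weekly views, Tier1
        double tableText = 120.0;     // strong name match, assumed no tier or usage
        double views = 213214;

        // Old model: the dashboard's usage boost swamps the table's text score.
        System.out.println(additive(dashboardText, oldUsageBoost(views)) > additive(tableText, 0.0));
        // New model: the table's stronger text match wins.
        System.out.println(multiplicative(dashboardText, 0.05, newUsageBoost(views))
                < multiplicative(tableText));
    }
}
```

Under these assumptions both comparisons print `true`: the dashboard outranks the table in the additive model and falls below it in the multiplicative one.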
Behavior after fix: