Fix: Search ranking - usage/tier boosts should act as tiebreakers, not override text relevance #26941

Open
mohityadav766 wants to merge 1 commit into main from fix-search-rankings

Conversation

@mohityadav766
Member

Fixes https://github.com/open-metadata/openmetadata-collate/issues/3468

Problem:
When searching the dataAsset composite index, entities with high usage (e.g., dashboards with 213K weekly views) were ranked far above entities with strong name matches. The root cause was that usage boosts were additive (boostMode: sum), producing ~1847 raw points from sqrt(213214) × 4.0 alone, which completely overwhelmed text relevance scores of ~30-120 points. This caused dashboards to dominate search results even when tables had "location" directly in their name.
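
The arithmetic above can be reproduced with a short sketch. It assumes, as the description's own figures do, that the usage boost is `factor × sqrt(views)` and is simply added to the text score. The class and method names are illustrative and are not part of the OpenMetadata codebase, and the text scores 30 and 120 are just the two ends of the range quoted above.

```java
// Illustrative only: reproduces the additive arithmetic quoted in the PR description.
final class AdditiveBoostSketch {
    // Usage boost as written in the description: factor * sqrt(weekly views).
    static double usageBoost(double weeklyViews, double factor) {
        return factor * Math.sqrt(weeklyViews);
    }

    // Old model (boostMode: sum): the boost is added to the text relevance score.
    static double additiveScore(double textScore, double boost) {
        return textScore + boost;
    }

    public static void main(String[] args) {
        double boost = usageBoost(213_214, 4.0);        // roughly 1847
        double dashboard = additiveScore(30.0, boost);  // weak text match, heavy usage
        double table = additiveScore(120.0, 0.0);       // strong name match, no usage
        System.out.printf("boost=%.0f dashboard=%.0f table=%.0f%n", boost, dashboard, table);
    }
}
```

Under these assumptions the heavily used dashboard outranks the table by more than an order of magnitude, whatever the quality of the name match.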

Solution:
Changed the scoring model from additive to multiplicative so that text relevance is the primary ranking signal and usage/tier only act as proportional tiebreakers:

  1. Switched boostMode from sum to multiply: function scores now multiply the text relevance score instead of adding to it. The formula becomes: final_score = text_score × (baseline + tier + usage)
  2. Added a baseline weight function of 1.0, which ensures every document has a minimum multiplier of 1.0 so that assets with no tier/usage retain their full text score under multiplicative mode.
  3. Tightened boost values to a true tiebreaker range: the max combined multiplier is now ~×1.22 (22% boost) instead of the previous ~1800+ additive points:
    - Tier boosts: Tier1=0.05, Tier2=0.03, Tier3=0.01
    - Usage count factor: 0.002 (log1p modifier)
    - Percentile rank factor: 0.0005
    - Votes factor: 0.005 (log1p modifier)
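
A minimal sketch of the multiplicative formula, showing why the baseline term matters. It assumes the function values are summed before being multiplied into the text score, as the formula in step 1 states. The class name is hypothetical; only the Tier1 value of 0.05 comes from the list above.

```java
// Illustrative only: final_score = text_score * (baseline + tier + usage).
final class MultiplicativeBoostSketch {
    static double finalScore(double textScore, double baseline, double tier, double usage) {
        return textScore * (baseline + tier + usage);
    }

    public static void main(String[] args) {
        // No tier, no usage, and no baseline: the multiplier is 0 and the text score is lost.
        System.out.println(finalScore(100.0, 0.0, 0.0, 0.0));
        // Same asset with the 1.0 baseline: the text score is retained in full.
        System.out.println(finalScore(100.0, 1.0, 0.0, 0.0));
        // Tier1 asset (0.05) with no usage: a 5% edge over the untiered asset.
        System.out.println(finalScore(100.0, 1.0, 0.05, 0.0));
    }
}
```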

Behavior after fix:

  • A strong name match (text=100) beats a weak match (text=50) regardless of usage, because the 2× gap in text score exceeds the maximum ~1.22× multiplier: 100 × 1.0 = 100 vs 50 × 1.22 = 61
  • Among equally matching assets, higher usage/tier breaks the tie: 80 × 1.07 ≈ 86 vs 80 × 1.0 = 80
  • Tier1 assets get a slight edge over untiered assets with the same text match
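
The examples above can be checked with a small sketch. It takes the ~1.22 ceiling from the description as given; the class name, the method names and the 100-vs-90 case are illustrative additions, not figures from the PR.

```java
// Illustrative only: checks the worked examples against the quoted ~1.22 ceiling.
final class TiebreakerSketch {
    static final double MAX_MULTIPLIER = 1.22; // approximate ceiling quoted in the description

    static double boosted(double textScore, double multiplier) {
        return textScore * multiplier;
    }

    // An unboosted stronger match stays ahead of a fully boosted weaker match
    // only while the ratio of their text scores exceeds the maximum multiplier.
    static boolean strongStaysAhead(double strongText, double weakText) {
        return strongText > boosted(weakText, MAX_MULTIPLIER);
    }

    public static void main(String[] args) {
        System.out.println(strongStaysAhead(100.0, 50.0)); // 100 vs 61
        System.out.println(boosted(80.0, 1.07));           // about 85.6, shown as 86 above
        System.out.println(strongStaysAhead(100.0, 90.0)); // 100 vs about 109.8
    }
}
```

The last case suggests the guarantee is relative rather than absolute: when two text scores differ by less than the maximum multiplier, a fully boosted asset can still overtake, which is the intended tiebreaker behavior.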

Copilot AI review requested due to automatic review settings April 1, 2026 19:58
@github-actions github-actions bot added the backend and safe to test labels Apr 1, 2026
@mohityadav766 mohityadav766 self-assigned this Apr 1, 2026
@mohityadav766 mohityadav766 moved this to In Review / QA 👀 in Shipping Apr 1, 2026
@gitar-bot

gitar-bot bot commented Apr 1, 2026

Code Review ✅ Approved

Search ranking fix ensures usage/tier boosts function as tiebreakers rather than overriding text relevance, improving result quality. No issues found.


Contributor

Copilot AI left a comment

Pull request overview

Adjusts the search scoring model for the dataAsset composite search so that text relevance remains the primary ranking signal, while tier/usage/votes only provide small tie-breaker influence.

Changes:

  • Updated default searchSettings.json to drastically reduce tier/usage/vote boost magnitudes and switch boostMode from sum to multiply across asset configurations.
  • Updated both OpenSearch and Elasticsearch source builder factories to use BoostMode.MULTIPLY for the composite “all assets” configuration.
  • Added a baseline weight=1.0 function to prevent multiplicative scoring from collapsing scores to zero when no boost functions match.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| openmetadata-service/src/main/resources/json/data/settings/searchSettings.json | Changes boosts to “tiebreaker range” and switches per-asset boostMode to multiply. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSourceBuilderFactory.java | Sets composite boost mode to multiply and adds baseline function score weight. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSourceBuilderFactory.java | Sets composite boost mode to multiply and adds baseline function score weight. |

Comment on lines 831 to +840
private List<es.co.elastic.clients.elasticsearch._types.query_dsl.FunctionScore>
collectBoostFunctionsV2(AssetTypeConfiguration assetConfig) {
List<es.co.elastic.clients.elasticsearch._types.query_dsl.FunctionScore> functions =
new ArrayList<>();

// Add baseline weight of 1.0 so that assets with no tier/usage retain their text score
// when boostMode is multiply. Without this, function_score could be 0 and zero out the
// text relevance score.
functions.add(ElasticQueryBuilder.weightFunction(ElasticQueryBuilder.matchAllQuery(), 1.0));

Copilot AI Apr 1, 2026

collectBoostFunctionsV2 now always adds a match_all weight(1.0) function, which means applyFunctionScoringV2 will always wrap the base query in a function_score query even when there are no configured term/field boosts. This is an observable behavior change (different query shape) and adds overhead; it also changes scoring semantics if a non-multiply boostMode is configured in the future (e.g., sum would add +1 to all docs). Consider only adding the baseline function when boostMode is multiply (and only when there is at least one other boost function to apply), otherwise return the base query unchanged.
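
One way the suggestion could look, as a rough sketch only. The `BoostMode` enum and `Fn` class below are hypothetical stand-ins for the real configuration and function-score types, whose APIs are not reproduced here; the sketch only shows the guard logic.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: adds the baseline function just when it is needed.
final class BaselineGuardSketch {
    enum BoostMode { SUM, MULTIPLY }

    // Hypothetical stand-in for a function_score function: a label and a weight.
    static final class Fn {
        final String name;
        final double weight;

        Fn(String name, double weight) {
            this.name = name;
            this.weight = weight;
        }
    }

    static List<Fn> collect(BoostMode mode, List<Fn> configured) {
        List<Fn> out = new ArrayList<>();
        if (configured.isEmpty()) {
            return out; // nothing to apply: the caller can leave the base query unwrapped
        }
        if (mode == BoostMode.MULTIPLY) {
            out.add(new Fn("baseline(match_all)", 1.0)); // keeps the multiplier at or above 1.0
        }
        out.addAll(configured);
        return out;
    }
}
```

With this guard, a `SUM` configuration would not pick up an extra +1 per document, and an asset type with no boosts would keep its original query shape.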

Comment on lines +561 to +565
// Add baseline weight of 1.0 so that assets with no tier/usage retain their text score
// when boostMode is multiply. Without this, function_score could be 0 and zero out the
// text relevance score.
functions.add(
OpenSearchQueryBuilder.weightFunction(OpenSearchQueryBuilder.matchAllQuery(), 1.0));
Copilot AI Apr 1, 2026

collectBoostFunctionsV2 now always adds a match_all weight(1.0) function, which means applyFunctionScoringV2 will always wrap the base query in a function_score query even when there are no configured term/field boosts. This is an observable behavior change (different query shape) and adds overhead; it also changes scoring semantics if a non-multiply boostMode is configured in the future (e.g., sum would add +1 to all docs). Consider only adding the baseline function when boostMode is multiply (and only when there is at least one other boost function to apply), otherwise return the base query unchanged.

Comment on lines 107 to 112
  {
    "field": "usageSummary.weeklyStats.count",
-   "factor": 4.0,
-   "modifier": "sqrt",
+   "factor": 0.002,
+   "modifier": "log1p",
    "missing": 1
  },
Copilot AI Apr 1, 2026

The weekly usage boost uses missing: 1 with modifier: log1p, which produces a non-zero boost for documents that don’t have usageSummary.weeklyStats.count (log1p(1)≈0.693). That contradicts the stated intent that assets with no usage retain their original text score under multiplicative scoring. If the goal is “no usage == no boost”, set missing to 0 (the new baseline weight already prevents multiplying by 0).
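
How large the non-zero boost is depends on how the modifier is evaluated. The sketch below assumes the boost is sent as a plain `field_value_factor` function, for which the Elasticsearch documentation describes `log1p` as the common (base-10) logarithm of 1 plus `factor × value`. The ≈0.693 figure quoted above corresponds instead to a natural logarithm with no factor applied. This PR does not show how OpenMetadata wires `factor` into the function, so the numbers are illustrative; the class and method names are hypothetical.

```java
// Illustrative only: contribution of a missing usage value under two readings of log1p.
final class MissingValueSketch {
    // Reading documented for Elasticsearch field_value_factor: log10(1 + factor * value).
    static double esLog1p(double value, double factor) {
        return Math.log10(1.0 + factor * value);
    }

    // Natural-log reading, which matches the 0.693 figure quoted in the review comment.
    static double naturalLog1p(double value) {
        return Math.log1p(value);
    }

    public static void main(String[] args) {
        System.out.println(esLog1p(1.0, 0.002)); // missing: 1 gives a small non-zero boost
        System.out.println(esLog1p(0.0, 0.002)); // missing: 0 gives exactly 0.0
        System.out.println(naturalLog1p(1.0));   // about 0.693
    }
}
```

Under either reading, `missing: 0` contributes exactly zero, which is the behavior the comment asks for.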

Comment on lines 182 to 187
  compositeConfig.setSearchFields(new ArrayList<>(uniqueFields.values()));
  compositeConfig.setTermBoosts(allTermBoosts);
  compositeConfig.setFieldValueBoosts(allFieldValueBoosts);
  compositeConfig.setScoreMode(AssetTypeConfiguration.ScoreMode.SUM);
- compositeConfig.setBoostMode(AssetTypeConfiguration.BoostMode.SUM);
+ compositeConfig.setBoostMode(AssetTypeConfiguration.BoostMode.MULTIPLY);

Copilot AI Apr 1, 2026

There are existing unit tests for the search source builder factories, but none assert the new scoring semantics (boostMode multiply) or the presence/placement of the baseline weight function. Adding a targeted test that inspects the generated function_score query (boost_mode and first function == match_all weight 1.0) would help prevent regressions in ranking behavior across Elasticsearch/OpenSearch implementations.

Comment on lines 205 to 209
  compositeConfig.setTermBoosts(allTermBoosts);
  compositeConfig.setFieldValueBoosts(allFieldValueBoosts);
  compositeConfig.setScoreMode(AssetTypeConfiguration.ScoreMode.SUM);
- compositeConfig.setBoostMode(AssetTypeConfiguration.BoostMode.SUM);
+ compositeConfig.setBoostMode(AssetTypeConfiguration.BoostMode.MULTIPLY);

Copilot AI Apr 1, 2026

There are existing unit tests for the search source builder factories, but none assert the new scoring semantics (boostMode multiply) or the presence/placement of the baseline weight function. Adding a targeted test that inspects the generated function_score query (boost_mode and first function == match_all weight 1.0) would help prevent regressions in ranking behavior across Elasticsearch/OpenSearch implementations.

@sonarqubecloud

sonarqubecloud bot commented Apr 1, 2026

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

🔴 Playwright Results — 2 failure(s), 25 flaky

✅ 3440 passed · ❌ 2 failed · 🟡 25 flaky · ⏭️ 223 skipped

Shard Passed Failed Flaky Skipped
🔴 Shard 1 450 1 4 2
🟡 Shard 2 612 0 7 32
🟡 Shard 3 617 0 3 27
🟡 Shard 4 619 0 5 47
🟡 Shard 5 586 0 1 67
🔴 Shard 6 556 1 5 48

Genuine Failures (failed on all attempts)

Pages/SearchSettings.spec.ts › Restore default search settings (shard 1)
Error: expect(received).toEqual(expected) // deep equality

- Expected  - 3
+ Received  + 3

@@ -12,20 +12,20 @@
        "script": "",
        "type": "terms",
      },
    ],
    "assetType": "table",
-   "boostMode": "sum",
+   "boostMode": "multiply",
    "fieldValueBoosts": Array [
      Object {
-       "factor": 3,
+       "factor": 0.002,
        "field": "usageSummary.monthlyStats.count",
        "missing": 0,
        "modifier": "log1p",
      },
      Object {
-       "factor": 1,
+       "factor": 0.0005,
        "field": "usageSummary.monthlyStats.percentileRank",
        "missing": 0,
        "modifier": "none",
      },
    ],

Pages/Glossary.spec.ts › Add and Remove Assets (shard 6)
Test timeout of 180000ms exceeded.
🟡 25 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Table - customization should work (shard 1, 1 retry)
  • Flow/Metric.spec.ts › Verify Related Metrics Update (shard 1, 1 retry)
  • Flow/Tour.spec.ts › Tour should work from URL directly (shard 1, 1 retry)
  • Pages/AuditLogs.spec.ts › should handle audit logs access for non-admin users (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataProductRenameConsolidation.spec.ts › Multiple rename + update cycles - assets should be preserved (shard 2, 1 retry)
  • Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Bulk selection operations (shard 2, 1 retry)
  • Features/DataQuality/ColumnLevelTests.spec.ts › Column Values Sum To Be Between (shard 2, 1 retry)
  • Features/DataQuality/DataQuality.spec.ts › Table test case (shard 2, 1 retry)
  • Features/DataQuality/DataQualityPermissions.spec.ts › User with TEST_CASE.VIEW_BASIC can view test case CONTENT details in UI (shard 2, 1 retry)
  • Features/Glossary/GlossaryAdvancedOperations.spec.ts › should remove individual reference from term (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Flow/ExploreDiscovery.spec.ts › Should display deleted assets when showDeleted is checked and deleted is not present in queryFilter (shard 3, 1 retry)
  • Flow/NotificationAlerts.spec.ts › Conversation source alert (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with deeply nested subdomains (3+ levels) verifies FQN propagation (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Delete Spreadsheet (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Set & Update table-cp, hyperlink-cp, string, integer, markdown, number, duration, email, enum, sqlQuery, timestamp, entityReference, entityReferenceList, timeInterval, time-cp, date-cp, dateTime-cp Custom Property (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tag Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExploreTree.spec.ts › Verify Database and Database Schema available in explore tree (shard 6, 1 retry)
  • Pages/InputOutputPorts.spec.ts › Lineage section collapse/expand (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Directory (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Labels

backend, safe to test

Projects

Status: In Review / QA 👀

2 participants