Skip to content

Conversation

@bowenlan-amzn
Copy link
Member

@bowenlan-amzn bowenlan-amzn commented Jan 30, 2026

Description

Streaming agg planning fallback when

  • low cardinality string term
  • match all + not match segment with deleted docs

query in the search context seems won't be null and if so it will be parsed to match all

parsedQuery(ParsedQuery.parsedMatchAllQuery());

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 30, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review
📝 Walkthrough

Walkthrough

This PR refines streaming aggregation optimization for Terms aggregations with ordinal-based string sources by adding specialized cost estimation logic that analyzes segment metadata to determine when streaming should be disabled based on cardinality and match-all query detection, along with corresponding integration and unit tests.

Changes

Cohort / File(s) Summary
Streaming Integration Tests
plugins/arrow-flight-rpc/src/internalClusterTest/java/org/opensearch/streaming/aggregation/SubAggregationIT.java
Updated multiple test methods to use existsQuery("field") constraints in stream search queries to prevent match-all optimization from disabling streaming behavior.
Terms Aggregator Streaming Cost Logic
server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregatorFactory.java
Added specialized streaming cost estimation for ordinal-based terms via new shouldDisableStreamingForOrdinals helper that inspects per-segment cardinality and deletion status. Introduced isMatchAllQuery detection to apply match-all optimization heuristics. Refactored conditional logic to split ordinal handling from numeric source path while preserving existing numeric behavior.
Streaming Cost Estimation Tests
server/src/test/java/org/opensearch/search/aggregations/FactoryStreamingCostEstimationTests.java
Added comprehensive test coverage for Terms aggregation streaming fallback scenarios including low cardinality, match-all queries with clean/deleted segments, and non-match-all constraints. Includes helper methods for building aggregators with custom queries and simulating various segment conditions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • #20471: Modifies TermsAggregatorFactory's streaming cost/eligibility logic with overlapping changes to ordinal-specific heuristics and match-all query handling.

Suggested labels

Search:Aggregations, Search:Performance

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Streaming agg planning fallback enhancement' is relevant to the main changes, which implement fallback enhancements for streaming aggregation planning in specific scenarios.
Description check ✅ Passed The description addresses the key technical change (fallback for low cardinality strings and match-all scenarios), provides a relevant code reference, and confirms testing and checklist items are completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bowenlan-amzn bowenlan-amzn added skip-changelog v3.5.0 Issues and PRs related to version 3.4.0 labels Jan 30, 2026
- low cardinality string term
- match all + not match segment with deleted docs

Signed-off-by: bowenlan-amzn <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for be5a04e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rishabhmaurya
Copy link
Contributor

when explicit execution_hint is passed as map for term aggs, we can skip streaming agg.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregatorFactory.java`:
- Around line 717-728: The streaming-cost logic in TermsAggregatorFactory
(estimateStreamingCost / the block handling valuesSource) ignores an explicit
executionHint, causing streamable costs despite user-specified hints; update the
method to short-circuit to StreamingCostMetrics.nonStreamable() when an
executionHint is present (e.g., executionHint != null &&
!executionHint.isEmpty() or matches known hints like "map"/"global_ordinals")
before the WithOrdinals / Numeric checks, so honor the execution_hint and avoid
returning streamable costs when the user requested a specific execution mode.

In
`@server/src/test/java/org/opensearch/search/aggregations/FactoryStreamingCostEstimationTests.java`:
- Around line 684-704: The BooleanQuery wrapper in createAggregatorFactory can
rewrite back to MatchAllDocsQuery; replace the nonMatchAllQuery construction so
it cannot be rewritten (e.g., use a TermQuery on a stable constant/indexed field
or a ConstantScoreQuery wrapping a TermQuery) instead of new
BooleanQuery.Builder().add(new MatchAllDocsQuery(), MUST).build(), or after
creating nonMatchAllQuery
assert/searchContext.getQueryShardContext().rewrite(...) does not produce a
MatchAllDocsQuery to ensure the query stays non-match-all; update references to
nonMatchAllQuery and the createAggregatorFactory method accordingly.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 732c20f and 297a70e.

📒 Files selected for processing (3)
  • plugins/arrow-flight-rpc/src/internalClusterTest/java/org/opensearch/streaming/aggregation/SubAggregationIT.java
  • server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregatorFactory.java
  • server/src/test/java/org/opensearch/search/aggregations/FactoryStreamingCostEstimationTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.

Applied to files:

  • server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregatorFactory.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (10)
plugins/arrow-flight-rpc/src/internalClusterTest/java/org/opensearch/streaming/aggregation/SubAggregationIT.java (2)

50-50: Import update looks fine.

Static import aligns with the new query usage in the tests.


294-298: Verify that existsQuery("field1") does not rewrite to MatchAllDocsQuery for fully populated fields.

The test data shows all 90 documents contain field1, making it fully populated. Lucene's FieldExistsQuery can optimize to MatchAllDocsQuery in such cases (similar to the _id field behavior seen in IdFieldMapper). If the query rewrites to MatchAllDocsQuery, it may disable streaming and make this assertion brittle. Confirm the rewritten query remains FieldExistsQuery (or equivalent non-match-all); if it collapses, consider using a query that cannot rewrite to match-all, such as a range query on the field or a term query with a known value.

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregatorFactory.java (3)

35-40: Imports for the new per‑segment logic look good.


69-70: No review comment for this minor import tweak.


731-793: Helper logic is clear and targeted.

The low‑cardinality and clean‑segment checks are easy to follow and scoped to ordinals only.

server/src/test/java/org/opensearch/search/aggregations/FactoryStreamingCostEstimationTests.java (5)

18-24: No issues in the added imports.


43-45: No review comment for this minor static import tweak.


529-665: Great coverage of streaming fallback scenarios.

These tests cover low cardinality, match‑all with clean/deleted segments, and non‑match‑all paths well.


709-719: Helper for explicit query injection looks good.


721-732: Helper for min‑bucket‑count control is clear.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +717 to 728
if (valuesSource instanceof WithOrdinals ordinalsValuesSource) {
if (shouldDisableStreamingForOrdinals(searchContext, ordinalsValuesSource)) {
return StreamingCostMetrics.nonStreamable();
}
return new StreamingCostMetrics(true, segmentTopN);
}

if (valuesSource instanceof ValuesSource.Numeric) {
return new StreamingCostMetrics(true, segmentTopN);
}

return StreamingCostMetrics.nonStreamable();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Honor explicit execution_hint in streaming cost estimation.

estimateStreamingCost can still return streamable even when executionHint is explicitly set (e.g., map/global_ordinals), but the factory will then force a non‑streaming execution mode. This can make the planner pick streaming despite user intent. Consider short‑circuiting to non‑streamable when a hint is present.

🛠️ Proposed fix
 public StreamingCostMetrics estimateStreamingCost(SearchContext searchContext) {
     ValuesSource valuesSource = config.getValuesSource();
     int segmentTopN = computeSegmentTopN(searchContext, bucketCountThresholds, order);
+
+    if (executionHint != null) {
+        return StreamingCostMetrics.nonStreamable();
+    }
🤖 Prompt for AI Agents
In
`@server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregatorFactory.java`
around lines 717 - 728, The streaming-cost logic in TermsAggregatorFactory
(estimateStreamingCost / the block handling valuesSource) ignores an explicit
executionHint, causing streamable costs despite user-specified hints; update the
method to short-circuit to StreamingCostMetrics.nonStreamable() when an
executionHint is present (e.g., executionHint != null &&
!executionHint.isEmpty() or matches known hints like "map"/"global_ordinals")
before the WithOrdinals / Numeric checks, so honor the execution_hint and avoid
returning streamable costs when the user requested a specific execution mode.

@github-actions
Copy link
Contributor

❌ Gradle check result for 297a70e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for bf5b7e6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: bowenlan-amzn <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for bcba3d5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-changelog v3.5.0 Issues and PRs related to version 3.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants