Skip to content

Conversation

@atris
Copy link
Contributor

@atris atris commented Aug 28, 2025

Summary

Implement streaming search on the coordinator, emitting early partial results from the query phase with optional scoring. The change introduces request flags and a mode selector, integrates streaming into the existing SearchAction
path, and adds a reproducible TTFB benchmark. When streaming is not used, behavior is unchanged.

Motivation

• Reduce time-to-first-byte (TTFB) at the coordinator by not waiting for all shards to complete query phase before starting fetch-eligible work.
• Provide mode-specific controls for batching and scoring, with safe defaults.
• Keep backward compatibility on the transport wire and preserve REST semantics.

Design and Scope

• Request flags and mode
• SearchRequest gains version-gated fields (V_3_3_0): streamingScoring (boolean) and streamingSearchMode (string).
• REST: stream=true enables streaming; optional stream_scoring_mode and streaming_mode select behavior.
• No change to default behavior; streaming is opt‑in.
• Coordinator streaming
• Streaming is integrated into TransportSearchAction (SearchAction). No separate transport action is required.
• SearchPhaseController.newSearchPhaseResults(...) returns either the existing QueryPhaseResultConsumer or a StreamQueryPhaseResultConsumer based on the request mode.
• StreamQueryPhaseResultConsumer controls partial reduce cadence via mode-specific multipliers and emits TopDocs-aware partials to the progress listener.
• Partial reduce notifications
• SearchProgressListener gains a TopDocs-aware hook with a compatibility fallback:
• onPartialReduceWithTopDocs(…) → defaults to onPartialReduce(…).
• notifyPartialReduceWithTopDocs(…) invokes the hook safely.
• Existing listeners are unaffected.
• Query execution
• For streaming queries, the QueryPhase routes to streaming collector contexts based on StreamingSearchMode:
• NO_SCORING: unsorted documents, fastest emission.
• SCORED_UNSORTED: scored documents without sort.
• SCORED_SORTED: scored, sorted via Lucene’s top-N collectors.
• CONFIDENCE_BASED: early emission guided by simple Hoeffding-style bounds.
• Collector batch size is bounded and read via SearchContext.getStreamingBatchSize(); partial batches are emitted to the stream channel when available.
• Transport integration
• Both the classic and stream transport handlers are registered:
• Classic: SearchTransportService.registerRequestHandler(…).
• Stream (if available): StreamSearchTransportService.registerStreamRequestHandler(…).
• The streaming transport path is selected only for streaming requests and used thread pools are chosen accordingly.

Settings and Controls

• Dynamic cluster settings for streaming are added (StreamingSearchSettings, node-scoped, dynamic). Examples:
• search.streaming.batch_size
• Mode-specific reduce multipliers, emission interval, and minimal doc thresholds
• Circuit breaker and limits for buffering in streaming code paths
• Defaults are conservative. The feature remains opt-in via request flags; settings do not change behavior unless the request is streaming.

Wire Compatibility and API

• Transport wire BWC
• New SearchRequest and ShardSearchRequest fields are gated by Version.V_3_3_0 on read/write. Older peers neither write nor read these fields.
• Public API
• No breaking changes to REST endpoints.
• SearchProgressListener adds new methods with safe defaults; existing code continues to compile and run.

Tests and Benchmark

• Unit tests:
• Stream consumer batch sizing and dynamic settings effects.
• Hoeffding bounds behavior.
• Integration tests:
• Basic streaming search workflows.
• Streaming aggregations with and without sub-aggregations.
• Mode coverage (NO_SCORING, SCORED_UNSORTED, SCORED_SORTED, CONFIDENCE_BASED).
• Benchmark:
• StreamingPerformanceBenchmarkTests: measures coordinator-side TTFB (time to first partial reduce) vs. classic full reduce for a large query.
• Logger-only reporting; no REST streaming is introduced.

Non-Goals / Limitations

• This change does not implement HTTP/REST streaming of partial responses.
• The SearchResponse partial/sequence metadata used internally by the streaming listener is not serialized on the wire and does not alter REST payloads.
• Confidence-based mode uses a conservative and simple bound; it is adequate for early gating but not a full ranking stability analysis.

Backward Compatibility and Risk

• Default behavior unchanged unless streaming flags are provided.
• Wire BWC ensured via version gating; JApiCmp passes.
• Aggregation partial reductions are unaffected; for TopDocs partials we call the new TopDocs-aware hook, otherwise we continue to notify via the existing method.

Operational Notes

• Streaming is disabled by default and must be explicitly requested with stream=true (REST) or by setting SearchRequest flags programmatically.
• Mode selection allows tuning for latency vs. coordination cost.
• Dynamic settings enable safe runtime tuning if necessary.

If reviewers prefer, I can split the settings and the confidence-based collector into a follow-up to further reduce the initial surface.

Summary by CodeRabbit

  • New Features

    • Streaming search extended to delete-by-query flows and safer REST channel lifecycle management for streaming requests.
  • Bug Fixes

    • More reliable aggregation profiling type resolution when profiling wrappers are present.
    • Safer handling around post-collection aggregation building to avoid null/chain issues.
  • Tests

    • New unit and integration tests for streaming search, channel lifecycle/untracking, streaming aggregators, and streaming collector selection.

✏️ Tip: You can customize this high-level summary in your review settings.

@github-actions
Copy link
Contributor

❌ Gradle check result for 5554606: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@atris atris closed this Aug 29, 2025
@atris atris reopened this Aug 29, 2025
@github-actions
Copy link
Contributor

❌ Gradle check result for 5554606: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@atris atris closed this Aug 29, 2025
@atris atris reopened this Aug 29, 2025
@github-actions
Copy link
Contributor

❌ Gradle check result for 5554606: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

  Introduces streaming search infrastructure that enables progressive emission
  of search results with three configurable scoring modes. The implementation
  extends the existing streaming transport layer to support partial result
  computation at the coordinator level.

  Scoring modes:
  - NO_SCORING: Immediate result emission without confidence requirements
  - CONFIDENCE_BASED: Statistical emission using Hoeffding inequality bounds
  - FULL_SCORING: Complete scoring before result emission

  The implementation leverages OpenSearch's inter-node streaming capabilities
  to reduce query latency through early result emission. Partial reductions
  are triggered based on the selected scoring mode, with results accumulated
  at the coordinator before final response generation.

  Key changes:
  - Add HoeffdingBounds for statistical confidence calculation
  - Extend QueryPhaseResultConsumer to support streaming reduction
  - Add StreamingScoringCollector wrapping TopScoreDocCollector
  - Integrate streaming scorer selection in QueryPhase
  - Add REST parameter stream_scoring_mode for mode selection
  - Include streaming metadata in SearchResponse

  The current implementation operates within architectural constraints where
  streaming is limited to inter-node communication. Client-facing streaming
  will be addressed in a follow-up contribution.

  Addresses opensearch-project#18725

Signed-off-by: Atri Sharma <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for 3e1079a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@atris atris closed this Jan 21, 2026
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Jan 21, 2026
@atris atris reopened this Jan 21, 2026
@github-project-automation github-project-automation bot moved this from Done to In Progress in Performance Roadmap Jan 21, 2026
@github-actions
Copy link
Contributor

❌ Gradle check result for 3e1079a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for 2fbd384: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@atris atris closed this Jan 22, 2026
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Jan 22, 2026
@atris atris reopened this Jan 22, 2026
@github-project-automation github-project-automation bot moved this from Done to In Progress in Performance Roadmap Jan 22, 2026
Signed-off-by: Atri Sharma <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for edeabe5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for c204273: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Atri Sharma <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for 5c095cf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Atri Sharma <[email protected]>
@github-actions
Copy link
Contributor

❌ Gradle check result for 6af9270: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@atris atris closed this Jan 23, 2026
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Jan 23, 2026
@atris atris reopened this Jan 23, 2026
@github-project-automation github-project-automation bot moved this from Done to In Progress in Performance Roadmap Jan 23, 2026
@github-actions
Copy link
Contributor

❌ Gradle check result for 6af9270: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for 87a3712: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for b2f1903: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants