Skip to content

Add hybrid-rrf pipeline check and improve embedding error reporting#26936

Open
pmbrull wants to merge 6 commits intomainfrom
task/fix-status-validation-in-systemrepositor-c5927228
Open

Add hybrid-rrf pipeline check and improve embedding error reporting#26936
pmbrull wants to merge 6 commits intomainfrom
task/fix-status-validation-in-systemrepositor-c5927228

Conversation

@pmbrull
Copy link
Copy Markdown
Collaborator

@pmbrull pmbrull commented Apr 1, 2026

Summary

  • Adds validation of the hybrid-rrf search pipeline in the system health check when embeddings are enabled. If the pipeline is missing, the status reports an actionable error ("Run a reindex to create it"), differentiating 404 from 5xx/connectivity errors.
  • Improves error reporting when the vector search service fails to initialize: reads the stored init error to surface the actual failure (embedding client config vs OpenSearch vector service), replacing the generic "not configured properly" message.
  • Adds checkHybridSearchPipeline() to OpenSearchVectorService and SearchRepository, returning Optional<String> (empty = OK, present = error detail) to check pipeline existence via a GET request.
  • Skips expensive retry of initializeVectorSearchService() when a previous init error is already recorded.

Closes https://github.com/open-metadata/ai-platform/issues/473

Test plan

  • Unit tests added for checkHybridSearchPipeline (200 OK, 404, 5xx, and exception cases)
  • Unit tests added for SystemRepository embeddings validation paths (7 tests covering all branches)
  • Verify health check shows hybrid-rrf pipeline status when embeddings are enabled
  • Verify improved error messages when embedding client fails to initialize
  • Verify improved error messages when OpenSearch vector service fails
  • Run mvn spotless:apply - confirmed formatting is clean

…n status validation

- Add isHybridSearchPipelineAvailable() to OpenSearchVectorService and SearchRepository
  to verify the hybrid-rrf search pipeline exists in OpenSearch
- Update getEmbeddingsValidation in SystemRepository to also validate the pipeline
  when embeddings are enabled, failing with actionable message if missing
- Retry vector service initialization during validation to surface the actual
  failure reason (embedding client vs OpenSearch vector service) instead of
  the generic "not configured properly" message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 1, 2026 13:08
@github-actions github-actions bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Apr 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances semantic search health validation for OpenSearch-backed embeddings by checking for the required hybrid-rrf search pipeline and by improving initialization error reporting to better distinguish embedding client vs. vector service failures.

Changes:

  • Add a hybrid-rrf pipeline existence check (via GET /_search/pipeline/hybrid-rrf) and surface an actionable “reindex to create it” message when missing.
  • Update embeddings system validation to retry vector service initialization during health checks and produce more specific failure messages.
  • Add unit tests for OpenSearchVectorService.isHybridSearchPipelineAvailable() covering success, 404, and exception cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java Extends embeddings health validation to retry init and validate hybrid pipeline availability with improved messaging.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java Exposes isHybridSearchPipelineAvailable() by delegating to OpenSearchVectorService when applicable.
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java Adds isHybridSearchPipelineAvailable() that checks the hybrid-rrf search pipeline via the OpenSearch generic client.
openmetadata-service/src/test/java/org/openmetadata/service/search/vector/OpenSearchVectorServiceTest.java Adds unit tests for the new pipeline-availability method.

Comment on lines 578 to +597
if (searchRepository.getVectorIndexService() == null) {
return retryInitAndReportError(
searchRepository, embeddingsValidation, description, configMessage);
}

try {
StepValidation embeddingResult =
validateEmbeddingGeneration(
searchRepository.getEmbeddingClient(),
embeddingsValidation,
description,
configMessage);

if (Boolean.FALSE.equals(embeddingResult.getPassed())) {
return embeddingResult;
}

return validateHybridSearchPipeline(
searchRepository, embeddingResult, description, configMessage);
} catch (Exception e) {
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The embeddings health-check logic now includes additional branches (retrying initializeVectorSearchService(), differentiating embedding-client vs vector-service initialization, and validating hybrid-rrf pipeline presence). There don’t appear to be unit tests covering these new paths; adding tests for (1) pipeline exists, (2) pipeline missing (404), and (3) pipeline check failure (exception/401/5xx) would help prevent regressions in health-check reporting.

Copilot uses AI. Check for mistakes.
- Fix bug: retry path now validates embedding generation before checking
  the hybrid pipeline (gitar)
- Return Optional<String> from checkHybridSearchPipeline instead of
  boolean, differentiating 404 (missing) from 5xx/connectivity errors
  with specific messages (copilot)
- Include initialization exception message in StepValidation so
  operators don't need to hunt logs (copilot)
- Return Optional.empty() for non-OpenSearch implementations since
  hybrid pipeline is OpenSearch-specific (gitar)
- Add test for 5xx response case in pipeline check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
initializeVectorSearchService() catches exceptions internally and never
rethrows, so the try-catch in retryInitAndReportError never captured the
error. Store the exception message in vectorServiceInitError field and
read it in the validation to include in the health check message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 1, 2026 15:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

@Getter private EmbeddingClient embeddingClient;
@Getter private VectorIndexService vectorIndexService;
@Getter private VectorEmbeddingHandler vectorEmbeddingHandler;
@Getter private String vectorServiceInitError;
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vectorServiceInitError is written inside a synchronized method (initializeVectorSearchService) but read via an unsynchronized getter. Without volatile (or synchronized access), other threads may not observe the updated value, which can lead to stale/missing error details in validations/logs. Consider making this field volatile (and optionally clearing it on successful init to avoid stale errors).

Suggested change
@Getter private String vectorServiceInitError;
@Getter private volatile String vectorServiceInitError;

Copilot uses AI. Check for mistakes.
StepValidation embeddingsValidation,
String description,
String configMessage) {
searchRepository.initializeVectorSearchService();
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryInitAndReportError() triggers initializeVectorSearchService() as part of the system validation endpoint. When initialization keeps failing (e.g., misconfigured embedding provider), repeated calls to /system/validate will keep retrying expensive init logic and may also perform side effects (e.g., pipeline creation/handler registration attempts), potentially spamming logs and impacting the embedding provider/OpenSearch. Consider throttling retries (e.g., retry once per process lifetime or with a backoff timestamp) and reporting the last captured init error instead of re-initializing on every validation call.

Suggested change
searchRepository.initializeVectorSearchService();
// Avoid repeatedly triggering expensive vector search initialization from the
// validation endpoint when we already know initialization has failed.
String previousInitError = searchRepository.getVectorServiceInitError();
if (previousInitError == null) {
// Either initialization has not been attempted yet or it has not recorded an error,
// so we allow a single retry attempt.
searchRepository.initializeVectorSearchService();
} else {
LOG.debug(
"Skipping vector search service initialization retry during system validation "
+ "because a previous attempt failed with error: {}",
previousInitError);
}

Copilot uses AI. Check for mistakes.
Comment on lines +490 to +495
public Optional<String> checkHybridSearchPipeline() {
if (vectorIndexService instanceof OpenSearchVectorService openSearchVectorService) {
return openSearchVectorService.checkHybridSearchPipeline();
}
return Optional.empty();
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description mentions adding isHybridSearchPipelineAvailable() on OpenSearchVectorService/SearchRepository, but the implementation adds checkHybridSearchPipeline() returning Optional<String> instead. To avoid confusion for reviewers and future maintainers, please either update the PR description (and any referenced docs/issues) or rename/adjust the method to match the intended API contract.

Copilot uses AI. Check for mistakes.
Cover the new validation branches: pipeline available, pipeline missing
(404), pipeline 5xx, embedding client init failure with error message,
vector service init failure with embedding client OK, retry with no
recovery, and Elasticsearch not supported.

Make getEmbeddingsValidation package-private for testability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents stale error state after a successful retry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 1, 2026 15:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment on lines +563 to 564
StepValidation getEmbeddingsValidation(OpenMetadataApplicationConfig applicationConfig) {
StepValidation embeddingsValidation = new StepValidation();
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getEmbeddingsValidation was changed from private to package-private to enable tests. To avoid widening SystemRepository's surface area, consider keeping it private and testing via the public validateSystem() path, or annotate the method with @VisibleForTesting to document the intent.

Copilot uses AI. Check for mistakes.
cfg.getNaturalLanguageSearch().getEmbeddingProvider(),
embeddingClient.getDimension());
} catch (Exception e) {
this.vectorServiceInitError = e.getMessage();
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vectorServiceInitError is populated with e.getMessage(), which can be null for some exceptions, resulting in unhelpful validation output (e.g., "Error: null."). Consider storing e.toString() (or message + exception class) so the health check always has actionable details.

Suggested change
this.vectorServiceInitError = e.getMessage();
this.vectorServiceInitError = e.toString();

Copilot uses AI. Check for mistakes.
Comment on lines +135 to +138
} catch (Exception e) {
LOG.error("Failed to check hybrid search pipeline '{}'", HYBRID_PIPELINE_NAME, e);
return Optional.of("Failed to check hybrid search pipeline: " + e.getMessage());
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error returned on exception uses e.getMessage(), which can be null and produce an unhelpful message. Consider including e.toString() (or exception class + message) so the pipeline check failure is always informative.

Copilot uses AI. Check for mistakes.
when(appConfig.getElasticSearchConfiguration()).thenReturn(esConfig);
when(esConfig.getNaturalLanguageSearch()).thenReturn(nlpConfig);
when(nlpConfig.getEmbeddingProvider()).thenReturn("bedrock");
when(nlpConfig.getBedrock()).thenReturn(null);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test sets nlpConfig.getBedrock() to null while embeddingProvider is "bedrock". This triggers an exception path in getEmbeddingConfigurationMessage() (and an error-level log) on every test run, which can add noise and reduce test signal. Prefer mocking a minimal Bedrock config object (or using a provider whose config is stubbed) to keep tests clean and representative.

Suggested change
when(nlpConfig.getBedrock()).thenReturn(null);
NaturalLanguageSearchConfiguration.BedrockConfiguration bedrockConfig =
mock(NaturalLanguageSearchConfiguration.BedrockConfiguration.class);
when(nlpConfig.getBedrock()).thenReturn(bedrockConfig);

Copilot uses AI. Check for mistakes.
- Add volatile to vectorServiceInitError for thread-safe reads from
  the unsynchronized getter
- Skip calling initializeVectorSearchService() during validation when
  a previous init error is already recorded, avoiding expensive retries
  and log spam on every /system/validate call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 1, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Adds hybrid-rrf pipeline check and improves embedding error reporting in status validation by storing vector service init errors and clearing them on successful retry, with unit tests covering the validation paths. All findings resolved including retry path validation and cross-backend compatibility.

✅ 4 resolved
Bug: Retry path skips embedding generation validation

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:617-622
When the vector service is initially null and retryInitAndReportError successfully re-initializes it (line 617), the code jumps directly to validateHybridSearchPipeline without first calling validateEmbeddingGeneration. The happy path (lines 583-596) validates embedding generation first (dimension mismatch, all-zeros, null checks) before checking the hybrid pipeline. This means a misconfigured embedding client (e.g., wrong model dimensions) could pass validation after a retry, reporting "Embeddings and hybrid search pipeline are working correctly" when embeddings are actually broken.

Quality: isHybridSearchPipelineAvailable returns false for non-OpenSearch

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:487-492
In SearchRepository.isHybridSearchPipelineAvailable(), the method returns false when vectorIndexService is not an OpenSearchVectorService instance. This means the health check will always report the hybrid pipeline as missing for any future non-OpenSearch vector service implementations, causing a false-negative validation failure. Consider returning true (not applicable) for non-OpenSearch implementations, or making isHybridSearchPipelineAvailable() part of the VectorIndexService interface with a default true implementation.

Quality: Success message mentions hybrid pipeline for non-OpenSearch backends

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:490-495 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:676-681
SearchRepository.checkHybridSearchPipeline() returns Optional.empty() when the vector service is not OpenSearch, which causes validateHybridSearchPipeline to report "Embeddings and hybrid search pipeline are working correctly" even for backends that don't have such a pipeline. This is harmless today since only OpenSearch is supported, but could mislead operators if other backends are added.

Bug: vectorServiceInitError not cleared on successful retry

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:439 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:448
When initializeVectorSearchService() is called and fails, vectorServiceInitError is set to the exception message. On a subsequent retry that succeeds, vectorServiceInitialized is set to true but vectorServiceInitError is never cleared back to null. While this doesn't cause a bug in the current code paths (because the error is only read when getVectorIndexService() == null), it's a latent defect: any future code that reads vectorServiceInitError after a successful recovery will see a stale error message, which could lead to confusing diagnostics.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 1, 2026

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

🟡 Playwright Results — all passed (24 flaky)

✅ 3443 passed · ❌ 0 failed · 🟡 24 flaky · ⏭️ 223 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 453 0 2 2
🟡 Shard 2 616 0 3 32
🟡 Shard 3 616 0 4 27
🟡 Shard 4 616 0 8 47
🟡 Shard 5 585 0 2 67
🟡 Shard 6 557 0 5 48
🟡 24 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Pages/SearchIndexApplication.spec.ts › Search Index Application (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Bulk selection operations (shard 2, 1 retry)
  • Features/DataQuality/TableLevelTests.spec.ts › Table Difference (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Flow/ExploreDiscovery.spec.ts › Should display deleted assets when showDeleted is checked and deleted is not present in queryFilter (shard 3, 1 retry)
  • Flow/ServiceForm.spec.ts › Verify form selects are working properly (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Chart (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for File (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Owner Rule Is_Set (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate UpdatedOn Rule Greater Than Equal (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with subdomains attached verifies subdomain accessibility (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with data products attached at domain and subdomain levels (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Multiple consecutive domain renames preserve all associations (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Glossary Term Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should perform CRUD and Removal operations for table (shard 5, 1 retry)
  • Pages/InputOutputPorts.spec.ts › Input port button visible, output port button hidden when no assets (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Directory (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants