SOLR-18255 Jans initial port from OSB to solr-benchmark (6 logical commits) #3

Merged: janhoy merged 6 commits into apache:main from janhoy:port/apache-solr-benchmark on May 22, 2026

@janhoy (Contributor) commented May 21, 2026

https://issues.apache.org/jira/browse/SOLR-18255

This PR contains the initial port of OpenSearch Benchmark (OSB) to work with Apache Solr. The fork point from OSB is tagged osb_fork_point (OSB commit 92982c56).

The codebase retains the OSB Python package name (osbenchmark) and directory structure for now; known work to do is tracked in TODO.md and will likely be converted into JIRA tasks.

How to review

The PR is structured as 6 commits in logical progression order. Each commit is independently coherent and reviewable in isolation. The recommended approach is to review one commit at a time using GitHub's commit view or git log -p. The final commit is the largest, but by that point the project shape is established and the changes read more clearly in context.

| # | Commit | Files | What to focus on |
|---|--------|-------|------------------|
| 1 | Establish ASF legal and governance files | 12 | NOTICE attribution, license header format, CONTRIBUTING accuracy |
| 2 | Update GitHub/CI infrastructure | 20 | Workflow correctness, removed vs. kept actions |
| 3 | Rewrite documentation | 84 | Install steps, CLI examples, converter docs accuracy |
| 4 | Remove OSB-specific dead code and binaries | 41 | Verify nothing Solr-relevant was swept up |
| 5 | Add new Solr-specific modules | 25 | Conversion logic (schema.py, query.py), provisioner correctness |
| 6 | Port core benchmark framework | 195 | client.py, telemetry.py, runner.py (see functional notes below) |

Summary of major changes

1. Solr-native client (osbenchmark/client.py)

The OpenSearch Python client (opensearch-py) has been replaced with a purpose-built SolrAdminClient and SolrClient that communicate with Solr over HTTP using requests/pysolr. All collection management, document indexing, and query execution now go through Solr's REST API (Collections API, /select, /update, etc.).
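To make the endpoint layout concrete, here is a minimal sketch of how request URLs for `/select` and the Collections API could be assembled. It is not the SolrClient or SolrAdminClient code from this PR: the helper names `select_url` and `collections_api_url` are hypothetical, and the sketch uses only the standard library so it runs without a Solr instance.

```python
from urllib.parse import quote, urlencode


def select_url(base_url: str, collection: str, params: dict) -> str:
    """Build a /select query URL for one collection (hypothetical helper)."""
    query = urlencode(params, doseq=True)
    return f"{base_url.rstrip('/')}/solr/{quote(collection)}/select?{query}"


def collections_api_url(base_url: str, action: str, **params) -> str:
    """Build a Collections API URL such as action=CREATE (hypothetical helper)."""
    query = urlencode({"action": action, **params})
    return f"{base_url.rstrip('/')}/solr/admin/collections?{query}"


if __name__ == "__main__":
    # The host, port, and collection name are illustrative assumptions.
    print(select_url("http://localhost:8983", "docs", {"q": "*:*", "rows": 10}))
    print(collections_api_url("http://localhost:8983", "CREATE", name="docs", numShards=1))
```

A real client would pass these URLs to an HTTP library and handle errors, timeouts, and authentication, none of which is shown here.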

2. Solr provisioner (osbenchmark/builder/solr_provisioner.py)

A new SolrProvisioner replaces the OpenSearch node provisioning machinery. It supports three deployment modes:

  • from-distribution — downloads a released Solr binary from downloads.apache.org or the ASF archive (including pre-9.0 paths).
  • from-sources — builds Solr from a local checkout with Gradle.
  • docker — pulls and starts the official Solr Docker image, including nightly builds.

SolrDockerLauncher handles container lifecycle. Version-aware logic handles the API differences between Solr 9.x and 10.x (e.g. collection creation flags).
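As an illustration of the from-distribution mode and its pre-9.0 handling, the sketch below picks a download URL based on the major version. The function name `distribution_url` is hypothetical, and the URL layouts are assumptions based on the public ASF mirror structure as I understand it (pre-9.0 releases archived under the Lucene tree); the actual provisioner may resolve paths differently.

```python
def distribution_url(version: str, archive: bool = False) -> str:
    """Return an assumed download URL for a Solr release tarball.

    Assumption: releases before 9.0 live under the Lucene tree on the ASF
    archive, while 9.x and later use the standalone solr/solr tree.
    """
    major = int(version.split(".")[0])
    filename = f"solr-{version}.tgz"
    if major < 9:
        # Old releases are only expected on the archive host.
        return f"https://archive.apache.org/dist/lucene/solr/{version}/{filename}"
    host = "https://archive.apache.org/dist" if archive else "https://downloads.apache.org"
    return f"{host}/solr/solr/{version}/{filename}"


if __name__ == "__main__":
    # Version numbers here are examples only.
    print(distribution_url("8.11.2"))
    print(distribution_url("9.4.0"))
    print(distribution_url("9.4.0", archive=True))
```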

3. Solr-specific telemetry devices (osbenchmark/telemetry.py)

Six new SolrTelemetryDevice subclasses collect Solr-specific metrics during a run: SolrJvmStats, SolrNodeStats, SolrCollectionStats, SolrQueryStats, SolrIndexingStats, SolrCacheStats. These poll the Solr Metrics API and write results via the existing ResultWriter pipeline.
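The polling step can be pictured with a small sketch that extracts selected values from a Metrics API response. The function `extract_metrics` is hypothetical and is not one of the telemetry devices listed above. The sample payload is an assumption modeled on the JSON shape returned by `/solr/admin/metrics` in Solr 9.x; other versions may return a different format.

```python
import json

# Assumed response shape, trimmed to a few JVM metrics for illustration.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0},
  "metrics": {
    "solr.jvm": {
      "memory.heap.used": 1048576,
      "memory.heap.max": 4194304,
      "threads.count": 42
    }
  }
}
"""


def extract_metrics(payload: dict, registry: str, wanted: list) -> dict:
    """Pick the wanted metric names out of one registry, skipping missing ones."""
    registry_metrics = payload.get("metrics", {}).get(registry, {})
    return {name: registry_metrics[name] for name in wanted if name in registry_metrics}


if __name__ == "__main__":
    payload = json.loads(SAMPLE_RESPONSE)
    print(extract_metrics(payload, "solr.jvm", ["memory.heap.used", "threads.count"]))
```

Skipping missing names rather than raising keeps a telemetry sample usable when a metric is absent on a given node or version.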

4. Solr runner operations (osbenchmark/worker_coordinator/runner.py)

56 OpenSearch-specific runner classes have been removed (KNN, ML connectors, vector datasets, data streams, index templates, pipelines, etc.). In their place, Solr-specific runners have been added under SolrRunner: SolrBulkIndex, SolrSearch, SolrPaginatedSearch, SolrCommit, SolrOptimize, SolrWaitForMerges, SolrCreateCollection, SolrDeleteCollection.
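For readers unfamiliar with deep paging in Solr, the following sketch shows the cursorMark loop that a paginated search operation typically relies on. It is not the SolrPaginatedSearch runner from this PR; `paginate` and `make_fake_solr` are hypothetical names, and the fake backend stands in for a real `/select` call so the example runs offline.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[dict], dict], params: dict) -> Iterator[dict]:
    """Yield every document by following nextCursorMark until it stops changing."""
    cursor = "*"  # Solr's documented starting cursor
    while True:
        response = fetch_page({**params, "cursorMark": cursor})
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # An unchanged cursor signals that the result set is exhausted.
            break
        cursor = next_cursor


def make_fake_solr(docs: list, rows: int) -> Callable[[dict], dict]:
    """In-memory stand-in for /select; the cursor is simply an offset string."""
    def fetch_page(params: dict) -> dict:
        start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
        page = docs[start:start + rows]
        next_cursor = str(start + len(page)) if page else params["cursorMark"]
        return {"response": {"docs": page}, "nextCursorMark": next_cursor}
    return fetch_page


if __name__ == "__main__":
    fake = make_fake_solr([{"id": str(i)} for i in range(5)], rows=2)
    # A real request must sort on the uniqueKey field for cursorMark to work.
    print([doc["id"] for doc in paginate(fake, {"q": "*:*", "sort": "id asc"})])
```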

5. Workload model: index → collection (osbenchmark/workload/)

The workload domain model has been updated throughout:

  • Index / DataStream / IndexTemplate → Collection
  • IndexTemplate, ComponentTemplate, DataStream and serverless/vector-related types removed
  • New CreateCollectionParamSource / DeleteCollectionParamSource / SolrSearchParamSource
  • OpenSearch Query DSL validation removed; Solr query params used instead

6. OSB-to-Solr workload converter (osbenchmark/conversion/)

A new converter pipeline (workload_converter.py, detector.py, query.py, schema.py, field.py) translates an OpenSearch Benchmark workload into Solr format:

  • Detects OSB-specific operations and query DSL automatically
  • Translates bulk → bulk-index, force-merge → optimize, index mappings → Solr configsets
  • Generates a minimal solrconfig.xml / managed-schema.xml configset skeleton
  • Invoked via solr-benchmark convert-workload; see docs/converter/ for details
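The kind of translation the converter performs can be sketched as follows. This is a simplified illustration, not the code in query.py or workload_converter.py: the two operation renames (bulk to bulk-index, force-merge to optimize) come from the list above, while the function names and the handling of only `match_all` and `term` queries are assumptions made for brevity.

```python
# Operation renames named in the PR description.
OPERATION_RENAMES = {"bulk": "bulk-index", "force-merge": "optimize"}


def rename_operation(operation_type: str) -> str:
    """Map an OSB operation type to its Solr name, leaving others unchanged."""
    return OPERATION_RENAMES.get(operation_type, operation_type)


def translate_query(body: dict) -> dict:
    """Translate a tiny subset of OpenSearch Query DSL into Solr query params."""
    query = body.get("query", {"match_all": {}})
    if "match_all" in query:
        q = "*:*"
    elif "term" in query:
        field, value = next(iter(query["term"].items()))
        if isinstance(value, dict):
            # Long form: {"term": {"field": {"value": "x"}}}
            value = value["value"]
        q = f'{field}:"{value}"'
    else:
        raise NotImplementedError(f"unsupported query type: {sorted(query)}")
    params = {"q": q}
    if "size" in body:
        params["rows"] = body["size"]  # OpenSearch "size" corresponds to Solr "rows"
    return params


if __name__ == "__main__":
    print(rename_operation("force-merge"))
    print(translate_query({"query": {"term": {"status": "active"}}, "size": 5}))
```

Raising on unsupported query types, rather than guessing, makes gaps in coverage visible during conversion.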

7. Metrics store simplified (osbenchmark/metrics.py)

OsMetricsStore, OsTestRunStore, OsResultsStore, and IndexTemplateProvider (all backed by OpenSearch) have been removed. The single supported store is now FilesystemMetricsStore (JSON + CSV + SQLite on local disk), accessed via LocalFilesystemResultWriter.
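As a rough picture of what a local JSON + CSV + SQLite store involves, the sketch below writes the same result rows in all three formats. It does not reproduce FilesystemMetricsStore or LocalFilesystemResultWriter; the function names, file names, and the three-column record layout are all assumptions chosen for illustration.

```python
import csv
import json
import sqlite3
from pathlib import Path

FIELDS = ["name", "value", "unit"]  # assumed record layout


def write_results(results: list, out_dir: Path) -> None:
    """Write result rows as results.json, results.csv, and results.db."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "results.json").write_text(json.dumps(results, indent=2))
    with open(out_dir / "results.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(results)
    conn = sqlite3.connect(out_dir / "results.db")
    try:
        # Recreate the table so repeated runs do not accumulate duplicates.
        conn.execute("DROP TABLE IF EXISTS results")
        conn.execute("CREATE TABLE results (name TEXT, value REAL, unit TEXT)")
        conn.executemany("INSERT INTO results VALUES (:name, :value, :unit)", results)
        conn.commit()
    finally:
        conn.close()


def load_metric(out_dir: Path, name: str):
    """Read one metric back from SQLite as a (value, unit) tuple, or None."""
    conn = sqlite3.connect(out_dir / "results.db")
    try:
        return conn.execute(
            "SELECT value, unit FROM results WHERE name = ?", (name,)
        ).fetchone()
    finally:
        conn.close()
```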

8. Documentation site (docs/)

A full user-facing documentation site is included, built with Jekyll + just-the-docs. Key sections: user-guide/ (install, configure, workload authoring), reference/ (telemetry, metrics, workload schema, commands), converter/ (OSB migration guide), cluster-config/. Deployed to GitHub Pages via .github/workflows/docs.yml. See docs/README.md for local build instructions.

9. ASF licence headers and housekeeping

  • All modified files carry a two-line ASF modification notice above the original OpenSearch header.
  • OSB-specific GitHub workflows (release, backport, integ-test, PyPI publish) removed; a docs deploy workflow added.
  • Bundled pbzip2 binaries removed; pbzip2 is now an optional system prerequisite.
  • CONTRIBUTING.md, DEVELOPER_GUIDE.md, README.md rewritten for the Solr/ASF context.
  • TODO.md tracks remaining incubation steps (package rename, CI, release process, etc.).
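The optional-prerequisite approach for pbzip2 can be sketched like this: use the system binary when it is on the PATH, otherwise fall back to Python's bz2 module. The function name `decompress_bz2` is hypothetical and the actual decompression code in the repository may be structured differently.

```python
import bz2
import shutil
import subprocess


def decompress_bz2(path: str) -> bytes:
    """Decompress a .bz2 file, preferring a system pbzip2 when one is installed."""
    pbzip2 = shutil.which("pbzip2")
    if pbzip2:
        # -d decompress, -c write to stdout, -k keep the input file
        completed = subprocess.run(
            [pbzip2, "-d", "-c", "-k", path],
            check=True,
            capture_output=True,
        )
        return completed.stdout
    # Pure-Python fallback: single-threaded, but always available.
    with bz2.open(path, "rb") as handle:
        return handle.read()
```

Reading the whole file into memory keeps the sketch short; a corpus-sized file would call for streaming to disk instead.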

The changes are described by the 9 functional areas above regardless of which commit they land in. The 6-commit structure exists purely to aid review — it does not reflect the order in which the work was done.

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

@epugh epugh left a comment

I poked around, and other than noticing some solr-benchmark where I expected solr-orbit, this looks good. Maybe in a future PR we fix the directory names?

@janhoy (Contributor, Author) commented May 22, 2026

> Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Haha :)

> I poked around, and other than noticing some solr-benchmark where I expected solr-orbit, this looks good. Maybe in a future PR we fix the directory names?

Not yet re-branded this repo, so that is expected.

Since the project is still in flux and needs more steps, I'll do CTR (commit-then-review) for this PR and merge the 6 commits as is. Normally I'd leave it open for 3-4 days to allow reviews, but I believe that at this early stage, as you say Eric, it is acceptable to focus on progress as long as we follow best practices and others can review after the fact.

janhoy added 6 commits May 22, 2026 02:29
Replace OpenSearch-specific project governance with ASF-compatible equivalents:
add NOTICE and create-notice.sh for ASF IP compliance, bump version to 0.1.0
to signal the fresh start, update CONTRIBUTING.md for Solr context, remove
MAINTAINERS/RELEASE/TRIAGE files that don't apply to an ASF incubating project
(those processes are defined by the ASF), drop .whitesource/.fossa.yml OSS-scanning
configs that were tied to the OpenSearch project infrastructure.

Part of apache#3
Adapt CI/CD and project metadata for the Solr port:
- Remove workflows that depended on OpenSearch infrastructure (backport, add-untriaged,
  integ-test, publish-release, docker-push-release) — these will be rebuilt once the
  project has its own ASF infrastructure
- Add docs.yml workflow (commented out pending docs host decision)
- Simplify unit-test and docker-build workflows to remove OpenSearch-specific steps
- Update .ci/build.sh and check_deprecated_terms.py for Solr naming
- Remove CODEOWNERS and issue templates tied to the old team structure
- Add AGENTS.md: guidance for AI coding assistants working in this repo
- Refresh Makefile, tox.ini, .pylintrc, .gitignore for the new project shape

Part of apache#3
Complete documentation overhaul for the Solr port:
- Replace OpenSearch-focused Jekyll docs site with Solr-specific content covering
  installation, configuration, running benchmarks, workload creation, and the
  new OpenSearch→Solr converter
- Remove legacy docs/api/ (OpenSearch API reference) and docs/user-guides/
  in favor of the new docs/user-guide/ and docs/reference/ structure
- Update README.md, DEVELOPER_GUIDE.md, PYTHON_SUPPORT_GUIDE.md,
  CREATE_WORKLOAD_GUIDE.md to use Solr terminology and solr-benchmark CLI
- Add TODO.md: incubation checklist and known remaining work
- Add it/README.md: integration test setup instructions
- Remove opensearch_benchmark.png splash image

Part of apache#3
Clean out everything that has no place in a Solr benchmark tool:

**Kafka / async HTTP / gRPC:**
- Remove kafka_client.py (Kafka producer for OpenSearch metrics streaming)
- Remove async_connection.py (OpenSearch async HTTP connection layer)
- Remove worker_coordinator/proto_helpers/ (gRPC bulk/query helpers)
- Remove osbenchmark/data_streaming/ package (Kafka data pipeline)
- Remove all corresponding unit tests

**Bundled binaries:**
- Remove osbenchmark/decompressors/pbzip2-{Darwin,Linux}-{arm64,x86_64,aarch64}
- Remove scripts/pbzip2
  These binaries are not redistributable in an ASF project; decompression will
  use the system pbzip2 or fall back to Python's bz2 module.

**OpenSearch-specific infrastructure:**
- Remove scripts/terraform/ (Terraform cluster provisioning for OpenSearch on AWS)
- Remove samples/ccr/ (OpenSearch cross-cluster replication sample)
- Remove tests for all of the above

Part of apache#3
All files in this commit are net-new — no existing code is modified.

**osbenchmark/conversion/ — OpenSearch→Solr workload converter:**
- detector.py: identify whether a workload targets OpenSearch or Solr
- field.py: field name normalization rules
- schema.py: translate OpenSearch index mappings to Solr schema XML
- query.py: translate OpenSearch Query DSL operations to Solr query syntax
- workload_converter.py: orchestrate full workload directory conversion
Tests: tests/unit/solr/conversion/ and tests/unit/solr/test_workload_converter.py

**osbenchmark/builder/solr_provisioner.py:**
Provision and configure a Solr cluster (collection creation, configset upload,
schema application) as a drop-in replacement for the OpenSearch provisioner.
Test: tests/unit/solr/test_provisioner.py

**osbenchmark/builder/installers/preparers/solr_preparer.py:**
Prepare a Solr node installation (derived from opensearch_preparer.py, adapted
for Solr directory layout and startup options).
Test: tests/builder/installers/preparers/solr_preparer_test.py

**osbenchmark/result_writer.py:**
Write benchmark results to filesystem in JSON/CSV formats for Solr runs.
Test: tests/unit/solr/test_result_writer.py

**solrbenchmark/ package:**
Thin top-level package and entry point (solr-benchmark CLI) that will replace
opensearch-benchmark once the project is accepted into ASF.

**tests/unit/solr/:** Full unit test suite for all new Solr modules.
tests/unit/test_telemetry.py: new telemetry test replacing the old telemetry_test.py.

None of these modules are wired into the main CLI yet; that happens in the
next commit.

Part of apache#3
This is the main functional change of the Solr port, touching all layers of the
benchmark tool. This commit wires the previous five together.

**setup.py / entry points:**
- Remove opensearch-py, opensearch-protobufs, aiokafka dependencies
- Add requests for Solr HTTP communication
- Rename entry points: opensearch-benchmark→solr-benchmark, osb→sb (pending ASF acceptance)

**osbenchmark/client.py — complete rewrite:**
Replace the opensearch-py async client with a synchronous requests-based Solr
HTTP client. Supports collection management, document indexing, and query
execution against a Solr cluster.

**osbenchmark/telemetry.py:**
Replace OpenSearch-specific telemetry devices (JVM heap, GC, hot threads, etc.)
with Solr equivalents using the Solr Metrics API and node stats endpoints.

**osbenchmark/worker_coordinator/runner.py:**
Adapt operation runners for Solr: bulk indexing via /update, queries via /select,
collection admin operations. Remove OpenSearch-specific operations (snapshot,
shrink, force-merge semantics, etc.).

**osbenchmark/builder/ — OSB→Solr naming cleanup:**
- Rename opensearch_distribution_downloader.py → distribution_downloader.py
- Rename opensearch_source_downloader.py → source_downloader.py
- Rename opensearch_distribution_repository_provider.py → distribution_repository_provider.py
- Delete opensearch_preparer.py (replaced by solr_preparer.py in commit 5)
- Delete core_plugin_source_downloader.py, external_plugin_source_downloader.py,
  plugin_distribution_downloader.py (OSB plugin infrastructure, not needed for Solr)
- Update builder.py, provisioner.py, supplier.py to use the new Solr provisioner

**osbenchmark/config.py, context.py, benchmark.py, benchmarkd.py:**
Update configuration keys, context variables, and CLI help text for Solr.
Remove OpenSearch-specific commands and flags; add Solr cluster URL handling.

**osbenchmark/workload/ and osbenchmark/workload_generator/:**
Adapt workload loading and workload generation for Solr collection schema.

**osbenchmark/metrics.py, publisher.py:**
Update metric names and summary report labels from OpenSearch to Solr terminology.

**osbenchmark/resources/ — cluster config cleanup:**
Remove resources/cluster_configs/1.0/ entirely (OpenSearch 1.x configs).
Simplify resources/cluster_configs/main/ to Solr-relevant entries.
Update benchmark.ini default configuration.

**Integration and unit tests:**
Update all existing tests to match new API shapes. Delete tests for removed
functionality (telemetry_test.py replaced by tests/unit/test_telemetry.py,
workload_generator corpus/index tests removed as workload_generator was refactored).

Part of apache#3
@janhoy janhoy force-pushed the port/apache-solr-benchmark branch from 964651b to 5559b28 Compare May 22, 2026 00:34
@janhoy janhoy merged commit 57387ed into apache:main May 22, 2026
3 checks passed
@janhoy janhoy deleted the port/apache-solr-benchmark branch May 22, 2026 00:53

3 participants