SOLR-18255 Jan's initial port from OSB to solr-benchmark (6 logical commits) #3
epugh
left a comment
I poked around, and other than noticing some solr-benchmark where I expected solr-orbit, this looks good. Maybe in a future PR we fix the directory names?
Haha :)
The repo has not been re-branded yet, so that is expected. Since things are still in flux and more steps are needed, I'll do CTR for this PR and merge the 6 commits as is. Normally I'd leave it open for 3-4 days to allow reviews, but I believe in this early stage, as you say Eric, it is acceptable to focus on progress as long as we follow best practices and others can review after the fact.
Replace OpenSearch-specific project governance with ASF-compatible equivalents: add NOTICE and create-notice.sh for ASF IP compliance, bump version to 0.1.0 to signal the fresh start, update CONTRIBUTING.md for Solr context, remove MAINTAINERS/RELEASE/TRIAGE files that don't apply to an ASF incubating project (those processes are defined by the ASF), drop .whitesource/.fossa.yml OSS-scanning configs that were tied to the OpenSearch project infrastructure. Part of apache#3
Adapt CI/CD and project metadata for the Solr port:
- Remove workflows that depended on OpenSearch infrastructure (backport, add-untriaged, integ-test, publish-release, docker-push-release); these will be rebuilt once the project has its own ASF infrastructure
- Add docs.yml workflow (commented out pending docs host decision)
- Simplify unit-test and docker-build workflows to remove OpenSearch-specific steps
- Update .ci/build.sh and check_deprecated_terms.py for Solr naming
- Remove CODEOWNERS and issue templates tied to the old team structure
- Add AGENTS.md: guidance for AI coding assistants working in this repo
- Refresh Makefile, tox.ini, .pylintrc, .gitignore for the new project shape
Part of apache#3
Complete documentation overhaul for the Solr port:
- Replace OpenSearch-focused Jekyll docs site with Solr-specific content covering installation, configuration, running benchmarks, workload creation, and the new OpenSearch→Solr converter
- Remove legacy docs/api/ (OpenSearch API reference) and docs/user-guides/ in favor of the new docs/user-guide/ and docs/reference/ structure
- Update README.md, DEVELOPER_GUIDE.md, PYTHON_SUPPORT_GUIDE.md, CREATE_WORKLOAD_GUIDE.md to use Solr terminology and solr-benchmark CLI
- Add TODO.md: incubation checklist and known remaining work
- Add it/README.md: integration test setup instructions
- Remove opensearch_benchmark.png splash image
Part of apache#3
Clean out everything that has no place in a Solr benchmark tool:
**Kafka / async HTTP / gRPC:**
- Remove kafka_client.py (Kafka producer for OpenSearch metrics streaming)
- Remove async_connection.py (OpenSearch async HTTP connection layer)
- Remove worker_coordinator/proto_helpers/ (gRPC bulk/query helpers)
- Remove osbenchmark/data_streaming/ package (Kafka data pipeline)
- Remove all corresponding unit tests
**Bundled binaries:**
- Remove osbenchmark/decompressors/pbzip2-{Darwin,Linux}-{arm64,x86_64,aarch64}
- Remove scripts/pbzip2
These binaries are not redistributable in an ASF project; decompression will
use the system pbzip2 or fall back to Python's bz2 module.
**OpenSearch-specific infrastructure:**
- Remove scripts/terraform/ (Terraform cluster provisioning for OpenSearch on AWS)
- Remove samples/ccr/ (OpenSearch cross-cluster replication sample)
- Remove tests for all of the above
Part of apache#3
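For reviewers, the pbzip2-or-bz2 fallback described in this commit could look roughly like the sketch below. This is only an illustration under assumptions: `decompress_bz2` is a hypothetical helper name, not the function in this repo, and the real code may choose flags or error handling differently.

```python
import bz2
import shutil
import subprocess


def decompress_bz2(src_path: str, dest_path: str) -> str:
    """Decompress src_path into dest_path and return the method that was used.

    Hypothetical helper: prefers a system-installed pbzip2 and otherwise
    falls back to Python's bz2 module.
    """
    pbzip2 = shutil.which("pbzip2")
    if pbzip2 is not None:
        # -d: decompress, -c: write to stdout (the input file is left in place)
        with open(dest_path, "wb") as out:
            result = subprocess.run([pbzip2, "-d", "-c", src_path], stdout=out, check=False)
        if result.returncode == 0:
            return "pbzip2"
    # Fallback: stream through the standard-library bz2 module
    with bz2.open(src_path, "rb") as src, open(dest_path, "wb") as out:
        shutil.copyfileobj(src, out)
    return "bz2"
```

The return value only makes it easy to log which path was taken; the decompressed bytes should be the same either way.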
All files in this commit are net-new; no existing code is modified.
**osbenchmark/conversion/ (OpenSearch→Solr workload converter):**
- detector.py: identify whether a workload targets OpenSearch or Solr
- field.py: field name normalization rules
- schema.py: translate OpenSearch index mappings to Solr schema XML
- query.py: translate OpenSearch Query DSL operations to Solr query syntax
- workload_converter.py: orchestrate full workload directory conversion
Tests: tests/unit/solr/conversion/ and tests/unit/solr/test_workload_converter.py
**osbenchmark/builder/solr_provisioner.py:**
Provision and configure a Solr cluster (collection creation, configset upload, schema application) as a drop-in replacement for the OpenSearch provisioner.
Test: tests/unit/solr/test_provisioner.py
**osbenchmark/builder/installers/preparers/solr_preparer.py:**
Prepare a Solr node installation (derived from opensearch_preparer.py, adapted for Solr directory layout and startup options).
Test: tests/builder/installers/preparers/solr_preparer_test.py
**osbenchmark/result_writer.py:**
Write benchmark results to filesystem in JSON/CSV formats for Solr runs.
Test: tests/unit/solr/test_result_writer.py
**solrbenchmark/ package:**
Thin top-level package and entry point (solr-benchmark CLI) that will replace opensearch-benchmark once the project is accepted into ASF.
**tests/unit/solr/:**
Full unit test suite for all new Solr modules.
tests/unit/test_telemetry.py: new telemetry test replacing the old telemetry_test.py.
None of these modules are wired into the main CLI yet; that happens in the next commit.
Part of apache#3
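To give a feel for what "translate OpenSearch index mappings to Solr schema XML" involves, here is a minimal sketch. The type table and the `mappings_to_fields` name are assumptions made for illustration (the Solr type names are the ones shipped in Solr's `_default` configset); the actual rules live in schema.py and may differ, including how unknown types are handled.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumed mapping from OpenSearch mapping types to Solr field types.
# Illustrative only; not copied from schema.py.
TYPE_MAP: Dict[str, str] = {
    "text": "text_general",
    "keyword": "string",
    "integer": "pint",
    "long": "plong",
    "float": "pfloat",
    "double": "pdouble",
    "date": "pdate",
    "boolean": "boolean",
}


def mappings_to_fields(mappings: dict) -> List[str]:
    """Return one Solr <field/> element (as a string) per mapped property."""
    fields = []
    for name, spec in mappings.get("properties", {}).items():
        # Unknown types fall back to "string" here; that choice is an assumption.
        solr_type = TYPE_MAP.get(spec.get("type"), "string")
        element = ET.Element(
            "field",
            {"name": name, "type": solr_type, "indexed": "true", "stored": "true"},
        )
        fields.append(ET.tostring(element, encoding="unicode"))
    return fields


if __name__ == "__main__":
    example = {"properties": {"title": {"type": "text"}, "views": {"type": "long"}}}
    for line in mappings_to_fields(example):
        print(line)
```

Nested objects, multi-fields and analyzers are deliberately left out; a real converter has to decide what to do with each of those.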
This is the main functional change of the Solr port, touching all layers of the benchmark tool. This commit wires the previous five together.
**setup.py / entry points:**
- Remove opensearch-py, opensearch-protobufs, aiokafka dependencies
- Add requests for Solr HTTP communication
- Rename entry points: opensearch-benchmark→solr-benchmark, osb→sb (pending ASF acceptance)
**osbenchmark/client.py (complete rewrite):**
Replace the opensearch-py async client with a synchronous requests-based Solr HTTP client. Supports collection management, document indexing, and query execution against a Solr cluster.
**osbenchmark/telemetry.py:**
Replace OpenSearch-specific telemetry devices (JVM heap, GC, hot threads, etc.) with Solr equivalents using the Solr Metrics API and node stats endpoints.
**osbenchmark/worker_coordinator/runner.py:**
Adapt operation runners for Solr: bulk indexing via /update, queries via /select, collection admin operations. Remove OpenSearch-specific operations (snapshot, shrink, force-merge semantics, etc.).
**osbenchmark/builder/ (OSB→Solr naming cleanup):**
- Rename opensearch_distribution_downloader.py → distribution_downloader.py
- Rename opensearch_source_downloader.py → source_downloader.py
- Rename opensearch_distribution_repository_provider.py → distribution_repository_provider.py
- Delete opensearch_preparer.py (replaced by solr_preparer.py in PR 5)
- Delete core_plugin_source_downloader.py, external_plugin_source_downloader.py, plugin_distribution_downloader.py (OSB plugin infrastructure, not needed for Solr)
- Update builder.py, provisioner.py, supplier.py to use the new Solr provisioner
**osbenchmark/config.py, context.py, benchmark.py, benchmarkd.py:**
Update configuration keys, context variables, and CLI help text for Solr. Remove OpenSearch-specific commands and flags; add Solr cluster URL handling.
**osbenchmark/workload/ and osbenchmark/workload_generator/:**
Adapt workload loading and workload generation for Solr collection schema.
**osbenchmark/metrics.py, publisher.py:**
Update metric names and summary report labels from OpenSearch to Solr terminology.
**osbenchmark/resources/ (cluster config cleanup):**
Remove resources/cluster_configs/1.0/ entirely (OpenSearch 1.x configs). Simplify resources/cluster_configs/main/ to Solr-relevant entries. Update benchmark.ini default configuration.
**Integration and unit tests:**
Update all existing tests to match new API shapes. Delete tests for removed functionality (telemetry_test.py replaced by tests/unit/test_telemetry.py, workload_generator corpus/index tests removed as workload_generator was refactored).
Part of apache#3
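As a rough picture of the two endpoints the runners talk to, the sketch below builds a `/select` URL and a JSON `/update` request. It deliberately uses only the standard library so it runs without a Solr node or the requests package; the function names and the `http://localhost:8983` base URL are assumptions for illustration and are not the API of the new client.py.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def select_url(base_url: str, collection: str, q: str, rows: int = 10) -> str:
    """Build a GET URL for Solr's /select handler (hypothetical helper)."""
    params = urlencode({"q": q, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{collection}/select?{params}"


def update_request(
    base_url: str, collection: str, docs: List[dict], commit: bool = False
) -> Tuple[str, Dict[str, str], bytes]:
    """Build (url, headers, body) for a JSON bulk POST to Solr's /update handler."""
    url = f"{base_url.rstrip('/')}/solr/{collection}/update"
    if commit:
        url += "?commit=true"
    headers = {"Content-Type": "application/json"}
    # Solr accepts a JSON array of documents on /update
    body = json.dumps(docs).encode("utf-8")
    return url, headers, body


if __name__ == "__main__":
    print(select_url("http://localhost:8983", "bench", "title:solr", rows=5))
    url, headers, body = update_request("http://localhost:8983", "bench", [{"id": "1"}], commit=True)
    print(url, headers, len(body))
```

With requests installed, the tuple from `update_request` could be passed on as `requests.post(url, headers=headers, data=body)`.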
https://issues.apache.org/jira/browse/SOLR-18255
This PR contains the initial port of OpenSearch Benchmark (OSB) to work with Apache Solr. The fork point from OSB is tagged `osb_fork_point` (OSB commit `92982c56`).

The codebase retains the OSB Python package name (`osbenchmark`) and directory structure for now; known work to do is tracked in `TODO.md` and will likely be converted into JIRA tasks.

**How to review**

The PR is structured as 6 commits in logical progression order. Each commit is independently coherent and reviewable in isolation. The recommended approach is to review one commit at a time using GitHub's commit view or `git log -p`. The final commit is the largest, but by that point the project shape is established and the changes read more clearly in context.

**Summary of major changes**

**1. Solr-native client (`osbenchmark/client.py`)**

The OpenSearch Python client (`opensearch-py`) has been replaced with a purpose-built `SolrAdminClient` and `SolrClient` that communicate with Solr over HTTP using `requests`/`pysolr`. All collection management, document indexing, and query execution now go through Solr's REST API (Collections API, `/select`, `/update`, etc.).

**2. Solr provisioner (`osbenchmark/builder/solr_provisioner.py`)**

A new `SolrProvisioner` replaces the OpenSearch node provisioning machinery. It supports three deployment modes:

- `from-distribution`: downloads a released Solr binary from `downloads.apache.org` or the ASF archive (including pre-9.0 paths).
- `from-sources`: builds Solr from a local checkout with Gradle.
- `docker`: pulls and starts the official Solr Docker image, including nightly builds.

`SolrDockerLauncher` handles container lifecycle. Version-aware logic handles the API differences between Solr 9.x and 10.x (e.g. collection creation flags).

**3. Solr-specific telemetry devices (`osbenchmark/telemetry.py`)**

Six new `SolrTelemetryDevice` subclasses collect Solr-specific metrics during a run: `SolrJvmStats`, `SolrNodeStats`, `SolrCollectionStats`, `SolrQueryStats`, `SolrIndexingStats`, `SolrCacheStats`. These poll the Solr Metrics API and write results via the existing `ResultWriter` pipeline.

**4. Solr runner operations (`osbenchmark/worker_coordinator/runner.py`)**

56 OpenSearch-specific runner classes have been removed (KNN, ML connectors, vector datasets, data streams, index templates, pipelines, etc.). In their place, Solr-specific runners have been added under `SolrRunner`: `SolrBulkIndex`, `SolrSearch`, `SolrPaginatedSearch`, `SolrCommit`, `SolrOptimize`, `SolrWaitForMerges`, `SolrCreateCollection`, `SolrDeleteCollection`.

**5. Workload model: index → collection (`osbenchmark/workload/`)**

The workload domain model has been updated throughout:

- `Index` / `DataStream` / `IndexTemplate` → `Collection`
- `IndexTemplate`, `ComponentTemplate`, `DataStream` and serverless/vector-related types removed
- `CreateCollectionParamSource` / `DeleteCollectionParamSource` / `SolrSearchParamSource`

**6. OSB-to-Solr workload converter (`osbenchmark/conversion/`)**

A new converter pipeline (`workload_converter.py`, `detector.py`, `query.py`, `schema.py`, `field.py`) translates an OpenSearch Benchmark workload into Solr format:

- `bulk` → `bulk-index`, `force-merge` → `optimize`, index mappings → Solr configsets
- `solrconfig.xml` / `managed-schema.xml` configset skeleton
- `solr-benchmark convert-workload`; see `docs/converter/` for details

**7. Metrics store simplified (`osbenchmark/metrics.py`)**

`OsMetricsStore`, `OsTestRunStore`, `OsResultsStore`, and `IndexTemplateProvider` (all backed by OpenSearch) have been removed. The single supported store is now `FilesystemMetricsStore` (JSON + CSV + SQLite on local disk), accessed via `LocalFilesystemResultWriter`.

**8. Documentation site (`docs/`)**

A full user-facing documentation site is included, built with Jekyll + just-the-docs. Key sections: `user-guide/` (install, configure, workload authoring), `reference/` (telemetry, metrics, workload schema, commands), `converter/` (OSB migration guide), `cluster-config/`. Deployed to GitHub Pages via `.github/workflows/docs.yml`. See `docs/README.md` for local build instructions.

**9. ASF licence headers and housekeeping**

- `pbzip2` binaries removed; `pbzip2` is now an optional system prerequisite.
- `CONTRIBUTING.md`, `DEVELOPER_GUIDE.md`, `README.md` rewritten for the Solr/ASF context.
- `TODO.md` tracks remaining incubation steps (package rename, CI, release process, etc.).

The changes are described by the 9 functional areas above regardless of which commit they land in. The 6-commit structure exists purely to aid review; it does not reflect the order in which the work was done.
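
As a small illustration of the operation renames listed under the converter (bulk → bulk-index, force-merge → optimize), the following sketch rewrites the operation type of each entry in a workload's operations list. It assumes operations are dictionaries keyed by `operation-type`, as in OSB workload files; `convert_operations` and the handling of unmapped types are hypothetical and not taken from `workload_converter.py`.

```python
import copy
from typing import List, Optional, Tuple

# The two renames named in the PR description; the real converter may cover more.
OPERATION_TYPE_MAP = {
    "bulk": "bulk-index",
    "force-merge": "optimize",
}


def convert_operations(operations: List[dict]) -> Tuple[List[dict], List[Optional[str]]]:
    """Return (converted operations, operation types that had no mapping)."""
    converted = []
    unmapped = []
    for op in operations:
        new_op = copy.deepcopy(op)  # leave the input workload untouched
        op_type = op.get("operation-type")
        if op_type in OPERATION_TYPE_MAP:
            new_op["operation-type"] = OPERATION_TYPE_MAP[op_type]
        else:
            # Assumption: unknown types are passed through and reported to the caller
            unmapped.append(op_type)
        converted.append(new_op)
    return converted, unmapped


if __name__ == "__main__":
    ops = [
        {"name": "index-append", "operation-type": "bulk", "bulk-size": 5000},
        {"name": "merge", "operation-type": "force-merge"},
        {"name": "term-query", "operation-type": "search"},
    ]
    print(convert_operations(ops))
```

Returning the unmapped types separately is one way to let the caller warn about operations that still need manual attention after conversion.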