Hybrid lexical + semantic search for MediaWiki, using OpenSearch's k-NN and ML Commons plugins. Designed to live alongside CirrusSearch — BM25 keyword retrieval is preserved unchanged; this extension adds sentence-transformer embeddings on top and rank-fuses the two signals.
Status: scaffold (v0.1.0). The extension.json ships; hooks and services land in subsequent phases per docs/plan.md.
- Hosts a multilingual sentence-transformer model inside your existing OpenSearch cluster.
- Maintains a parallel <wiki>_embeddings index per wiki with HNSW vector entries for each page (or chunk thereof).
- Re-embeds pages on edit via a deferred MediaWiki job, with no synchronous network calls during page save.
- Exposes Special:SemanticSearch and action=semanticsearch (read-only API), returning hybrid (BM25 + cosine k-NN) results merged via Reciprocal Rank Fusion.
- Falls back gracefully to BM25-only if the model is undeployed or OpenSearch is unreachable.
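The fusion and fallback behaviour listed above can be illustrated with a minimal Python sketch of weighted Reciprocal Rank Fusion. This is not the extension's code: the function name weighted_rrf, the sample page titles, the smoothing constant k = 60 (a commonly used value), and the way the two weights are applied are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(bm25_ids, vector_ids, bm25_weight=0.5, vector_weight=0.5, k=60):
    """Fuse two ranked lists of page ids with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (k + rank) per page, with rank starting at 1.
    The name, the default k, and the weighting scheme are illustrative assumptions.
    """
    scores = defaultdict(float)
    for weight, ranking in ((bm25_weight, bm25_ids), (vector_weight, vector_ids)):
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by page id for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical result lists from the two retrieval legs.
bm25 = ["Rome", "Vatican", "Assisi"]
vector = ["Vatican", "Lourdes", "Rome"]

fused = weighted_rrf(bm25, vector)
# With an empty vector leg (model undeployed, say) the BM25 order is unchanged.
bm25_only = weighted_rrf(bm25, [])
```

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalised against each other, and an empty vector list simply leaves the BM25 ordering in place.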
Originally written for the Seminaverbi wikifarm (a multilingual Wikibase repo run by the Catholic Digital Commons Foundation), but contains no project-specific code — every endpoint, model id, index name, and fusion weight is configurable.
- MediaWiki ≥ 1.43.
- An OpenSearch cluster with the opensearch-knn and opensearch-ml plugins enabled (default in OpenSearch 2.x). Compatible with OpenSearch 1.3+ if opensearch-ml is installed manually.
- One of the supported sentence-transformer models loaded into OpenSearch ML Commons (recommended: huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for multi-language wikis, 384-dim).
- Optional but strongly recommended: Extension:CirrusSearch for the BM25 leg of the hybrid query.
All settings start with $wgSemanticSearch…. See extension.json for the full list and defaults; the most important ones:
wfLoadExtension( 'SemanticSearch' );
$wgSemanticSearchEnabled = true;
$wgSemanticSearchOpenSearchUrl = 'http://127.0.0.1:9200';
$wgSemanticSearchModelId = '<your-deployed-ml-commons-model-id>';
$wgSemanticSearchEmbeddingsIndexPattern = '%s_embeddings'; // %s -> wiki id
$wgSemanticSearchEmbeddingDimension = 384;
$wgSemanticSearchNamespaces = [ NS_MAIN ];
$wgSemanticSearchHybridBm25Weight = 0.5;
$wgSemanticSearchHybridVectorWeight = 0.5;

In a wikifarm, the same extension.json can be loaded by every wiki and tuned per-wiki via $wgConf overrides.
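To show how the index pattern and embedding dimension might translate into an OpenSearch index, here is a hedged Python sketch that expands the %s pattern and builds a k-NN index body. The field names (page_id, embedding), the wiki id examplewiki, and the choice of the lucene engine (available in recent 2.x releases) are assumptions; the extension's actual mapping may differ.

```python
import json


def embeddings_index_name(wiki_id, pattern="%s_embeddings"):
    """Expand the index pattern; %s is replaced by the wiki id."""
    return pattern % wiki_id


def embeddings_index_body(dimension=384, vector_field="embedding"):
    """Build an index body with an HNSW cosine knn_vector field.

    Field names and engine are hypothetical, chosen only for illustration.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "page_id": {"type": "integer"},
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,  # must match the model output size
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",
                    },
                },
            }
        },
    }


name = embeddings_index_name("examplewiki")
body = embeddings_index_body()
print(f"PUT /{name}")
print(json.dumps(body, indent=2))
```

The dimension in the mapping has to agree with the deployed model (384 for the recommended MiniLM model), so switching to a model with a different output size means creating a new index and re-embedding.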
This extension grows OpenSearch storage and RAM usage. The companion cdcf-tools repo provides a cdcf_capacity_guard.sh helper that the maintenance scripts and job queue use to verify free disk / RAM / heap before each batch and to abort cleanly if thresholds are breached. See docs/capacity-baseline.md for the projected growth and reversal procedures.
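The check-before-each-batch pattern described above can be sketched as follows. This does not reproduce cdcf_capacity_guard.sh, whose interface and thresholds are not documented here; the function names, the 20% free-disk threshold, and the restriction to disk space (no RAM or heap check) are all assumptions.

```python
import shutil


def capacity_ok(free_bytes, total_bytes, min_free_fraction=0.20):
    """Return True if the free share of the volume meets the threshold."""
    return total_bytes > 0 and free_bytes / total_bytes >= min_free_fraction


def run_batches(batches, path="/", min_free_fraction=0.20):
    """Run callables one by one, re-checking free disk before each.

    Stops cleanly at the first failed check and returns how many batches ran.
    """
    done = 0
    for batch in batches:
        usage = shutil.disk_usage(path)
        if not capacity_ok(usage.free, usage.total, min_free_fraction):
            break  # abort before starting work that might fill the disk
        batch()
        done += 1
    return done
```

Checking before every batch, rather than once at start-up, means a long embedding run stops between batches instead of failing partway through a write.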
extension.json
includes/
  Hook/          MW hook handlers
  Job/           Deferred re-embedding jobs
  Embedding/     OpenSearch ML Commons client
  Search/        Hybrid query construction and rank fusion
  SpecialPage/   Special:SemanticSearch
  Api/           action=semanticsearch
maintenance/     BuildEmbeddings.php and other CLI scripts
i18n/            User-facing messages (en + qqq baseline; others welcome)
docs/            Plan, capacity baseline, runbook
tests/phpunit/   Unit and integration tests
Standard MediaWiki extension contributing flow:
git clone https://github.com/CatholicOS/mediawiki-extensions-SemanticSearch.git \
/path/to/mediawiki/extensions/SemanticSearch
composer install --working-dir=/path/to/mediawiki/extensions/SemanticSearch
php /path/to/mediawiki/maintenance/run.php update --quick

Linting follows the MediaWiki PHPCS standard.
composer test # phpcs + phan + phpunit
composer phpcs:fix # auto-fix where possible

GPL-2.0-or-later, matching MediaWiki upstream.
Maintained by the Catholic Digital Commons Foundation. External contributions welcome — see CONTRIBUTING.md.