SemanticSearch (MediaWiki extension)

Hybrid lexical + semantic search for MediaWiki, using OpenSearch's k-NN and ML Commons plugins. Designed to live alongside CirrusSearch — BM25 keyword retrieval is preserved unchanged; this extension adds sentence-transformer embeddings on top and rank-fuses the two signals.

Status: scaffold (v0.1.0). Only extension.json ships so far; hooks and services land in subsequent phases per docs/plan.md.

What it does (when complete)

  • Hosts a multilingual sentence-transformer model inside your existing OpenSearch cluster.
  • Maintains a parallel <wiki>_embeddings index per wiki with HNSW vector entries for each page (or chunk thereof).
  • Re-embeds pages on edit via a deferred MediaWiki job — no synchronous network calls during page save.
  • Exposes Special:SemanticSearch and action=semanticsearch (read-only API), returning hybrid (BM25 + cosine k-NN) results merged via Reciprocal Rank Fusion.
  • Falls back gracefully to BM25-only if the model is undeployed or OpenSearch is unreachable.
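
The rank-fusion and fallback behaviour described above can be illustrated with a small, language-neutral sketch. It is written in Python for brevity (the extension itself is PHP) and is not the extension's code, which has not landed yet. The function name `rrf_fuse`, the smoothing constant `k = 60` (a commonly used default for Reciprocal Rank Fusion), and the way the two weights are applied are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def rrf_fuse(
    bm25_ids: List[str],
    vector_ids: Optional[List[str]],
    bm25_weight: float = 0.5,
    vector_weight: float = 0.5,
    k: int = 60,  # assumed smoothing constant, not taken from the extension
) -> List[str]:
    """Merge two ranked lists of page ids with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (k + rank) to a page's score, where rank
    starts at 1. Pages found by only one leg still receive a score.
    """
    if vector_ids is None:
        # Vector leg unavailable (model undeployed, cluster unreachable):
        # return the BM25 ranking unchanged.
        return list(bm25_ids)

    scores: Dict[str, float] = {}
    for weight, ranking in ((bm25_weight, bm25_ids), (vector_weight, vector_ids)):
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + weight / (k + rank)

    # Highest score first; ties are broken by page id so the order is stable.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))
```

With equal weights, a page ranked well by both legs outranks a page that only one leg placed first, which is the main appeal of rank-based fusion: BM25 scores and cosine similarities never have to be put on a common scale.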

Originally written for the Seminaverbi wikifarm (a multilingual Wikibase repo run by the Catholic Digital Commons Foundation), but contains no project-specific code — every endpoint, model id, index name, and fusion weight is configurable.

Requirements

  • MediaWiki ≥ 1.43.
  • An OpenSearch cluster with the opensearch-knn and opensearch-ml plugins enabled (default in OpenSearch 2.x). Compatible with OpenSearch 1.3+ if opensearch-ml is installed manually.
  • One of the supported sentence-transformer models loaded into OpenSearch ML Commons (recommended: huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for multi-language wikis, 384-dim).
  • Optional but strongly recommended: Extension:CirrusSearch for the BM25 leg of the hybrid query.
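
One way to verify the plugin requirement before enabling the extension is to inspect the cluster's `_cat/plugins` output. The sketch below is a hedged illustration in Python, not part of this repository: the helper names are hypothetical, and it assumes the JSON form of `_cat/plugins` returns one row per node and plugin, with `name` (node) and `component` (plugin) fields.

```python
import json
import urllib.request
from typing import Dict, Iterable, List, Mapping

REQUIRED_PLUGINS = ("opensearch-knn", "opensearch-ml")


def fetch_plugin_rows(base_url: str) -> List[Mapping[str, str]]:
    """Fetch the node/plugin rows from an OpenSearch cluster."""
    url = f"{base_url}/_cat/plugins?format=json"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def missing_plugins(
    rows: Iterable[Mapping[str, str]],
    required: Iterable[str] = REQUIRED_PLUGINS,
) -> Dict[str, List[str]]:
    """Return {node name: [missing plugin, ...]} for nodes lacking a plugin.

    Caveat: a node with no plugins at all does not appear in the rows,
    so it cannot be reported here.
    """
    required = tuple(required)
    installed_by_node: Dict[str, set] = {}
    for row in rows:
        installed_by_node.setdefault(row["name"], set()).add(row["component"])

    report: Dict[str, List[str]] = {}
    for node, installed in installed_by_node.items():
        missing = [plugin for plugin in required if plugin not in installed]
        if missing:
            report[node] = missing
    return report
```

Calling `missing_plugins(fetch_plugin_rows('http://127.0.0.1:9200'))` should return an empty dict on a cluster that satisfies the requirement.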

Configuration

All settings start with $wgSemanticSearch…. See extension.json for the full list and defaults; the most important ones:

wfLoadExtension( 'SemanticSearch' );

$wgSemanticSearchEnabled                = true;
$wgSemanticSearchOpenSearchUrl          = 'http://127.0.0.1:9200';
$wgSemanticSearchModelId                = '<your-deployed-ml-commons-model-id>';
$wgSemanticSearchEmbeddingsIndexPattern = '%s_embeddings';   // %s -> wiki id
$wgSemanticSearchEmbeddingDimension     = 384;
$wgSemanticSearchNamespaces             = [ NS_MAIN ];
$wgSemanticSearchHybridBm25Weight       = 0.5;
$wgSemanticSearchHybridVectorWeight     = 0.5;

In a wikifarm, the same extension.json can be loaded by every wiki and tuned per-wiki via $wgConf overrides.
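
To make the index pattern and dimension settings concrete, the following Python sketch shows how they could translate into an index name, a k-NN index body, and a vector query. It is an illustration under stated assumptions, not the extension's implementation: the field names `page_id` and `embedding`, the `lucene` engine, and the helper names are hypothetical. The `lucene` engine is not available on older OpenSearch releases, where a different engine would be needed.

```python
from typing import Any, Dict, List


def embeddings_index_name(pattern: str, wiki_id: str) -> str:
    """Resolve a printf-style pattern such as '%s_embeddings' for one wiki."""
    return pattern % wiki_id


def embeddings_index_body(dimension: int) -> Dict[str, Any]:
    """Build a create-index body with an HNSW vector field (cosine similarity)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "page_id": {"type": "long"},  # hypothetical field name
                "embedding": {  # hypothetical field name
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",  # assumed engine choice
                    },
                },
            }
        },
    }


def knn_query_body(vector: List[float], k: int = 10) -> Dict[str, Any]:
    """Build a search body returning the k nearest neighbours of `vector`."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }
```

The configured dimension has to match the output size of the deployed model (384 for the recommended MiniLM model), because OpenSearch rejects vectors whose length differs from the mapping.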

Capacity & operational considerations

This extension grows OpenSearch storage and RAM usage. The companion cdcf-tools repo provides a cdcf_capacity_guard.sh helper that the maintenance scripts and job queue use to verify free disk / RAM / heap before each batch and to abort cleanly if thresholds are breached. See docs/capacity-baseline.md for the projected growth and reversal procedures.
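
The kind of pre-batch check described above can be sketched as follows. This is not the logic of cdcf_capacity_guard.sh, whose interface and thresholds are documented in its own repo. The threshold values and all names below are hypothetical and only illustrate the "check, then proceed or abort" pattern.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits, chosen only for illustration.
    min_free_disk_fraction: float = 0.20
    max_heap_used_fraction: float = 0.85


def free_disk_fraction(path: str = "/") -> float:
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def batch_may_proceed(
    free_disk: float,
    heap_used: float,
    thresholds: Thresholds = Thresholds(),
) -> bool:
    """Return True only if both disk and heap are within the limits."""
    return (
        free_disk >= thresholds.min_free_disk_fraction
        and heap_used <= thresholds.max_heap_used_fraction
    )
```

A batch runner would call `batch_may_proceed` before each batch and stop cleanly, leaving unprocessed pages for a later run, as soon as it returns False.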

Layout

extension.json
includes/
  Hook/         MW hook handlers
  Job/          Deferred re-embedding jobs
  Embedding/    OpenSearch ML Commons client
  Search/       Hybrid query construction and rank fusion
  SpecialPage/  Special:SemanticSearch
  Api/          action=semanticsearch
maintenance/    BuildEmbeddings.php and other CLI scripts
i18n/           User-facing messages (en + qqq baseline; others welcome)
docs/           Plan, capacity baseline, runbook
tests/phpunit/  Unit and integration tests

Development

Standard MediaWiki extension contributing flow:

git clone https://github.com/CatholicOS/mediawiki-extensions-SemanticSearch.git \
    /path/to/mediawiki/extensions/SemanticSearch
composer install --working-dir=/path/to/mediawiki/extensions/SemanticSearch
php /path/to/mediawiki/maintenance/run.php update --quick

Linting follows the MediaWiki PHPCS standard.

composer test         # phpcs + phan + phpunit
composer phpcs:fix    # auto-fix where possible

License

GPL-2.0-or-later, matching MediaWiki upstream.

Governance

Maintained by the Catholic Digital Commons Foundation. External contributions welcome — see CONTRIBUTING.md.
