Hybrid semantic + keyword search for OpenClaw memory files.
Searches `.md` files in an OpenClaw workspace using OpenAI embeddings combined with BM25 keyword scoring. Returns the most relevant chunks ranked by a hybrid score (0.7 vector similarity + 0.3 BM25).
- Markdown-aware chunking: splits on `##` headers, respects paragraph boundaries, merges tiny sections
- BM25 hybrid scoring: 70% cosine similarity from embeddings + 30% BM25 keyword relevance
- Incremental caching: only re-embeds files that changed since the last run
- Configurable file discovery: indexes `memory/`, `bank/`, `MEMORY.md`, `USER.md`, `IDENTITY.md`

Requirements:

- Node.js 18+ (uses native `fetch`)
- An OpenAI API key
```sh
git clone https://github.com/auriwren/openclaw-semantic-search.git
cd openclaw-semantic-search
chmod +x semantic-search
```

Then either add the directory to your `PATH` or copy/symlink `semantic-search` into your OpenClaw `tools/` directory.
```sh
# Basic search
semantic-search "what happened last week"

# Limit results
semantic-search "project deadlines" --limit 10

# JSON output (for programmatic use)
semantic-search "meeting notes" --json
```

Human-readable mode shows each matching chunk with its source file, line number, and relevance score:
```
🔍 Search: "what happened last week"

━━━ memory/2026-02-07.md:15 (87%) ━━━
## Morning
Had a productive session working on the semantic search tool...

━━━ memory/2026-02-06.md:42 (74%) ━━━
## Evening
Wrapped up the week's tasks...
```
JSON mode (`--json`) returns structured data:

```json
{
  "query": "what happened last week",
  "results": [
    {
      "source": "memory/2026-02-07.md",
      "lineStart": 15,
      "score": 0.872,
      "text": "## Morning\nHad a productive session..."
    }
  ]
}
```

Configuration is via environment variables:

| Environment Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | Required. Your OpenAI API key | — |
| `OPENCLAW_WORKSPACE` | Path to the OpenClaw workspace to search | Current directory |
| `OPENCLAW_CACHE` | Directory for the embeddings cache file | `~/.openclaw/cache` |
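For example, to search a workspace other than the current directory (the paths shown are illustrative):

```shell
export OPENAI_API_KEY="sk-..."                        # required
OPENCLAW_WORKSPACE=~/notes semantic-search "standup summaries"
```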
Files are split into chunks using markdown structure:
- Split on `#`, `##`, and `###` headers into sections
- Sections under 100 characters are merged with the next section
- Sections over 800 characters are split at paragraph boundaries (double newlines)
- Header text is preserved as context in each sub-chunk
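The rules above can be sketched roughly as follows. This is a simplified illustration of the chunking pass, not the tool's actual code; the function name and thresholds' exact handling are assumptions:

```javascript
const MIN_CHARS = 100;  // sections shorter than this merge into the next one
const MAX_CHARS = 800;  // sections longer than this split at blank lines

function chunkMarkdown(text) {
  // Split on #, ##, ### headers, keeping each header with its section.
  const sections = text.split(/^(?=#{1,3} )/m).filter(s => s.trim());

  // Merge tiny sections into the following one.
  const merged = [];
  for (const section of sections) {
    const prev = merged[merged.length - 1];
    if (prev && prev.length < MIN_CHARS) {
      merged[merged.length - 1] = prev + section;
    } else {
      merged.push(section);
    }
  }

  // Split oversized sections at paragraph boundaries (double newlines),
  // repeating the header so each sub-chunk keeps its context.
  const chunks = [];
  for (const section of merged) {
    if (section.length <= MAX_CHARS) {
      chunks.push(section.trim());
      continue;
    }
    const header = section.match(/^#{1,3} .*$/m)?.[0] ?? "";
    let current = "";
    for (const para of section.split(/\n\n+/)) {
      if (current && current.length + para.length > MAX_CHARS) {
        chunks.push(current.trim());
        current = header ? header + "\n\n" : "";
      }
      current += para + "\n\n";
    }
    if (current.trim()) chunks.push(current.trim());
  }
  return chunks;
}
```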
Standard BM25 with k1 = 1.2 and b = 0.75. Query and document text are tokenized into lowercase alphanumeric words. IDF uses the standard log formula. BM25 scores are normalized to 0-1 before combining.
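A sketch of that scoring in JavaScript, using the stated parameters. Function names are illustrative, and the IDF here uses the common +0.5-smoothed variant of the log formula, which is an assumption about which "standard" form the tool means:

```javascript
const K1 = 1.2;
const B = 0.75;

// Tokenize into lowercase alphanumeric words.
const tokenize = text => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(query, docs) {
  const docTokens = docs.map(tokenize);
  const avgLen = docTokens.reduce((sum, t) => sum + t.length, 0) / docs.length;

  // Document frequency for each distinct query term.
  const terms = [...new Set(tokenize(query))];
  const df = new Map(terms.map(term =>
    [term, docTokens.filter(tokens => tokens.includes(term)).length]));

  return docTokens.map(tokens => {
    let score = 0;
    for (const term of terms) {
      const tf = tokens.filter(t => t === term).length;
      if (tf === 0) continue;
      // Smoothed IDF (assumed variant of the "standard log formula").
      const idf = Math.log((docs.length - df.get(term) + 0.5) / (df.get(term) + 0.5) + 1);
      score += idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * tokens.length / avgLen));
    }
    return score;
  });
}

// Normalize raw scores to 0-1 before combining with cosine similarity.
const normalize = scores => {
  const max = Math.max(...scores);
  return max > 0 ? scores.map(s => s / max) : scores;
};
```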
Each chunk gets two scores:
- Cosine similarity (0-1) between the query embedding and chunk embedding, using OpenAI `text-embedding-3-small`
- BM25 score (normalized 0-1) for keyword relevance
Final score = 0.7 * cosine + 0.3 * BM25
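As a minimal sketch of the combination step (the helper names are illustrative, not the tool's internals):

```javascript
const VECTOR_WEIGHT = 0.7;
const BM25_WEIGHT = 0.3;

// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Both inputs are expected to already be in [0, 1].
const hybridScore = (cosine, bm25) => VECTOR_WEIGHT * cosine + BM25_WEIGHT * bm25;
```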
Embeddings are cached in `memory-embeddings.json` in the cache directory. On each run, only files whose modification times have changed are re-embedded. Deleted files are automatically cleaned from the cache.
This tool is designed to work as an OpenClaw custom tool. Place it in your workspace's `tools/` directory and it will be available to your agent. The `--json` flag makes it easy to parse results programmatically.
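For instance, a consumer could parse the documented `--json` shape like this. In practice you would capture stdout from `semantic-search "query" --json` (e.g. via `child_process.execFileSync`); here a literal string stands in for that output:

```javascript
const stdout = `{
  "query": "meeting notes",
  "results": [
    { "source": "memory/2026-02-07.md", "lineStart": 15, "score": 0.872, "text": "## Morning..." }
  ]
}`;

const { query, results } = JSON.parse(stdout);
const best = results
  .filter(r => r.score >= 0.5)                                          // keep confident matches
  .map(r => `${r.source}:${r.lineStart} (${Math.round(r.score * 100)}%)`);
```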
MIT — see LICENSE.