openclaw-semantic-search

Hybrid semantic + keyword search for OpenClaw memory files.

What it does

Searches .md files in an OpenClaw workspace using OpenAI embeddings combined with BM25 keyword scoring. Returns the most relevant chunks ranked by a hybrid score (0.7 vector similarity + 0.3 BM25).

Features

  • Markdown-aware chunking — splits on ## headers, respects paragraph boundaries, merges tiny sections
  • BM25 hybrid scoring — 70% cosine similarity from embeddings + 30% BM25 keyword relevance
  • Incremental caching — only re-embeds files that changed since last run
  • Configurable file discovery — indexes memory/, bank/, MEMORY.md, USER.md, IDENTITY.md

Requirements

  • Node.js 18+ (uses native fetch)
  • OpenAI API key

Installation

git clone https://github.com/auriwren/openclaw-semantic-search.git
cd openclaw-semantic-search
chmod +x semantic-search

Then either add the directory to your PATH or copy/symlink semantic-search into your OpenClaw tools/ directory.

Usage

# Basic search
semantic-search "what happened last week"

# Limit results
semantic-search "project deadlines" --limit 10

# JSON output (for programmatic use)
semantic-search "meeting notes" --json

Output

Human-readable mode shows each matching chunk with its source file, line number, and relevance score:

🔍 Search: "what happened last week"

━━━ memory/2026-02-07.md:15 (87%) ━━━
## Morning
Had a productive session working on the semantic search tool...

━━━ memory/2026-02-06.md:42 (74%) ━━━
## Evening
Wrapped up the week's tasks...

JSON mode (--json) returns structured data:

{
  "query": "what happened last week",
  "results": [
    {
      "source": "memory/2026-02-07.md",
      "lineStart": 15,
      "score": 0.872,
      "text": "## Morning\nHad a productive session..."
    }
  ]
}

Configuration

Environment Variable   Description                                Default
OPENAI_API_KEY         Your OpenAI API key. Required.             (none)
OPENCLAW_WORKSPACE     Path to the OpenClaw workspace to search   Current directory
OPENCLAW_CACHE         Directory for the embeddings cache file    ~/.openclaw/cache
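A minimal setup might look like the following (the key value and workspace path are placeholders):

```shell
# Supply an API key and point the tool at a workspace (placeholder values).
export OPENAI_API_KEY="sk-..."
export OPENCLAW_WORKSPACE="$HOME/openclaw"
semantic-search "project deadlines"
```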

How it works

Chunking strategy

Files are split into chunks using markdown structure:

  1. Split on #, ##, ### headers into sections
  2. Sections under 100 characters are merged with the next section
  3. Sections over 800 characters are split at paragraph boundaries (double newlines)
  4. Header text is preserved as context in each sub-chunk
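The four steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the tool's actual internals; the function name and thresholds mirror the description (100- and 800-character limits):

```javascript
const MIN_CHARS = 100; // merge sections shorter than this
const MAX_CHARS = 800; // split sections longer than this

function chunkMarkdown(text) {
  // 1. Split on #, ##, ### headers into sections.
  const sections = text.split(/^(?=#{1,3} )/m).filter(s => s.trim());

  // 2. Merge sections under MIN_CHARS into the next section.
  const merged = [];
  let pending = '';
  for (const section of sections) {
    pending += section;
    if (pending.length >= MIN_CHARS) {
      merged.push(pending);
      pending = '';
    }
  }
  if (pending) merged.push(pending);

  // 3. Split oversized sections at paragraph boundaries (double newlines),
  // 4. carrying the header into each sub-chunk as context.
  const chunks = [];
  for (const section of merged) {
    if (section.length <= MAX_CHARS) {
      chunks.push(section.trim());
      continue;
    }
    const [header, ...paras] = section.split(/\n\n+/);
    let current = header;
    for (const para of paras) {
      if ((current + '\n\n' + para).length > MAX_CHARS && current !== header) {
        chunks.push(current.trim());
        current = header; // restart sub-chunk with the header as context
      }
      current += '\n\n' + para;
    }
    chunks.push(current.trim());
  }
  return chunks;
}
```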

BM25 parameters

Standard BM25 with k1 = 1.2 and b = 0.75. Query and document text are tokenized into lowercase alphanumeric words. IDF uses the standard log formula. BM25 scores are normalized to 0-1 before combining.
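A minimal sketch of that scoring, using the stated parameters; this is illustrative, not the tool's exact implementation:

```javascript
const K1 = 1.2;
const B = 0.75;

// Lowercase alphanumeric tokenization, as described above.
const tokenize = text => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(query, docs) {
  const docTokens = docs.map(tokenize);
  const avgLen = docTokens.reduce((s, t) => s + t.length, 0) / docs.length;
  const N = docs.length;
  const qTerms = tokenize(query);

  const scores = docTokens.map(tokens => {
    const tf = new Map();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    let score = 0;
    for (const term of qTerms) {
      const df = docTokens.filter(t => t.includes(term)).length;
      if (df === 0) continue;
      // Standard IDF: log((N - df + 0.5) / (df + 0.5) + 1)
      const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
      const f = tf.get(term) ?? 0;
      score += idf * (f * (K1 + 1)) /
               (f + K1 * (1 - B + B * tokens.length / avgLen));
    }
    return score;
  });

  // Normalize to 0-1 before combining with the vector score.
  const max = Math.max(...scores);
  return max > 0 ? scores.map(s => s / max) : scores;
}
```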

Hybrid scoring

Each chunk gets two scores:

  • Cosine similarity (0-1) between the query embedding and chunk embedding using OpenAI text-embedding-3-small
  • BM25 score (normalized 0-1) for keyword relevance

Final score = 0.7 * cosine + 0.3 * BM25
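In code, the blend is a straightforward weighted sum (cosine similarity sketched here for completeness; the embeddings themselves come from the OpenAI API):

```javascript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Final score = 0.7 * cosine + 0.3 * normalized BM25.
const hybridScore = (cosine, bm25) => 0.7 * cosine + 0.3 * bm25;
```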

Caching

Embeddings are cached in memory-embeddings.json in the cache directory. On each run, only files with changed modification times are re-embedded. Deleted files are automatically cleaned from the cache.

Integration with OpenClaw

This tool is designed to work as an OpenClaw custom tool. Place it in your workspace's tools/ directory and it will be available to your agent. The --json flag makes it easy to parse results programmatically.

License

MIT — see LICENSE.
