bench: 3-model encoder comparison (BGE vs BCE vs E5) #8

Open
mukund-setti wants to merge 2 commits into main from retrieval-benchmark

@mukund-setti
Collaborator

Purpose

Defensible answer to "why did you choose BGE-small?" — backed by data on our actual corpus.

This branch is Q&A evidence, not deployment. Intended to stay as a design-rationale artifact. Do not merge unless we want benchmark.py and the encoder model_name extension on main.

What's in this PR

  • agent/scripts/benchmark.py — encoder bake-off harness. Reuses the 15 queries from eval.py, holds the reranker constant (BGE-reranker-base), wipes the index between models for clean cold-start numbers.
  • agent/scripts/benchmark-results.md — full results table, decision rationale, per-query top-3 breakdowns for all three models.
  • agent/brain_agents/retrieval/encoder.py — extended BGEEncoder to accept any sentence-transformers model via the model_name parameter. Added per-family prompt formatting (BGE query prefix, E5 query:/passage: prefixes, BCE no-prefix). Cache fingerprint includes the doc prefix so prompt-format changes invalidate cached embeddings. Default behavior unchanged.
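
The per-family prompt formatting and the prefix-aware cache fingerprint described above could look roughly like the sketch below. This is not the code in encoder.py: `PromptFormat`, `PROMPT_FORMATS`, `prompt_format_for`, and `cache_fingerprint` are hypothetical names, and matching the family by model-name prefix is an assumption. The prefix strings themselves are the ones the BGE v1.5 and E5 model cards recommend.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptFormat:
    """Prefixes prepended to queries and documents before encoding."""
    query_prefix: str = ""
    doc_prefix: str = ""


# Hypothetical lookup table; the real encoder.py may organize this differently.
PROMPT_FORMATS = {
    "bge": PromptFormat(
        query_prefix="Represent this sentence for searching relevant passages: "
    ),
    "e5": PromptFormat(query_prefix="query: ", doc_prefix="passage: "),
    "bce": PromptFormat(),  # BCE is used without prefixes
}


def prompt_format_for(model_name: str) -> PromptFormat:
    """Pick a prompt format from the model name (assumed naming heuristic)."""
    short = model_name.lower().rsplit("/", 1)[-1]
    for family, fmt in PROMPT_FORMATS.items():
        if short.startswith(family):
            return fmt
    return PromptFormat()  # unknown family: encode text as-is


def cache_fingerprint(model_name: str, doc_prefix: str) -> str:
    """Hash the model name together with the doc prefix, so changing
    either one yields a new key and stale embeddings are not reused."""
    payload = f"{model_name}\x00{doc_prefix}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


fmt = prompt_format_for("intfloat/e5-small-v2")
key = cache_fingerprint("intfloat/e5-small-v2", fmt.doc_prefix)
```

Keying the cache on the document prefix rather than the query prefix reflects that only document embeddings are persisted; queries are encoded at request time.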

Findings

| Model | Cold start (s) | Warm load (s) | Mean query (ms) | Recall@1 strict | Recall@3 |
|---|---|---|---|---|---|
| bge-small-en-v1.5 | 9.7 | 0.04 | 3152 | 9/14 | 12/14 |
| bce-embedding-base_v1 | 29.7 | 0.01 | 5716 | 9/14 | 12/14 |
| e5-small-v2 | 11.6 | 0.01 | 7615 | 9/14 | 12/14 |

Recall is identical across all three models. All encoders surfaced the same top-3 candidates for 14/15 queries on this corpus — modern dense encoders converge on obviously-relevant content for small, well-structured corpora. The reranker (held constant) then ordered the identical candidate sets identically.
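
For readers unfamiliar with the metric, Recall@k as reported in the table can be sketched as below. The function name, the data layout, and the toy queries are illustrative assumptions, not the logic in eval.py or benchmark.py; in particular, the exact meaning of "strict" in Recall@1 strict is not reproduced here.

```python
def recall_at_k(ranked, expected, k):
    """Return (hits, total): how many queries have their expected file
    somewhere in the top-k ranked results."""
    hits = 0
    for query, expected_file in expected.items():
        if expected_file in ranked.get(query, [])[:k]:
            hits += 1
    return hits, len(expected)


# Toy data, invented for illustration only.
expected = {"q1": "a.md", "q2": "b.md", "q3": "c.md"}
ranked = {
    "q1": ["a.md", "x.md", "y.md"],  # hit at rank 1
    "q2": ["x.md", "y.md", "b.md"],  # hit at rank 3
    "q3": ["x.md", "y.md", "z.md"],  # miss
}

hits1, total = recall_at_k(ranked, expected, k=1)
hits3, _ = recall_at_k(ranked, expected, k=3)
print(f"Recall@1 {hits1}/{total}, Recall@3 {hits3}/{total}")
```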

BGE-small wins on operational characteristics:

  • Latency — 1.8x faster than BCE per query, 2.4x faster than E5
  • Cold start — 3x faster than BCE
  • Footprint — 33MB vs 280MB (BCE) vs ~130MB (E5)
  • Ecosystem fit with BGE-reranker-base — same vocabulary, simpler joint cache
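
The speedup figures follow directly from the Findings table. A quick check, using only the numbers reported above (the dictionary and function names are just for this snippet):

```python
BASELINE = "bge-small-en-v1.5"

mean_query_ms = {
    "bge-small-en-v1.5": 3152,
    "bce-embedding-base_v1": 5716,
    "e5-small-v2": 7615,
}
cold_start_s = {
    "bge-small-en-v1.5": 9.7,
    "bce-embedding-base_v1": 29.7,
    "e5-small-v2": 11.6,
}


def slowdown(table, model, baseline=BASELINE):
    """How many times slower `model` is than the baseline."""
    return table[model] / table[baseline]


for model in ("bce-embedding-base_v1", "e5-small-v2"):
    print(
        f"{model}: query {slowdown(mean_query_ms, model):.1f}x, "
        f"cold start {slowdown(cold_start_s, model):.1f}x"
    )
# bce-embedding-base_v1: query 1.8x, cold start 3.1x
# e5-small-v2: query 2.4x, cold start 1.2x
```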

The latency gap has clean explanations: BCE produces 768-dim embeddings (2x BGE's 384), and E5 prepends query: / passage: tokens to every input.

Adversarial behavior (the cookies query): all three correctly held rerank confidence at 0.500, indicating no false confidence on out-of-distribution input. E5 chose a different (still wrong) top-1 file than BGE/BCE.

Decision

Keep BAAI/bge-small-en-v1.5. Same quality, faster, smaller, ecosystem fit. The encoder API now supports swapping models cleanly via model_name if a future workload demands it — no rewrite required.

Reproducibility

cd agent
uv run python scripts/benchmark.py

Prints the summary table to stdout and writes the full report to scripts/benchmark-results.md.
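
For orientation, the three timing columns could be measured along the lines of the sketch below. It is an assumption about the harness, not an excerpt from benchmark.py: here "cold start" is taken to mean building the index from a wiped state, "warm load" means reloading the cached index, and the callables `build_index`, `load_index`, and `search` are hypothetical stand-ins.

```python
import time
from statistics import mean


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def benchmark_model(build_index, load_index, search, queries):
    """Time one encoder: cold build, warm reload, then each query."""
    _, cold_s = timed(build_index)  # assumes the index was wiped beforehand
    _, warm_s = timed(load_index)   # second load hits the embedding cache
    per_query_ms = []
    for query in queries:
        _, elapsed = timed(search, query)
        per_query_ms.append(elapsed * 1000.0)
    return {
        "cold_start_s": cold_s,
        "warm_load_s": warm_s,
        "mean_query_ms": mean(per_query_ms),
    }


# Stub callables so the sketch runs without any model downloads.
report = benchmark_model(
    build_index=lambda: None,
    load_index=lambda: None,
    search=lambda q: [q],
    queries=["first query", "second query"],
)
print(sorted(report))
```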

Mukund Ummadisetti added 2 commits April 25, 2026 13:51
The project switched from OpenRouter to Gemini after the initial update/
landed; the LLM auto-mode probe was still reading the deleted
openrouter_api_key field, so the heuristic path was always selected even
when a real key was configured. Probe gemini_api_key / google_api_key
first, fall back to openrouter_api_key, and treat all of them as optional
so offline mode keeps working.

Made-with: Cursor

Compares BAAI/bge-small-en-v1.5, maidalun1020/bce-embedding-base_v1, and intfloat/e5-small-v2 on the 15-query eval set with BGE-reranker-base held constant. Quality identical across all 3 (12/14 Recall@3). BGE-small wins on latency (3.2s/query vs 5.7s BCE vs 7.6s E5), cold start (9.7s vs 29.7s BCE), and footprint (33MB vs 280MB BCE). Decision: keep BGE-small. Branch preserved as Q&A evidence, not intended to merge.