bench: 3-model encoder comparison (BGE vs BCE vs E5) #8
Open
mukund-setti wants to merge 2 commits into
added 2 commits
April 25, 2026 13:51
The project switched from OpenRouter to Gemini after the initial update/ landed; the LLM auto-mode probe was still reading the deleted openrouter_api_key field, so the heuristic path was always selected even when a real key was configured. Probe gemini_api_key / google_api_key first, fall back to openrouter_api_key, and treat all of them as optional so offline mode keeps working.
Made-with: Cursor
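The probe order described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names come from the commit message, while `resolve_llm_key`, `select_mode`, and the dict-shaped settings object are hypothetical.

```python
from typing import Mapping, Optional

# Field names follow the commit message; the tuple order encodes the probe order.
KEY_FIELDS = ("gemini_api_key", "google_api_key", "openrouter_api_key")


def resolve_llm_key(settings: Mapping[str, Optional[str]]) -> Optional[str]:
    """Return the first non-empty key in probe order, or None (offline mode)."""
    for field in KEY_FIELDS:
        value = settings.get(field)  # every field is optional
        if value and value.strip():
            return value.strip()
    return None


def select_mode(settings: Mapping[str, Optional[str]]) -> str:
    """Hypothetical auto-mode switch: LLM path only when some key is configured."""
    return "llm" if resolve_llm_key(settings) else "heuristic"
```

With no keys configured, `select_mode({})` falls through to the heuristic path, which matches the offline-mode behavior the commit wants to preserve.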
Compares BAAI/bge-small-en-v1.5, maidalun1020/bce-embedding-base_v1, and intfloat/e5-small-v2 on the 15-query eval set with BGE-reranker-base held constant. Quality identical across all 3 (12/14 Recall@3). BGE-small wins on latency (3.2s/query vs 5.7s BCE vs 7.6s E5), cold start (9.7s vs 29.7s BCE), and footprint (33MB vs 280MB BCE). Decision: keep BGE-small. Branch preserved as Q&A evidence, not intended to merge.
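As a reference point for the Recall@3 figure quoted above, here is one common way to compute it. This is a sketch under assumptions: the function name and the input shapes are hypothetical, and excluding queries with no relevant document from the denominator is only one possible reading of how a 15-query set yields a denominator of 14.

```python
from typing import Dict, List, Set


def recall_at_k(
    results: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 3,
) -> float:
    """Fraction of answerable queries with at least one relevant doc in the top k."""
    # Queries with an empty relevant set (e.g. an adversarial query) are skipped.
    answerable = [q for q, docs in relevant.items() if docs]
    if not answerable:
        return 0.0
    hits = sum(
        1 for q in answerable if relevant[q] & set(results.get(q, [])[:k])
    )
    return hits / len(answerable)
```

Under this definition, two encoders that return the same top-k candidate sets for every answerable query necessarily get the same score.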
Purpose
Defensible answer to "why did you choose BGE-small?" — backed by data on our actual corpus.
This branch is Q&A evidence, not deployment. Intended to stay as a design-rationale artifact. Do not merge unless we want benchmark.py and the encoder model_name extension on main.
What's in this PR
agent/scripts/benchmark.py: encoder bake-off harness. Reuses the 15 queries from eval.py, holds the reranker constant (BGE-reranker-base), wipes the index between models for clean cold-start numbers.
agent/scripts/benchmark-results.md: full results table, decision rationale, per-query top-3 breakdowns for all three models.
agent/brain_agents/retrieval/encoder.py: extended BGEEncoder to accept any sentence-transformers model via the model_name parameter. Added per-family prompt formatting (BGE query prefix, E5 query:/passage: prefixes, BCE no-prefix). Cache fingerprint includes the doc prefix so prompt-format changes invalidate cached embeddings. Default behavior unchanged.
Findings
Models compared: bge-small-en-v1.5, bce-embedding-base_v1, e5-small-v2.
Recall is identical across all three models. All encoders surfaced the same top-3 candidates for 14/15 queries on this corpus; modern dense encoders tend to converge on obviously relevant content for small, well-structured corpora. The reranker (held constant) then ordered the identical candidate sets identically.
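BGE-small wins on operational characteristics: latency (3.2 s/query vs 5.7 s for BCE and 7.6 s for E5), cold start (9.7 s vs 29.7 s for BCE), and footprint (33 MB vs 280 MB for BCE).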
The latency gap has clean explanations: BCE is 768-dim (2x BGE's 384), and E5 prepends query:/passage: tokens to every input.
Adversarial behavior (the cookies query): all three correctly held rerank confidence at 0.500, indicating no false confidence on out-of-distribution input. E5 chose a different (still wrong) top-1 file than BGE/BCE.
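The per-family prompt formatting and prefix-aware cache fingerprint listed under "What's in this PR" might look roughly like the sketch below. It is an illustration only, not the contents of encoder.py: `prefixes_for` and `cache_fingerprint` are hypothetical names, and matching the family by substring of the model name is an assumption. The prefix strings are the ones published on the BGE and E5 model cards.

```python
import hashlib
from typing import Tuple

# Query instruction recommended for BGE v1.5 English models.
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prefixes_for(model_name: str) -> Tuple[str, str]:
    """Return (query_prefix, doc_prefix) for a model family, matched by name."""
    name = model_name.lower()
    if "bge" in name:
        return BGE_QUERY_PREFIX, ""      # BGE: prefix on queries only
    if "e5" in name:
        return "query: ", "passage: "    # E5: prefix on both sides
    return "", ""                        # BCE and unknown families: no prefix


def cache_fingerprint(model_name: str, doc_prefix: str, text: str) -> str:
    """Hash model, doc prefix, and text so a prefix change misses the cache."""
    payload = "\x1f".join((model_name, doc_prefix, text))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the doc prefix is part of the hashed payload, switching a model from no prefix to `passage: ` produces a different fingerprint for the same text, so stale embeddings are not reused.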
Decision
Keep BAAI/bge-small-en-v1.5. Same quality, faster, smaller, ecosystem fit. The encoder API now supports swapping models cleanly via model_name if a future workload demands it, with no rewrite required.
Reproducibility
    cd agent
    uv run python scripts/benchmark.py
Prints the summary table to stdout and writes the full report to scripts/benchmark-results.md.
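For readers without access to the branch, the kind of timing loop such a harness runs can be sketched as below. This is not benchmark.py itself: `time_encoder` and the loader callable are hypothetical, and defining cold start as model load plus the first query is an assumption about what the report measures.

```python
import time
from statistics import mean
from typing import Callable, Dict, Sequence


def time_encoder(
    load: Callable[[], Callable[[str], object]],
    queries: Sequence[str],
) -> Dict[str, float]:
    """Measure cold start (load + first query) and mean warm per-query latency."""
    start = time.perf_counter()
    encode = load()            # e.g. construct the model; assumed to be the slow part
    encode(queries[0])
    cold_start = time.perf_counter() - start

    warm = []
    for query in queries[1:]:
        t0 = time.perf_counter()
        encode(query)
        warm.append(time.perf_counter() - t0)

    # With a single query there is no warm sample, so fall back to the cold figure.
    return {
        "cold_start_s": cold_start,
        "mean_query_s": mean(warm) if warm else cold_start,
    }
```

Calling it once per model with a fresh loader keeps one model's warm caches from leaking into the next model's cold-start figure, which is the same concern the harness addresses by wiping the index between models.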