bench: 3-model encoder comparison (BGE vs BCE vs E5) by mukund-setti · Pull Request #8 · SanjoyDat1/Brian

mukund-setti · 2026-04-25T21:15:11Z

Purpose

Defensible answer to "why did you choose BGE-small?" — backed by data on our actual corpus.

This branch is Q&A evidence, not deployment. Intended to stay as a design-rationale artifact. Do not merge unless we want benchmark.py and the encoder model_name extension on main.

What's in this PR

agent/scripts/benchmark.py — encoder bake-off harness. Reuses the 15 queries from eval.py, holds the reranker constant (BGE-reranker-base), wipes the index between models for clean cold-start numbers.
agent/scripts/benchmark-results.md — full results table, decision rationale, per-query top-3 breakdowns for all three models.
agent/brain_agents/retrieval/encoder.py — extended BGEEncoder to accept any sentence-transformers model via the model_name parameter. Added per-family prompt formatting (BGE query prefix, E5 query:/passage: prefixes, BCE no-prefix). Cache fingerprint includes the doc prefix so prompt-format changes invalidate cached embeddings. Default behavior unchanged.
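The per-family prompt formatting and the prefix-aware cache fingerprint could look roughly like the sketch below. This is not the code in encoder.py: `PromptFormat`, `prompt_format_for`, and `cache_fingerprint` are hypothetical names, the substring matching on the model name is an assumption, and the BGE query instruction is the string commonly documented for the English v1.5 models.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptFormat:
    """Prefixes prepended to queries and documents before encoding."""
    query_prefix: str
    doc_prefix: str


def prompt_format_for(model_name: str) -> PromptFormat:
    """Pick prefixes by model family (hypothetical name-based dispatch)."""
    name = model_name.lower()
    if "bge" in name:
        # BGE: instruction on the query side only, documents are encoded as-is.
        return PromptFormat(
            "Represent this sentence for searching relevant passages: ", ""
        )
    if "e5" in name:
        # E5: both sides carry a role prefix.
        return PromptFormat("query: ", "passage: ")
    # BCE and anything unrecognised: no prefix.
    return PromptFormat("", "")


def cache_fingerprint(model_name: str, text: str) -> str:
    """Key for a cached document embedding.

    The doc prefix is part of the key, so changing the prompt format
    produces a different key and stale embeddings are never reused.
    """
    fmt = prompt_format_for(model_name)
    payload = "\x00".join([model_name, fmt.doc_prefix, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Hashing the prefix together with the model name and text means no separate invalidation step is needed: a cached entry written under an old prompt format simply stops matching.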

Findings

Model	Cold start (s)	Warm load (s)	Mean query (ms)	Recall@1 strict	Recall@3
`bge-small-en-v1.5`	9.7	0.04	3152	9/14	12/14
`bce-embedding-base_v1`	29.7	0.01	5716	9/14	12/14
`e5-small-v2`	11.6	0.01	7615	9/14	12/14

Recall is identical across all three models. All encoders surfaced the same top-3 candidates for 14/15 queries on this corpus — modern dense encoders converge on obviously-relevant content for small, well-structured corpora. The reranker (held constant) then ordered the identical candidate sets identically.
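For readers who want to check how counts like 9/14 and 12/14 are tallied, here is a minimal sketch of Recall@k over a query set. It is not taken from benchmark.py or eval.py: `recall_at_k` and the toy file names are hypothetical, and skipping queries with no expected document is an assumption (it would be one way to get a denominator of 14 from 15 queries, but the PR does not state this).

```python
from typing import List, Optional, Sequence, Tuple

# (expected document or None, ranked documents returned for the query)
QueryResult = Tuple[Optional[str], Sequence[str]]


def recall_at_k(results: Sequence[QueryResult], k: int) -> Tuple[int, int]:
    """Return (hits, scored): queries whose expected doc is in the top k,
    out of the queries that have an expected doc at all."""
    hits = 0
    scored = 0
    for expected, ranked in results:
        if expected is None:
            # Assumed handling: queries with no right answer are not scored.
            continue
        scored += 1
        if expected in ranked[:k]:
            hits += 1
    return hits, scored


# Toy data only, not the real eval set.
toy: List[QueryResult] = [
    ("a.md", ["a.md", "b.md", "c.md"]),
    ("b.md", ["c.md", "a.md", "b.md"]),
    (None, ["x.md", "y.md", "z.md"]),
]
r1 = recall_at_k(toy, 1)  # (1, 2)
r3 = recall_at_k(toy, 3)  # (2, 2)
```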

BGE-small wins on operational characteristics:

Latency — 1.8x faster than BCE per query, 2.4x faster than E5
Cold start — 3x faster than BCE
Footprint — 33MB vs 280MB (BCE) vs ~130MB (E5)
Ecosystem fit with BGE-reranker-base — same vocabulary, simpler joint cache
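The latency and cold-start ratios quoted above can be re-derived from the Findings table. The snippet below uses only the table's numbers; the helper name `ratio_vs_baseline` is hypothetical.

```python
from typing import Dict

# Values copied from the Findings table.
mean_query_ms: Dict[str, float] = {
    "bge-small-en-v1.5": 3152,
    "bce-embedding-base_v1": 5716,
    "e5-small-v2": 7615,
}
cold_start_s: Dict[str, float] = {
    "bge-small-en-v1.5": 9.7,
    "bce-embedding-base_v1": 29.7,
    "e5-small-v2": 11.6,
}


def ratio_vs_baseline(values: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Express every model's value as a multiple of the baseline model's."""
    base = values[baseline]
    return {name: round(v / base, 1) for name, v in values.items()}


latency_ratio = ratio_vs_baseline(mean_query_ms, "bge-small-en-v1.5")
cold_ratio = ratio_vs_baseline(cold_start_s, "bge-small-en-v1.5")
# latency_ratio: BCE 1.8, E5 2.4; cold_ratio: BCE 3.1, E5 1.2
```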

The latency gap has clean explanations: BCE produces 768-dim embeddings (2x BGE's 384), and E5 prepends query: / passage: prefixes to every input.

Adversarial behavior (the cookies query): all three correctly held rerank confidence at 0.500, indicating no false confidence on out-of-distribution input. E5 chose a different (still wrong) top-1 file than BGE/BCE.

Decision

Keep BAAI/bge-small-en-v1.5. Same quality, faster, smaller, ecosystem fit. The encoder API now supports swapping models cleanly via model_name if a future workload demands it — no rewrite required.

Reproducibility

cd agent
uv run python scripts/benchmark.py

Prints the summary table to stdout and writes the full report to scripts/benchmark-results.md.

The project switched from OpenRouter to Gemini after the initial update/ landed; the LLM auto-mode probe was still reading the deleted openrouter_api_key field, so the heuristic path was always selected even when a real key was configured. Probe gemini_api_key / google_api_key first, fall back to openrouter_api_key, and treat all of them as optional so offline mode keeps working. Made-with: Cursor
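A minimal sketch of the probe order described in this commit message, assuming settings are exposed as a simple mapping. The field names come from the message itself; `resolve_llm_key`, `select_mode`, and the mode label "llm" are hypothetical, and the real settings object may look different.

```python
from typing import Mapping, Optional, Tuple

# Probe order from the commit message: Gemini/Google first, OpenRouter last.
KEY_PRIORITY = ("gemini_api_key", "google_api_key", "openrouter_api_key")


def resolve_llm_key(
    settings: Mapping[str, Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Return (field_name, key) for the first configured key, else None.

    Every field is optional: a missing, None, or blank value is skipped
    rather than raising, so offline mode keeps working.
    """
    for field in KEY_PRIORITY:
        value = settings.get(field)
        if value and value.strip():
            return field, value.strip()
    return None


def select_mode(settings: Mapping[str, Optional[str]]) -> str:
    """Use the LLM path only when some key is configured."""
    return "llm" if resolve_llm_key(settings) is not None else "heuristic"
```

With this order, a configuration that only sets `gemini_api_key` no longer falls through to the heuristic path, and a legacy `openrouter_api_key` is still honoured when no Gemini or Google key is present.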

Compares BAAI/bge-small-en-v1.5, maidalun1020/bce-embedding-base_v1, and intfloat/e5-small-v2 on the 15-query eval set with BGE-reranker-base held constant. Quality identical across all 3 (12/14 Recall@3). BGE-small wins on latency (3.2s/query vs 5.7s BCE vs 7.6s E5), cold start (9.7s vs 29.7s BCE), and footprint (33MB vs 280MB BCE). Decision: keep BGE-small. Branch preserved as Q&A evidence, not intended to merge.

Mukund Ummadisetti added 2 commits April 25, 2026 13:51

mukund-setti assigned arrana16 Apr 25, 2026

mukund-setti wants to merge 2 commits into main from retrieval-benchmark
