LoCoMo Benchmark Results — NEXO Brain vs GPT-4, LLaMA-3, Gemini #3

wazionapps · 2026-03-24T13:33:20Z

wazionapps
Mar 24, 2026
Maintainer

We evaluated NEXO Brain on the LoCoMo benchmark (ACL 2024) — a peer-reviewed dataset that tests long-term conversation memory with 1,986 questions across 10 multi-session conversations.

Results (v0.5.0)

System	F1	Adversarial	Hardware
NEXO Brain v0.5.0	0.588	93.3%	CPU only
GPT-4 (128K full context)	0.379	—	GPU cloud
Gemini Pro 1.0	0.313	—	GPU cloud
LLaMA-3 70B	0.295	—	A100 GPU
GPT-3.5 + Contriever RAG	0.283	—	GPU

+55% F1 relative to GPT-4 (0.588 vs 0.379), running entirely on CPU.
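The headline figure is a relative gain in F1 over the GPT-4 row of the table. A minimal sketch of that arithmetic follows; the helper name `relative_gain` is hypothetical and only the two F1 values come from the table.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Return the relative improvement of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# F1 scores copied from the results table.
NEXO_F1 = 0.588
GPT4_F1 = 0.379

if __name__ == "__main__":
    # Prints the gain rounded to a whole percent.
    print(f"{relative_gain(NEXO_F1, GPT4_F1):.0f}%")
```

Note that this is a relative difference, not a difference in percentage points: the absolute F1 gap is 0.209.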

Highlights

F1: 0.588 — highest published score on LoCoMo
93.3% adversarial rejection — reliably says 'I don't know'
74.9% recall across 1,986 questions
Open-domain F1: 0.637 | Multi-hop: 0.333 | Temporal: 0.326
768-dim embeddings (BAAI/bge-base-en-v1.5) — CPU only
First MCP memory server benchmarked on peer-reviewed data
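For readers unfamiliar with how an F1 score is computed for question answering, here is a minimal sketch of token-overlap F1 between a predicted answer and a reference answer. It assumes the common SQuAD-style definition with lowercasing and whitespace tokenization; the exact normalization used by the benchmark scripts may differ, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings (assumed SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A benchmark-level F1 such as the ones reported here would then typically be the mean of this per-question score over all questions.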

v0.5.0 Features

768-dim embeddings (upgraded from 384)
Hybrid search (vector + BM25 via FTS5)
Cross-encoder reranking (MiniLM-L-6-v2)
Multi-query decomposition
Intelligent chunking (overlapping 3-turn)
Session summaries
Adaptive Ebbinghaus decay
Temporal indexing
Auto-migration (384→768 transparent)
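As an illustration of the "overlapping 3-turn" chunking item, the sketch below splits a conversation into sliding windows of three turns. This is an assumption about what the feature means, not NEXO Brain's actual code; `chunk_turns`, `size`, and `stride` are hypothetical names.

```python
from typing import List, Sequence


def chunk_turns(turns: Sequence[str], size: int = 3, stride: int = 1) -> List[List[str]]:
    """Split conversation turns into overlapping windows of `size` turns."""
    if size < 1 or stride < 1:
        raise ValueError("size and stride must be positive")
    if not turns:
        return []
    if len(turns) <= size:
        # Short conversations become a single chunk.
        return [list(turns)]

    last_start = len(turns) - size
    starts = list(range(0, last_start + 1, stride))
    # Make sure the final turns are covered even when stride skips them.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [list(turns[s:s + size]) for s in starts]
```

With the default stride of 1, every turn except the first and last two appears in three chunks, so a fact that spans adjacent turns is kept together in at least one of them.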

Raw Data

Full results in benchmarks/locomo/results/

wazionapps · 2026-03-24T16:35:05Z

wazionapps
Mar 24, 2026
Maintainer Author

Updated Results — NEXO Brain v0.5.0

NEXO Brain has improved significantly since our initial benchmark run on v0.3.6:

System	F1	Recall	Hardware
NEXO Brain v0.5.0	0.588	74.9%	CPU only
NEXO Brain v0.3.6	0.297	45.9%	CPU only
GPT-4 (128K full context)	0.379	—	GPU cloud
Gemini Pro 1.0	0.313	—	GPU cloud

+55% F1 vs GPT-4. +98% F1 over our initial score (v0.3.6).

New in v0.5.0: 768-dim embeddings (bge-base), hybrid search (vector+BM25), cross-encoder reranking, multi-query decomposition, intelligent chunking, session summaries, adaptive decay, temporal indexing.
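The post does not say how the vector and BM25 result lists are combined in the hybrid search. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as NEXO Brain's method; the function name, the memory IDs, and the constant `k=60` are illustrative.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists into one list using RRF scores."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["m1", "m2", "m3"]  # hypothetical embedding search results
    bm25_hits = ["m3", "m1", "m4"]    # hypothetical keyword search results
    print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

RRF only needs ranks, so it avoids having to put cosine similarities and BM25 scores on a common scale. A cross-encoder reranker could then rescore the top of the fused list.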

All improvements available via npm install nexo-brain@0.5.0. Auto-migration for existing users.

wazionapps · 2026-03-31T02:27:41Z

wazionapps
Mar 31, 2026
Maintainer Author

Update: these results were measured on v0.5.0. Since then, NEXO Brain has added the Cognitive Cortex (v1.0), Knowledge Graph queries, and Smart Startup context loading. We expect all three to improve recall, but that has not been measured yet.

We plan to re-run the LoCoMo benchmark on v1.4+ and publish the updated numbers. If anyone wants to run it on their own setup, the benchmark script is straightforward to adapt.
