Most retrieval benchmarks dump a static corpus and test search quality. Real memory systems receive information incrementally — a support ticket each day, a sensor reading each hour, a clinical note each week — and must surface patterns that only emerge after enough data accumulates.
LENS streams timestamped episodes into your memory system chronologically, pauses at checkpoints to ask questions requiring synthesis across many episodes, and scores whether your system enables longitudinal reasoning — not just keyword retrieval.
If a single episode can answer the question, the benchmark is broken. LENS ensures signal only emerges from the progression.
V2 isolates the memory consolidation strategy from the retrieval architecture: all policies share the same underlying storage, embeddings, search, and agent loop — only the memory management policy varies.
Setup: 10 scopes, 7 consolidation policies, M=3 repetitions, Fact F1 scoring. 2,100 answers generated, 1,900 graded (90.5%).
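As a minimal sketch of what "only the policy varies" means, the ablation can be pictured as one pluggable step in an otherwise fixed pipeline. The class and method names below (`ConsolidationPolicy`, `consolidate`, etc.) are illustrative assumptions, not the repo's actual API; the real policies live in `v2-synix-benchmark/src/bench/`, and the labels in the comments refer to rows of the results table below.

```python
# Illustrative sketch only: names are assumptions, not LENS internals.
# Storage, embeddings, the search index, and the agent loop stay fixed;
# each policy differs only in how it turns streamed episodes into the
# derived context injected into the agent's system prompt.
from abc import ABC, abstractmethod


class ConsolidationPolicy(ABC):
    @abstractmethod
    def consolidate(self, episodes: list[str]) -> str:
        """Return derived context to inject into the agent's system prompt."""


class NoSynthesis(ConsolidationPolicy):
    # Roughly the "base" row: retrieval only, no synthesized context.
    def consolidate(self, episodes: list[str]) -> str:
        return ""


class MapReduceSummary(ConsolidationPolicy):
    # Roughly the "summary" row: progressive map-reduce summarization.
    def __init__(self, summarize):
        self.summarize = summarize  # any LLM summarization callable

    def consolidate(self, episodes: list[str]) -> str:
        partial = [self.summarize(e) for e in episodes]   # map
        return self.summarize("\n\n".join(partial))       # reduce


class FacetedFolds(ConsolidationPolicy):
    # Roughly the "core_faceted" row: four parallel folds whose outputs merge.
    FACETS = ("entities", "relations", "events", "causes")

    def __init__(self, fold):
        self.fold = fold  # callable: (facet_name, episodes) -> str

    def consolidate(self, episodes: list[str]) -> str:
        return "\n\n".join(self.fold(f, episodes) for f in self.FACETS)
```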
| Rank | Policy | Fact F1 | Description |
|---|---|---|---|
| 1 | core_faceted | 0.466 | 4 parallel folds (entity/relation/event/cause) + merge |
| 2 | summary | 0.443 | Progressive map-reduce summarization |
| 3 | core | 0.441 | Single-fold working memory (Letta/MemGPT pattern) |
| 4 | core_structured | 0.432 | Schema-driven structured observations (Mastra/ACE pattern) |
| 5 | core_maintained | 0.398 | Core memory + iterative refinement |
| 6 | base | 0.381 | Raw BM25 + semantic retrieval, no synthesis |
| 7 | null | 0.055 | No memory (parametric knowledge only) |
Key findings:
- Any memory beats no memory — the null→base gap (+0.326) accounts for 79% of the total improvement
- Faceted decomposition wins — 4 parallel cognitive folds capture more signal than a single pass (+0.025 over core)
- Refinement is dangerous — iterative consolidation prunes useful signal (-0.043 vs core)
- No universal best strategy — Kendall's W = 0.145 (weak concordance); different scopes favor different policies
- Domain matters more than strategy — scope difficulty spans 0.109 to 0.763, a 6.9x range
Full results, per-scope breakdown, and statistical analysis: LEADERBOARD.md
Research brief (PDF with figures): v2-synix-benchmark/studies/grid/brief/research_brief.pdf
V1 compared 11 memory system architectures (Letta, GraphRAG, SQLite variants, etc.) across 6 scopes. The headline finding was that agent query quality — not memory architecture — is the binding constraint. See LEADERBOARD.md for V1 results and methodology.
```
┌─────────────┐
│ spec.yaml │ Dataset definition
└──────┬──────┘
│
┌──────▼──────┐
│ Episodes │ Signal + distractor episodes
└──────┬──────┘
│ stream chronologically
┌──────▼──────┐
│ Bank Build │ Chunk → embed → search index
│ + Policy │ + consolidation (fold/summary/faceted)
└──────┬──────┘
│ at checkpoints...
┌──────▼──────┐
│ Agent │ Tool-use LLM interrogates
│ + Tools │ memory via search + context
└──────┬──────┘
│
┌──────▼──────┐
│ Scorer │ Fact F1 via few-shot
│ (per-fact) │ LLM grading
└─────────────┘
```
- Episodes stream chronologically — Signal episodes follow a 5-phase narrative arc (baseline → early signal → red herring → escalation → root cause), interleaved with format-matched distractors.
- Bank build applies consolidation policy — Episodes are chunked and indexed. Depending on the policy, derived context is synthesized (fold, summary, faceted decomposition, etc.) and injected into the agent's system prompt.
- At checkpoints, an LLM agent interrogates memory — The agent answers questions using `memory_search` and injected context. No direct access to raw episodes.
- Fact F1 scoring — Each key fact is graded as present/partial/absent by a few-shot LLM judge. F1 is computed across all facts per question. (The full loop is sketched below.)
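Putting those four steps together, the evaluation loop looks roughly like the following. This is a hedged paraphrase, not the runner's actual code: `build_bank`, `answer`, and `grade_fact` are hypothetical stand-ins for the bank builder, agent harness, and LLM judge, and the partial-credit weighting in `fact_f1` is an illustrative assumption rather than the real per-fact F1 computation.

```python
# Rough paraphrase of the evaluation loop; function names and the
# partial-credit weighting are illustrative assumptions, not LENS internals.
from dataclasses import dataclass


@dataclass
class Checkpoint:
    after_episode: int        # pause the stream after this many episodes
    question: str
    key_facts: list[str]      # facts that require multi-episode synthesis


def fact_f1(grades: list[str]) -> float:
    """Toy aggregate: present=1, partial=0.5, absent=0, averaged over facts.

    Stand-in only; the real scorer computes a per-fact F1 from the judge's
    present/partial/absent grades.
    """
    credit = {"present": 1.0, "partial": 0.5, "absent": 0.0}
    return sum(credit[g] for g in grades) / len(grades) if grades else 0.0


def run_scope(episodes, checkpoints, build_bank, answer, grade_fact):
    scores, seen = [], []
    for i, episode in enumerate(episodes, start=1):   # stream chronologically
        seen.append(episode)
        for cp in (c for c in checkpoints if c.after_episode == i):
            bank = build_bank(seen)             # chunk, embed, apply consolidation policy
            reply = answer(bank, cp.question)   # agent uses search tools + injected context only
            grades = [grade_fact(fact, reply) for fact in cp.key_facts]
            scores.append(fact_f1(grades))
    return sum(scores) / len(scores) if scores else 0.0
```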
```bash
# Clone and install
git clone https://github.com/synix-dev/lens-benchmark.git
cd lens-benchmark
uv sync --all-extras

# Run tests
uv run pytest tests/unit/ -v
```

See QUICKSTART.md for a full walkthrough.
Scopes define benchmark scenarios. Each scope has:
- A domain (system logs, clinical notes, financial reports, ...)
- A 5-phase narrative arc with signal distributed across episodes
- Key facts that require multi-episode synthesis
- Questions at checkpoints testing longitudinal reasoning
Current scopes span: cascading failures, financial irregularity, clinical signals, environmental drift, insider threats, market regimes, jailbreak detection, corporate acquisition, shadow APIs, clinical trials, zoning corruption, therapy chat, implicit decisions, epoch classification, value inversion, and parking friction.
Full guide: SCOPE_GUIDE.md
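Assuming a scope's spec boils down to the core types listed in the repository layout below (Episode, Question, GroundTruth), its contents can be pictured roughly like this. The field names are hypothetical and meant only to mirror the description above; the real schema is documented in SCOPE_GUIDE.md and the specs under `datasets/scopes/`.

```python
# Hypothetical data model for a scope; field names are assumptions, not the
# real spec.yaml schema.
from dataclasses import dataclass, field


@dataclass
class Episode:
    timestamp: str
    text: str
    phase: str        # baseline / early signal / red herring / escalation / root cause
    is_signal: bool   # False for format-matched distractors


@dataclass
class Question:
    checkpoint: int           # position in the episode stream where it is asked
    text: str
    key_facts: list[str]      # each fact should require multi-episode synthesis


@dataclass
class Scope:
    name: str                 # e.g. "cascading_failures" (hypothetical identifier)
    domain: str               # system logs, clinical notes, financial reports, ...
    episodes: list[Episode] = field(default_factory=list)
    questions: list[Question] = field(default_factory=list)
```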
```
src/lens/
  adapters/            MemoryAdapter ABC, null/sqlite builtins, registry
  agent/               Agent harness, tool bridge, budget enforcement
  cli/                 Click CLI (run, score, report, smoke, ...)
  core/                Episode, Question, GroundTruth, ScoreCard
  datagen/synix/       Two-stage dataset generation pipeline
  datasets/            Dataset loading
  matcher/             Answer matching
  report/              Report generation
  runner/              Benchmark runner with EpisodeVault anticheat
  scorer/              3-tier scoring (mechanical, judge, differential)
datasets/scopes/       16 scope specifications + generated artifacts
tests/unit/            1040 unit tests
v2-synix-benchmark/    V2 ablation study workspace
  src/bench/           Bank builder, policies, agent, scorer, runtime
  studies/grid/        Full grid results, figures, research brief
docs/                  Documentation and guides
```
| Document | Description |
|---|---|
| Quick Start | Install and run |
| Scope Guide | Design and build a benchmark scope |
| Leaderboard | V1 + V2 results and methodology |
| Research Brief | V2 ablation study (PDF) |
| Architecture | Core data flow and scoring internals |
| Methodology | Dataset generation and contamination prevention |
| Contributing | How to contribute |
```bibtex
@software{lens_benchmark,
  title  = {LENS: Longitudinal Evidence-backed Narrative Signals},
  author = {Mark Lubin},
  year   = {2025},
  url    = {https://github.com/synix-dev/lens-benchmark}
}
```