Most retrieval benchmarks dump a static corpus and test search quality. Real memory systems receive information incrementally — a support ticket each day, a sensor reading each hour, a clinical note each week — and must surface patterns that only emerge after enough data accumulates.
LENS streams timestamped episodes into your memory system chronologically, pauses at checkpoints to ask questions requiring synthesis across many episodes, and scores whether your system enables longitudinal reasoning — not just keyword retrieval.
If a single episode can answer the question, the benchmark is broken. LENS ensures signal only emerges from the progression.
V2 isolates the memory consolidation strategy from the retrieval architecture: all policies share the same underlying storage, embeddings, search, and agent loop — only the memory management policy varies.
Setup: 10 scopes, 7 consolidation policies, M=3 repetitions, Fact F1 scoring. 2,100 answers generated, 1,900 graded (90.5%).
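As a minimal sketch of what "only the policy varies" means, the ablation can be pictured as one pluggable step in an otherwise fixed pipeline. The class and method names below (`ConsolidationPolicy`, `consolidate`, etc.) are illustrative assumptions, not the repo's actual API; the real policies live in `v2-synix-benchmark/src/bench/`, and the labels in the comments refer to rows of the results table below.

```python
# Illustrative sketch only: names are assumptions, not LENS internals.
# Storage, embeddings, the search index, and the agent loop stay fixed;
# each policy differs only in how it turns streamed episodes into the
# derived context injected into the agent's system prompt.
from abc import ABC, abstractmethod


class ConsolidationPolicy(ABC):
    @abstractmethod
    def consolidate(self, episodes: list[str]) -> str:
        """Return derived context to inject into the agent's system prompt."""


class NoSynthesis(ConsolidationPolicy):
    # Roughly the "base" row: retrieval only, no synthesized context.
    def consolidate(self, episodes: list[str]) -> str:
        return ""


class MapReduceSummary(ConsolidationPolicy):
    # Roughly the "summary" row: progressive map-reduce summarization.
    def __init__(self, summarize):
        self.summarize = summarize  # any LLM summarization callable

    def consolidate(self, episodes: list[str]) -> str:
        partial = [self.summarize(e) for e in episodes]   # map
        return self.summarize("\n\n".join(partial))       # reduce


class FacetedFolds(ConsolidationPolicy):
    # Roughly the "core_faceted" row: four parallel folds whose outputs merge.
    FACETS = ("entities", "relations", "events", "causes")

    def __init__(self, fold):
        self.fold = fold  # callable: (facet_name, episodes) -> str

    def consolidate(self, episodes: list[str]) -> str:
        return "\n\n".join(self.fold(f, episodes) for f in self.FACETS)
```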
| Rank | Policy | Fact F1 | Description |
|---|---|---|---|
| 1 | core_faceted | 0.466 | 4 parallel folds (entity/relation/event/cause) + merge |
| 2 | summary | 0.443 | Progressive map-reduce summarization |
| 3 | core | 0.441 | Single-fold working memory (Letta/MemGPT pattern) |
| 4 | core_structured | 0.432 | Schema-driven structured observations (Mastra/ACE pattern) |
| 5 | core_maintained | 0.398 | Core memory + iterative refinement |
| 6 | base | 0.381 | Raw BM25 + semantic retrieval, no synthesis |
| 7 | null | 0.055 | No memory (parametric knowledge only) |
Key findings:
- Any memory beats no memory — the null→base gap (+0.326) accounts for 79% of the total improvement
- Faceted decomposition wins — 4 parallel cognitive folds capture more signal than a single pass (+0.025 over core)
- Refinement is dangerous — iterative consolidation prunes useful signal (-0.043 vs core)
- No universal best strategy — Kendall's W = 0.145 (weak concordance); different scopes favor different policies
- Domain matters more than strategy — scope difficulty spans 0.109 to 0.763, a 6.9x range
Full results, per-scope breakdown, and statistical analysis: LEADERBOARD.md
Research brief (PDF with figures): v2-synix-benchmark/studies/grid/brief/research_brief.pdf
V1 compared 11 memory system architectures (Letta, GraphRAG, SQLite variants, etc.) across 6 scopes. The headline finding was that agent query quality — not memory architecture — is the binding constraint. See LEADERBOARD.md for V1 results and methodology.
```
┌─────────────┐
│ spec.yaml │ Dataset definition
└──────┬──────┘
│
┌──────▼──────┐
│ Episodes │ Signal + distractor episodes
└──────┬──────┘
│ stream chronologically
┌──────▼──────┐
│ Bank Build │ Chunk → embed → search index
│ + Policy │ + consolidation (fold/summary/faceted)
└──────┬──────┘
│ at checkpoints...
┌──────▼──────┐
│ Agent │ Tool-use LLM interrogates
│ + Tools │ memory via search + context
└──────┬──────┘
│
┌──────▼──────┐
│ Scorer │ Fact F1 via few-shot
│ (per-fact) │ LLM grading
└─────────────┘
```
- Episodes stream chronologically — Signal episodes follow a 5-phase narrative arc (baseline → early signal → red herring → escalation → root cause), interleaved with format-matched distractors.
- Bank build applies consolidation policy — Episodes are chunked and indexed. Depending on the policy, derived context is synthesized (fold, summary, faceted decomposition, etc.) and injected into the agent's system prompt.
- At checkpoints, an LLM agent interrogates memory — The agent answers questions using `memory_search` and injected context. No direct access to raw episodes.
- Fact F1 scoring — Each key fact is graded as present/partial/absent by a few-shot LLM judge. F1 is computed across all facts per question. (The full loop is sketched below.)
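Putting those four steps together, the evaluation loop looks roughly like the following. This is a hedged paraphrase, not the runner's actual code: `build_bank`, `answer`, and `grade_fact` are hypothetical stand-ins for the bank builder, agent harness, and LLM judge, and the partial-credit weighting in `fact_f1` is an illustrative assumption rather than the real per-fact F1 computation.

```python
# Rough paraphrase of the evaluation loop; function names and the
# partial-credit weighting are illustrative assumptions, not LENS internals.
from dataclasses import dataclass


@dataclass
class Checkpoint:
    after_episode: int        # pause the stream after this many episodes
    question: str
    key_facts: list[str]      # facts that require multi-episode synthesis


def fact_f1(grades: list[str]) -> float:
    """Toy aggregate: present=1, partial=0.5, absent=0, averaged over facts.

    Stand-in only; the real scorer computes a per-fact F1 from the judge's
    present/partial/absent grades.
    """
    credit = {"present": 1.0, "partial": 0.5, "absent": 0.0}
    return sum(credit[g] for g in grades) / len(grades) if grades else 0.0


def run_scope(episodes, checkpoints, build_bank, answer, grade_fact):
    scores, seen = [], []
    for i, episode in enumerate(episodes, start=1):   # stream chronologically
        seen.append(episode)
        for cp in (c for c in checkpoints if c.after_episode == i):
            bank = build_bank(seen)             # chunk, embed, apply consolidation policy
            reply = answer(bank, cp.question)   # agent uses search tools + injected context only
            grades = [grade_fact(fact, reply) for fact in cp.key_facts]
            scores.append(fact_f1(grades))
    return sum(scores) / len(scores) if scores else 0.0
```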
```bash
# Clone and install
git clone https://github.com/synix-dev/lens-benchmark.git
cd lens-benchmark
uv sync --all-extras

# Run tests
uv run pytest tests/unit/ -v
```

See QUICKSTART.md for a full walkthrough.
Scopes define benchmark scenarios. Each scope has:
- A domain (system logs, clinical notes, financial reports, ...)
- A 5-phase narrative arc with signal distributed across episodes
- Key facts that require multi-episode synthesis
- Questions at checkpoints testing longitudinal reasoning
Current scopes span: cascading failures, financial irregularity, clinical signals, environmental drift, insider threats, market regimes, jailbreak detection, corporate acquisition, shadow APIs, clinical trials, zoning corruption, therapy chat, implicit decisions, epoch classification, value inversion, and parking friction.
Full guide: SCOPE_GUIDE.md
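Assuming a scope's spec boils down to the core types listed in the repository layout below (Episode, Question, GroundTruth), its contents can be pictured roughly like this. The field names are hypothetical and meant only to mirror the description above; the real schema is documented in SCOPE_GUIDE.md and the specs under `datasets/scopes/`.

```python
# Hypothetical data model for a scope; field names are assumptions, not the
# real spec.yaml schema.
from dataclasses import dataclass, field


@dataclass
class Episode:
    timestamp: str
    text: str
    phase: str        # baseline / early signal / red herring / escalation / root cause
    is_signal: bool   # False for format-matched distractors


@dataclass
class Question:
    checkpoint: int           # position in the episode stream where it is asked
    text: str
    key_facts: list[str]      # each fact should require multi-episode synthesis


@dataclass
class Scope:
    name: str                 # e.g. "cascading_failures" (hypothetical identifier)
    domain: str               # system logs, clinical notes, financial reports, ...
    episodes: list[Episode] = field(default_factory=list)
    questions: list[Question] = field(default_factory=list)
```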
```
src/lens/
  adapters/            MemoryAdapter ABC, null/sqlite builtins, registry
  agent/               Agent harness, tool bridge, budget enforcement
  cli/                 Click CLI (run, score, report, smoke, ...)
  core/                Episode, Question, GroundTruth, ScoreCard
  datagen/synix/       Two-stage dataset generation pipeline
  datasets/            Dataset loading
  matcher/             Answer matching
  report/              Report generation
  runner/              Benchmark runner with EpisodeVault anticheat
  scorer/              3-tier scoring (mechanical, judge, differential)
datasets/scopes/       16 scope specifications + generated artifacts
tests/unit/            1040 unit tests
v2-synix-benchmark/    V2 ablation study workspace
  src/bench/           Bank builder, policies, agent, scorer, runtime
  studies/grid/        Full grid results, figures, research brief
docs/                  Documentation and guides
```
| Document | Description |
|---|---|
| Quick Start | Install and run |
| Scope Guide | Design and build a benchmark scope |
| Leaderboard | V1 + V2 results and methodology |
| Research Brief | V2 ablation study (PDF) |
| Architecture | Core data flow and scoring internals |
| Methodology | Dataset generation and contamination prevention |
| Contributing | How to contribute |
```bibtex
@software{lens_benchmark,
  title  = {LENS: Longitudinal Evidence-backed Narrative Signals},
  author = {Mark Lubin},
  year   = {2025},
  url    = {https://github.com/synix-dev/lens-benchmark}
}
```