GRADE

Grounded Reasoning & Analysis for Data in Education

GRADE is an open benchmark that measures how accurately and usefully AI systems analyze education program data. It was built by Pearl — an AI tutoring platform working directly with school districts — to answer a practical question: can AI produce education analytics that a program manager or district analyst could trust and act on?

The Problem

AI tools are being adopted across education to help program managers and analysts make sense of attendance records, session logs, outcome summaries, and research evidence. The quality of that analysis directly affects the decisions practitioners make about students.

General-purpose benchmarks don't measure this. A model can excel at broad reasoning tasks while still hallucinating attendance rates, misattributing causation to correlated signals, or failing to flag that a subgroup is too small to report on. GRADE was built to fill that gap.

Leaderboard

Official run, June 10–12 2026. Three packs: operations (11 tasks × 5 runs), outcomes (5 tasks × 1 run), equity research (10 tasks × 1 run). All rows ran with a 65,536-token completion budget so reasoning never crowds out the answer; judging is cross-family (no judge scores its own model family). Full caveats in docs/methodology.md; run notes in RUN_FINDINGS.md.

Operations (program-data analysis, 5 repetitions)

Rank	Model	Composite	Grounding	Insight	Evidence	Calibration	Consistency	Coverage
1	`openai/gpt-5.5@xhigh`	66.9%	91.1%	45.0%	77.5%	22.4%	63.0%	full
2	`openai/gpt-5.5@high`	65.4%	87.7%	46.3%	76.9%	18.5%	62.9%	full
3	`openai/gpt-5.5@medium`	64.9%	88.6%	44.2%	73.4%	17.8%	66.2%	full
4	`google/gemini-3.1-pro-preview`	63.5%	79.2%	45.0%	76.9%	31.2%	61.3%	8/11 [1]
5	`google/gemini-3.5-flash`	61.0%	79.5%	44.7%	69.4%	23.6%	61.2%	10/11 [1]
6	`openai/gpt-5.5@low`	52.4%	62.4%	35.7%	59.3%	23.2%	64.2%	full
7	`anthropic/claude-opus-4.8`	41.9%	45.5%	30.1%	31.2%	27.3%	75.5%	full
8	`anthropic/claude-sonnet-4.6`	38.3%	42.6%	29.2%	27.7%	18.3%	76.7%	full
9	`anthropic/claude-haiku-4.5`	37.9%	47.9%	22.0%	20.6%	20.5%	70.7%	full
10	`openai/gpt-oss-120b`	25.2%	22.2%	7.7%	18.7%	25.6%	74.2%	3/11 DNF [2]
11	`nvidia/nemotron-3-ultra-550b-a55b`	25.1%	30.2%	7.3%	26.9%	8.3%	67.6%	full

[1] Remaining tasks in flight at publication; composite covers listed tasks only. [2] gpt-oss-120b's 131k context cannot fit the ~190k-token data payloads on 8 of 11 tasks — reported as a finding, not ranked competitively.

Outcomes (trend analysis, single run)

Rank	Model	Composite	Grounding	Calibration	Model Cost
1	`openai/gpt-5.5@low`	90.1%	100.0%	59.3%	$0.17
2	`openai/gpt-5.5@medium`	90.0%	100.0%	55.7%	$0.21
3	`nvidia/nemotron-3-ultra-550b-a55b`	88.3%	100.0%	45.8%	$0.04
4	`openai/gpt-5.5@xhigh`	86.9%	100.0%	45.0%	$0.67
5	`openai/gpt-5.5@high`	86.6%	100.0%	39.2%	$0.51
6	`anthropic/claude-opus-4.8`	86.2%	100.0%	46.0%	$0.30
7	`openai/gpt-oss-120b`	84.0%	100.0%	39.9%	$0.00
8	`anthropic/claude-sonnet-4.6`	83.0%	100.0%	43.6%	$0.14
9	`google/gemini-3.5-flash`	81.2%	100.0%	29.9%	$0.17
10	`google/gemini-3.1-pro-preview`	80.5%	100.0%	46.1%	$0.17
11	`anthropic/claude-haiku-4.5`	76.7%	90.0%	43.8%	$0.05

Equity research (subgroup & research reasoning, single run)

Rank	Model	Composite	Grounding	Calibration	Model Cost
1	`openai/gpt-5.5@medium`	83.0%	79.8%	58.9%	$1.10
2	`openai/gpt-5.5@xhigh`	82.8%	81.5%	53.1%	$2.45
3	`nvidia/nemotron-3-ultra-550b-a55b`	81.8%	78.5%	52.8%	$0.10
4	`openai/gpt-5.5@high`	81.1%	77.7%	52.8%	$1.97
5	`openai/gpt-5.5@low`	80.6%	76.0%	58.3%	$0.64
6	`anthropic/claude-opus-4.8`	75.4%	69.8%	49.7%	$0.95
7	`google/gemini-3.1-pro-preview`	73.8%	67.8%	48.3%	$0.59
8	`google/gemini-3.5-flash`	73.0%	67.8%	39.2%	$0.53
9	`openai/gpt-oss-120b`	70.0%	62.3%	32.8%	$0.01
10	`anthropic/claude-sonnet-4.6`	68.7%	55.8%	42.8%	$0.47
11	`anthropic/claude-haiku-4.5`	66.9%	66.0%	35.8%	$0.17

Composite = 35% Grounding + 20% Insight + 15% Evidence + 15% Calibration + 10% Consistency + 5% Structure (consistency excluded and weights renormalized on single-run packs).

Key findings. (1) Reasoning effort pays only where tasks are hard: GPT-5.5's effort curve rises steeply from low→medium on heavy data analysis (52%→65%) then saturates, and inverts on light trend summaries (low beats xhigh). (2) Scores are envelope-relative: under a 4k token budget the default-thinking models (Gemini, Nemotron) score near the floor because reasoning consumes the budget before any answer appears — the same models are mid-pack or better at 64k. Nemotron still exhausts even 64k on the largest tasks (it reasons by exhaustive enumeration), so its operations score reflects budget economics, not analytical ability — it places top-three on both smaller packs. (3) Accuracy and stability trade off: GPT-5.5 leads grounding accuracy while Claude models lead run-to-run consistency; Claude Haiku 4.5 delivers ~90% of Opus's operations composite at ~18% of its cost. (4) Calibration is universally weak (18–31%): no model reliably acknowledges data limitations without prompting — the clearest open problem this benchmark surfaces.

What GRADE Measures

Five tracks across three synthetic fixture packs — 26 tasks total.

Track	Tasks	Core challenge
1 · Grounded Retrieval & Computation	6	Exact counts, rates, and date-filtered aggregates from CSVs — no inference required; the only failure mode is hallucination
2 · Program Snapshot & Trend Interpretation	5	Month-over-month summaries and anomaly detection with epistemic discipline: the data contains correlations that don't support causal claims
3 · Operational Coaching & Recommendations	5	Evidence-backed recommendations with entity IDs, numeric rates, and source-file citations — generic advice scores zero
4 · Equity & Subgroup Interpretation	5	Suppressed low-N groups, attendance and outcome disparities, and plain-language translation for practitioners
5 · Program Effectiveness & Research Reasoning	5	Synthesis of local fixture data with external research references, including contradictory findings

Six Scoring Dimensions

Every response is scored on six dimensions combined into a weighted composite:

Dimension	Weight	How it's scored
Grounding Accuracy	35%	Deterministic: structured numeric outputs vs. gold facts with tolerance
Insight Quality	20%	LLM judge: coverage and depth of expected analytical findings
Evidence Linkage	15%	LLM judge: are claims traced to specific files and columns?
Calibration & Limitation Handling	15%	Deterministic + judge fallback: required caveats present, forbidden claims absent
Consistency	10%	Automated: finding/ranking/metric stability across 5 repeated runs
Structure & Usability	5%	LLM judge: format clarity and scannability

Quick Start

No API key required to validate your installation.

Requirements: Python 3.12+

git clone https://github.com/PearlEng/grade.git
cd grade
pip install -e .

# Smoke test with the built-in stub adapter (network-free, deterministic)
python -m runner.cli \
    --pack operations \
    --adapter stub \
    --runs 1 \
    --out /tmp/grade_out

Produces raw_outputs.jsonl (per-run model outputs) and result.json (aggregated scorecard) under /tmp/grade_out.

Run a real model

pip install -e ".[openrouter]"
export OPENROUTER_API_KEY="sk-or-..."

python -m runner.cli \
    --pack operations \
    --adapter openrouter \
    --model openai/gpt-4o \
    --runs 5 \
    --out /tmp/grade_real

Any model available on OpenRouter works. A full 26-task × 5-run evaluation is 130 model calls and typically completes in under two hours.

Render a scorecard

python -m benchmark.reports.scorecard \
    --results-dir /tmp/grade_real \
    --out /tmp/grade_reports
# → /tmp/grade_reports/scorecard.md  (Markdown, per-track breakdown)
# → /tmp/grade_reports/result.json   (machine-readable, result_schema)

For a full walkthrough — running all three packs, writing a custom adapter, and interpreting track profiles — see examples/quickstart.md.

How It Works

Synthetic Fixtures

All data is generated from a deterministic seed (42) and represents a fictional high-dosage tutoring program. Three packs are layered:

Pack	ID	Tracks	What's included
Operations	`pack_operations`	1, 3	Students, sessions, tutors, attendance, survey responses
Outcomes	`pack_outcomes`	2	All of Operations + month-over-month summary tables
Equity & Research	`pack_equity_research`	4, 5	All of Outcomes + subgroup summaries + 10 research references

No real students, schools, or programs are represented. Dataset cards: docs/dataset_cards/.

Scoring Pipeline

For each (model × task) pair:

Adapter calls the model N times (default: 5) with the task prompt and fixture context
C1 — Grounding compares structured_metrics in the response against gold_facts with numeric tolerance
C2 — Rubric passes the response to a judge model for quality dimensions (insight, evidence, structure)
C3 — Claim validation checks deterministically that required limitations are present and forbidden claims are absent
C4 — Consistency measures finding/ranking/metric stability across the N runs
C9 — Aggregation rolls up to task, track, and overall composites and writes result.json

Cross-Family Judging

No judge ever evaluates a model from the same family. Claude-family models are judged by GPT-5.5 (xhigh reasoning effort); all others are judged by Claude Opus 4.8. This eliminates house-style bias in the 40% of the composite scored by an LLM judge.

Reproducibility

Gold fact values are derived from ground_truth.json files computed from the generated data — not from the generator's input constants. Because the generator applies stochastic noise, realized values differ from spec targets. GRADE mitigates benchmark contamination by regenerating fixtures (new seed, recomputed ground_truth.json) for each numbered version; scores are only comparable within a version.

Pluggable Adapters

Adapters implement a minimal Protocol:

class Adapter(Protocol):
    name: str
    def run(self, task: dict, run_index: int = 0) -> dict: ...

Adapter	Flag	Purpose
Stub	`--adapter stub`	Deterministic, network-free — validates installation
OpenRouter	`--adapter openrouter`	Live inference for any OpenRouter model

To evaluate a model not on OpenRouter, implement the Protocol, register it in runner/cli.py, and pass --adapter your_name. See examples/quickstart.md.

Repository Layout

grade/
├── benchmark/
│   ├── datagen/        # Synthetic fixture generator (seeded, deterministic)
│   ├── reports/        # Scorecard and leaderboard renderers
│   ├── rubrics/        # Scorer implementations (C1–C4, C9)
│   ├── schemas/        # JSON Schema contracts (task, output, result)
│   └── tasks/          # 26 task definitions across 3 JSONL packs
├── runner/
│   ├── adapters/       # Adapter Protocol + built-in adapters
│   ├── cli.py          # Entry point: python -m runner.cli
│   ├── dispatcher.py   # Task loader and run coordinator
│   └── aggregator.py   # Result aggregation
├── fixtures/           # Pre-generated synthetic data (3 packs)
├── results/            # Pre-computed results from model runs
├── examples/           # Quickstart guide and sample outputs
└── docs/               # Methodology, scoring reference, dataset cards

Documentation

Document	Contents
docs/methodology.md	Mission, task taxonomy, scoring design, limitations, judging policy
docs/scoring.md	Full dimension definitions, aggregation rules, per-track rubric defaults
docs/dataset_cards/	File inventories, column contracts, intentional signal descriptions per pack
examples/quickstart.md	End-to-end walkthrough: install → smoke test → real model → scorecard → custom adapter
CONTRIBUTING.md	How to add tasks, fixture packs, and adapter implementations

Contributing

We welcome contributions — new tasks, fixture packs, adapter implementations, and methodology proposals. See CONTRIBUTING.md and GOVERNANCE.md.

About Pearl

GRADE is built and maintained by Pearl, an AI tutoring platform that partners with school districts to deliver high-dosage tutoring programs. Pearl works with program managers and district analysts daily, and GRADE reflects the analytical questions and failure modes we encounter in practice.

If you're a researcher, AI developer, or education technologist and want to discuss the benchmark or collaborate, reach out via GitHub Issues or at hello@tutorwithpearl.com.

Citing GRADE

If you use GRADE in your research, please cite the repository:

@software{pearl2026grade,
  author       = {Pearl},
  title        = {{GRADE}: Grounded Reasoning \& Analysis for Data in Education},
  year         = {2026},
  publisher    = {GitHub},
  version      = {0.1.0},
  url          = {https://github.com/PearlEng/grade},
  note         = {Open benchmark for AI evaluation in education analytics}
}

A Zenodo DOI for stable versioned citation will be added at the v0.1.0 release. Once available, prefer the DOI-based entry:

@software{pearl2026grade,
  author       = {Pearl},
  title        = {{GRADE}: Grounded Reasoning \& Analysis for Data in Education},
  year         = {2026},
  publisher    = {Zenodo},
  version      = {0.1.0},
  doi          = {10.5281/zenodo.XXXXXXX},
  url          = {https://doi.org/10.5281/zenodo.XXXXXXX}
}

GitHub also provides a formatted citation via the Cite this repository button in the sidebar (powered by CITATION.cff).

License

Apache 2.0. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github		.github
benchmark		benchmark
docs		docs
examples		examples
fixtures		fixtures
runner		runner
scripts		scripts
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CITATION.cff		CITATION.cff
CODEOWNERS		CODEOWNERS
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
GOVERNANCE.md		GOVERNANCE.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GRADE

The Problem

Leaderboard

Operations (program-data analysis, 5 repetitions)

Outcomes (trend analysis, single run)

Equity research (subgroup & research reasoning, single run)

What GRADE Measures

Six Scoring Dimensions

Quick Start

Run a real model

Render a scorecard

How It Works

Synthetic Fixtures

Scoring Pipeline

Cross-Family Judging

Reproducibility

Pluggable Adapters

Repository Layout

Documentation

Contributing

About Pearl

Citing GRADE

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

GRADE

The Problem

Leaderboard

Operations (program-data analysis, 5 repetitions)

Outcomes (trend analysis, single run)

Equity research (subgroup & research reasoning, single run)

What GRADE Measures

Six Scoring Dimensions

Quick Start

Run a real model

Render a scorecard

How It Works

Synthetic Fixtures

Scoring Pipeline

Cross-Family Judging

Reproducibility

Pluggable Adapters

Repository Layout

Documentation

Contributing

About Pearl

Citing GRADE

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages