fix(scoring,judge,adapter): methodology review fixes — criticals C-1..C-3, mediums M-1/M-2, highs H-1..H-5 by jonathanbbechtel · Pull Request #41 · PearlEng/grade

jonathanbbechtel · 2026-06-10T21:06:46Z

Summary

Lands the full methodology-review fix stack on main ahead of the official leaderboard run:

C-1: model registry corrected to live OpenRouter slugs + reasoning-effort passthrough
C-2/C-3: cross-family judge policy, rationale-then-score, no silent judge failures
M-1/M-2: C1 sees limitations; judge only scores C2-owned dimensions (halves judge spend)
H-1..H-5 (squashed via fix(scoring,adapter): close out high-severity methodology findings H-1..H-5 #40): grounding gated on claim context, non-numeric fact credit, consistency excluded from runs=1 composites, max_tokens 4096 + truncation surfacing, candidate reasoning effort via @<effort> model suffix

All findings in docs/methodology_review_findings.md are now FIXED or DISCLOSED.

🤖 Generated with Claude Code

…e/contamination disclosures

Adds docs/methodology_review_findings.md, the full pre-launch methodology review of the evaluation harness (14 findings across critical/high/medium with fix status and file references), and extends docs/methodology.md with three disclosure sections it calls for:

- Statistical precision: 26 tasks, per-track N as low as 5; per-track gaps within plausible noise
- Judge scope: the LLM judge scores against gold material, not raw fixture CSVs; numbers are verified by the deterministic C1 scorer
- Benchmark contamination: fixtures regenerate per numbered benchmark version; scores comparable only within a grade_version

Also documents the cross-family judge-selection policy in the C2 scoring bullet (Opus 4.8 judges all non-Claude candidates; GPT-5.5 xhigh judges Claude candidates) with its two residual caveats.
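The cross-family judge policy stated above could be sketched roughly as follows. This is an illustration, not the repository's code: the function name `select_judge`, the judge slugs, and the prefix-based family check are all assumptions made for the example; only the policy itself (Opus 4.8 for non-Claude candidates, GPT-5.5 at xhigh for Claude candidates) comes from the commit message.

```python
from typing import Optional, Tuple

# Hypothetical slugs: the PR names the judge models but not these exact strings.
DEFAULT_JUDGE = "anthropic/claude-opus-4.8"
CLAUDE_CANDIDATE_JUDGE = "openai/gpt-5.5"
CLAUDE_CANDIDATE_JUDGE_EFFORT = "xhigh"


def select_judge(candidate_slug: str,
                 override: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (judge_slug, reasoning_effort) for one candidate model."""
    if override:
        # An explicit override wins over the policy.
        return override, None
    # Assumption: the model family is inferred from the provider prefix.
    if candidate_slug.lower().startswith("anthropic/"):
        return CLAUDE_CANDIDATE_JUDGE, CLAUDE_CANDIDATE_JUDGE_EFFORT
    return DEFAULT_JUDGE, None
```

Under these assumptions no candidate is ever judged by a model from its own family, which is the property the policy is meant to guarantee.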

…ning-effort passthrough

SEED_MODELS mapped shorthands to stale placeholder slugs: "gpt-5" actually called openai/gpt-4o, "gemini-2.5-pro" called google/gemini-pro-1.5, and "claude-opus-4-7" called anthropic/claude-opus-4.5. A leaderboard published from these would attribute GPT-4o scores to GPT-5 (methodology review C-1). The adapter default also used a dash-form slug (claude-sonnet-4-5) that does not exist on OpenRouter; live slugs use dots.

- Replace SEED_MODELS with the launch lineup, every slug verified against the live /api/v1/models endpoint on 2026-06-10: Claude Opus 4.8 / Sonnet 4.6 / Haiku 4.5, GPT-5.5, GPT-OSS 120B, Gemini 3.5 Flash, Gemini 3.1 Pro (preview), Nemotron 3 Ultra
- Default adapter model -> anthropic/claude-sonnet-4.6
- post_chat_completion gains an optional reasoning_effort parameter (forwarded as {"reasoning": {"effort": ...}}) for reasoning-capable models; needed by the GPT-5.5 judge and the candidate-side effort sweep
- Update stale slugs in README, quickstart, and adapter docstrings
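The reasoning-effort passthrough described above might look something like this minimal sketch. The helper name `build_chat_payload` and the surrounding fields (`model`, `messages`, `max_tokens`) are assumptions for illustration; only the `{"reasoning": {"effort": ...}}` shape is taken from the commit message.

```python
from typing import Any, Dict, List, Optional


def build_chat_payload(model: str,
                       messages: List[Dict[str, str]],
                       max_tokens: int,
                       reasoning_effort: Optional[str] = None) -> Dict[str, Any]:
    """Build a chat-completion request body (hypothetical helper)."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
    }
    # Only attach the reasoning block when an effort was requested, so
    # non-reasoning models receive an unchanged payload.
    if reasoning_effort is not None:
        payload["reasoning"] = {"effort": reasoning_effort}
    return payload
```

Keeping the parameter optional means existing callers that never pass an effort send exactly the same request as before.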

…t failures

Closes out methodology review findings C-2, C-3, and M-6: the judge could silently produce meaningless or biased scores in three ways.

Null-judge guard (C-2): without --judge, the C2-owned dimensions (40% of the composite) were a flat 0.5 placeholder for every model with nothing in the result recording it.

- dispatcher stamps a "null_judge" flag into every task's scorer_flags when no live judge is configured
- runner.cli and scripts.run_models refuse the openrouter adapter without --judge unless --allow-null-judge is passed (stub/smoke runs unaffected)
- a judge-client construction failure in the batch runner now records the model as failed instead of silently downgrading to the null judge

Judge selection policy (C-3): no judge ever shares a model family with its candidate, so house style cannot bias the judged dimensions. Claude Opus 4.8 (strongest available) judges every candidate except Claude-family models, which GPT-5.5 at xhigh reasoning effort judges instead. Implemented in select_judge_model(); both CLIs apply it per candidate and print the judge in use; --judge-model overrides. The old default judge slug (anthropic/claude-opus-4-5, dash form) did not exist on OpenRouter, so every live judge call would have failed.

Rationale-then-score (M-6): the judge now justifies in 2-3 sentences and ends with a "SCORE: <float>" line instead of replying with a bare float. Parsing prefers the last SCORE: line and falls back to the last float, so numbers quoted in the justification are never mistaken for the verdict. A non-parseable reply retries once, then raises JudgeScoreError, surfacing in failures.json instead of silently scoring the candidate 0.0. Non-reasoning judge max_tokens 16 -> 384; reasoning judges get 8192 (reasoning tokens share the budget with the visible answer).
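The rationale-then-score parsing described above can be illustrated with a small sketch. It assumes a regex-based parser and a caller-supplied `ask_judge` callable, both hypothetical; the `JudgeScoreError` name, the "last SCORE: line, else last float" preference, and the single retry come from the commit message.

```python
import re
from typing import Callable


class JudgeScoreError(ValueError):
    """Raised when no score can be extracted from a judge reply."""


# A line consisting only of "SCORE: <float>".
_SCORE_LINE = re.compile(r"^\s*SCORE:\s*([-+]?\d*\.?\d+)\s*$", re.MULTILINE)
_ANY_FLOAT = re.compile(r"[-+]?\d*\.?\d+")


def parse_judge_score(reply: str) -> float:
    """Prefer the last SCORE: line; fall back to the last number in the reply."""
    score_lines = _SCORE_LINE.findall(reply)
    if score_lines:
        return float(score_lines[-1])
    floats = _ANY_FLOAT.findall(reply)
    if floats:
        return float(floats[-1])
    raise JudgeScoreError(f"no score found in judge reply: {reply!r}")


def score_with_retry(ask_judge: Callable[[], str], retries: int = 1) -> float:
    """Ask the judge, retrying on unparseable replies, then raise."""
    for attempt in range(retries + 1):
        try:
            return parse_judge_score(ask_judge())
        except JudgeScoreError:
            if attempt == retries:
                raise  # surface the failure instead of scoring 0.0
    raise AssertionError("unreachable")
```

Because the SCORE: line is matched first, a justification that quotes figures such as "977 students" does not change the parsed verdict.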

Closes out methodology review findings M-1 and M-2.

M-1: _score_c1_grounding never passed the output's limitations list to score_facts, even though the scorer documents it as a fallback search target. A gold-fact value stated only in a caveat ("only 977 students are reflected in this snapshot") was scored as not found. Now passed through.

M-2: the C2 loop judged all six rubric dimensions per run, then three (grounding_accuracy, calibration_limitation_handling, consistency) were discarded and overridden by C1/C3/C4: 390 wasted judge calls per model at 26 tasks x 5 runs. score_rubric gains a dimensions parameter (full-rubric weight validation unchanged; unknown names raise before any judge call) and the dispatcher requests only the three C2-owned dimensions, halving judge cost. Default score_rubric behavior (all six) is unchanged for existing callers.
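The M-2 arithmetic and the dimension-subset validation can be sketched as below. The helper names, the weights, and the three `c2_dimension_*` placeholders are invented for illustration (the PR does not name the C2-owned dimensions or give weights); the task, run, and dimension counts come from the commit message.

```python
import math
from typing import Dict, Iterable, List, Optional

# Illustrative rubric only: weights and the c2_dimension_* names are assumptions.
RUBRIC_WEIGHTS: Dict[str, float] = {
    "grounding_accuracy": 0.30,
    "calibration_limitation_handling": 0.20,
    "consistency": 0.10,
    "c2_dimension_a": 0.15,
    "c2_dimension_b": 0.15,
    "c2_dimension_c": 0.10,
}


def dimensions_to_judge(weights: Dict[str, float],
                        dimensions: Optional[Iterable[str]] = None) -> List[str]:
    """Validate the full rubric, then resolve which dimensions to judge."""
    # Weight validation always runs against the full rubric.
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("rubric weights must sum to 1.0")
    if dimensions is None:
        return list(weights)  # default: all dimensions, as before
    requested = list(dimensions)
    unknown = [d for d in requested if d not in weights]
    if unknown:
        # Fail before any judge call is made.
        raise ValueError(f"unknown rubric dimensions: {unknown}")
    return requested


def judge_calls(tasks: int, runs: int, n_dimensions: int) -> int:
    """Judge calls per model: one call per task, run, and dimension."""
    return tasks * runs * n_dimensions


before = judge_calls(26, 5, 6)  # all six dimensions judged
after = judge_calls(26, 5, 3)   # only the three C2-owned dimensions
print(before, after, before - after)  # 780 390 390
```

The 390 saved calls are exactly the three overridden dimensions across 26 tasks and 5 runs, which is why the spend halves.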

…1..H-5 (#40)

* fix(scoring): gate free-text numeric matches on claim context (H-1)

Grounding accuracy credited a gold fact when ANY number anywhere in the output (up to 50 prose sentences, plus a /100 percent-normalized variant of every candidate) fell within tolerance. With the leniency floors, an unrelated "29.6%" anywhere in the response credited a gold count of 30, rewarding number-dense outputs regardless of relevance on the highest-weighted dimension.

Free-text numeric matches (key_findings / limitations) now require the containing sentence to share at least one content token with the gold claim (stopwords and numbers excluded, trailing plural-s normalized, hyphenated IDs like sch-001 preserved). An empty claim disables the gate. structured_metrics matching deliberately stays key-agnostic: structured values are deliberate model assertions, and key-name matching was previously found too brittle against real model outputs. The spam guard targets prose only.

* fix(scoring): credit non-numeric facts via substring or token overlap (H-2)

Non-numeric gold facts were located by substring containment but then re-scored with exact string equality, so a finding containing the claim plus any other words scored 0.0, making non-numeric facts near-universal misses unless the model echoed the claim verbatim. The module docstring promised token-overlap matching that was never implemented.

The located match is now scored directly: stage 1 is substring containment (either direction, normalized); stage 2 credits paraphrases when >= 50% of the claim's content tokens appear in the entry (TEXT_FACT_OVERLAP_THRESHOLD, mirroring C3's claim-validation threshold). Match provenance is recorded in the method label (text_match[key_findings+token_overlap] etc.).
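The two-stage text matching from H-2 might be sketched as follows. The function names, the stopword list, and the tokenizer are assumptions; in particular, reusing H-1's content-token rules (numbers dropped, trailing plural-s trimmed, hyphenated IDs kept) for H-2 is a guess made for the example. The 0.5 threshold and the stage order come from the commit message.

```python
import re
from typing import Optional, Set

TEXT_FACT_OVERLAP_THRESHOLD = 0.5
# Tiny illustrative stopword list, not the project's.
STOPWORDS = frozenset({"a", "an", "the", "in", "of", "on", "to", "and", "is", "are"})


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def content_tokens(text: str) -> Set[str]:
    """Tokens minus stopwords and pure numbers; IDs like sch-001 stay whole."""
    tokens = set()
    for tok in re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()):
        if tok in STOPWORDS or tok.replace("-", "").isdigit():
            continue
        # Crude plural normalization: trim one trailing "s" (not "ss").
        if len(tok) > 3 and tok.endswith("s") and not tok.endswith("ss"):
            tok = tok[:-1]
        tokens.add(tok)
    return tokens


def match_text_fact(claim: str, entry: str) -> Optional[str]:
    """Return the match method ("substring" or "token_overlap"), or None."""
    c, e = normalize(claim), normalize(entry)
    # Stage 1: substring containment in either direction.
    if c and e and (c in e or e in c):
        return "substring"
    # Stage 2: share of the claim's content tokens found in the entry.
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return None
    overlap = len(claim_tokens & content_tokens(entry)) / len(claim_tokens)
    return "token_overlap" if overlap >= TEXT_FACT_OVERLAP_THRESHOLD else None
```

Returning the method name rather than a bare boolean leaves room to record match provenance in a label, as the commit describes.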
* fix(scoring): exclude trivial consistency from single-run composites (H-3)

With runs=1, all three C4 consistency sub-metrics trivially default to 1.0. That is not measured consistency, just free credit (typically 10% of the composite) handed to every model equally, inflating absolute scores.

When runs < 2 the dispatcher now stamps a consistency_trivial scorer flag and computes the composite with the consistency dimension excluded and the remaining weights renormalized, keeping single-run composites on the same [0, 1] scale. The per-dimension consistency score is still reported for schema completeness. Multi-run behavior is unchanged.

* fix(adapter): raise max_tokens to 4096 + surface truncation via finish_reason (H-4)

DEFAULT_MAX_TOKENS=1024 routinely truncated the multi-section prose analyses the prompt asks for. Limitations sections come last, so the cut silently deflated calibration_limitation_handling, and nothing recorded that truncation had happened.

- DEFAULT_MAX_TOKENS 1024 -> 4096 (GRADE_OPENROUTER_MAX_TOKENS env var and constructor override unchanged)
- The adapter now records the provider's finish_reason in runtime_metadata; output_schema.json gains the optional nullable field
- The dispatcher stamps a "truncated_output" scorer flag when any run finishes with "length", making truncation visible in the scorecard

* feat(adapter): candidate-side reasoning effort via @<effort> model suffix (H-5)

The launch plan runs GPT-5.5 at four reasoning-effort levels, but the adapter had no way to set an effort and no way to keep the four result sets distinct.

- OpenRouterAdapter accepts reasoning_effort= and an inline @<effort> model suffix (e.g. "openai/gpt-5.5@xhigh"); the suffix is stripped from the API slug but kept in the reported model_id so each effort level gets its own leaderboard row
- Reasoning payload forwarded as {"reasoning": {"effort": ...}}
- Reasoning runs default to a 16384-token budget (DEFAULT_REASONING_MAX_TOKENS): reasoning tokens share the budget with the visible analysis, so the 4096 prose default would yield empty responses at high effort
- @<effort> syntax documented in --model (runner.cli) and --models (scripts.run_models) help; the sweep is now --models openai/gpt-5.5@xhigh,...,openai/gpt-5.5@low

---------

Co-authored-by: jonathan bechtel <jonathanbechtel@gmail.com>
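Two of the behaviors above, the single-run composite from H-3 and the @<effort> suffix from H-5, can be sketched together. The function names and the example dimension `quality` are hypothetical; the token defaults, the suffix syntax, and the "exclude consistency and renormalize when runs < 2" rule come from the commit messages.

```python
from typing import Dict, Optional, Tuple

DEFAULT_MAX_TOKENS = 4096
DEFAULT_REASONING_MAX_TOKENS = 16384


def split_effort_suffix(model: str) -> Tuple[str, Optional[str]]:
    """Split "slug@effort" into (api_slug, effort); effort is None if absent."""
    if "@" not in model:
        return model, None
    slug, effort = model.rsplit("@", 1)
    return slug, effort or None


def default_max_tokens(effort: Optional[str]) -> int:
    """Reasoning runs get the larger budget, since reasoning tokens share it."""
    return DEFAULT_REASONING_MAX_TOKENS if effort else DEFAULT_MAX_TOKENS


def composite(scores: Dict[str, float], weights: Dict[str, float], runs: int) -> float:
    """Weighted composite; drops consistency and renormalizes when runs < 2."""
    active = dict(weights)
    if runs < 2:
        # Single-run consistency is trivially 1.0, so it earns no credit.
        active.pop("consistency", None)
    total = sum(active.values())
    return sum(scores[dim] * w for dim, w in active.items()) / total


model_id = "openai/gpt-5.5@xhigh"          # kept as the reported model_id
api_slug, effort = split_effort_suffix(model_id)
print(api_slug, effort, default_max_tokens(effort))
```

Dividing by the sum of the remaining weights is what keeps a single-run composite on the same [0, 1] scale as a multi-run one.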

pearleng-atlantis · 2026-06-10T21:06:49Z

Error: This repo is not allowlisted for Atlantis.

JonathanBechtel and others added 5 commits June 10, 2026 16:18

Merge origin/main (PR #36 docs squash) — keep updated findings doc wi…

7669ad2

…th FIXED statuses Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jonathanbbechtel merged commit f40c18f into main Jun 10, 2026
2 checks passed

jonathanbbechtel deleted the fix/scoring-c1-limitations-c2-subset branch June 10, 2026 21:11

jonathanbbechtel merged 6 commits into main from fix/scoring-c1-limitations-c2-subset

2 participants
