
fix(scoring,judge,adapter): methodology review fixes — criticals C-1..C-3, mediums M-1/M-2, highs H-1..H-5#41

Merged: jonathanbbechtel merged 6 commits into main from fix/scoring-c1-limitations-c2-subset on Jun 10, 2026

@jonathanbbechtel

Collaborator

Summary

Lands the full methodology-review fix stack on main ahead of the official leaderboard run:

  • C-1: model registry corrected to live OpenRouter slugs + reasoning-effort passthrough
  • C-2/C-3 (plus M-6): null-judge guard, cross-family judge policy, rationale-then-score; no silent judge failures
  • M-1/M-2: C1 sees limitations; judge only scores C2-owned dimensions (halves judge spend)
  • H-1..H-5 (squashed via #40, "fix(scoring,adapter): close out high-severity methodology findings H-1..H-5"): grounding gated on claim context, non-numeric fact credit, consistency excluded from runs=1 composites, max_tokens 4096 + truncation surfacing, candidate reasoning effort via @<effort> model suffix

All findings in docs/methodology_review_findings.md are now FIXED or DISCLOSED.

🤖 Generated with Claude Code

JonathanBechtel and others added 5 commits June 10, 2026 16:18
…e/contamination disclosures

Adds docs/methodology_review_findings.md — the full pre-launch methodology
review of the evaluation harness (14 findings across critical/high/medium
with fix status and file references) — and extends docs/methodology.md with
three disclosure sections it calls for:

- Statistical precision: 26 tasks, per-track N as low as 5; per-track gaps
  within plausible noise
- Judge scope: the LLM judge scores against gold material, not raw fixture
  CSVs; numbers are verified by the deterministic C1 scorer
- Benchmark contamination: fixtures regenerate per numbered benchmark
  version; scores comparable only within a grade_version

Also documents the cross-family judge-selection policy in the C2 scoring
bullet (Opus 4.8 judges all non-Claude candidates; GPT-5.5 xhigh judges
Claude candidates) with its two residual caveats.
…ning-effort passthrough

SEED_MODELS mapped shorthands to stale placeholder slugs — "gpt-5" actually
called openai/gpt-4o, "gemini-2.5-pro" called google/gemini-pro-1.5, and
"claude-opus-4-7" called anthropic/claude-opus-4.5. A leaderboard published
from these would attribute GPT-4o scores to GPT-5 (methodology review C-1).
The adapter default also used a dash-form slug (claude-sonnet-4-5) that does
not exist on OpenRouter — live slugs use dots.

- Replace SEED_MODELS with the launch lineup, every slug verified against
  the live /api/v1/models endpoint on 2026-06-10: Claude Opus 4.8 /
  Sonnet 4.6 / Haiku 4.5, GPT-5.5, GPT-OSS 120B, Gemini 3.5 Flash,
  Gemini 3.1 Pro (preview), Nemotron 3 Ultra
- Default adapter model -> anthropic/claude-sonnet-4.6
- post_chat_completion gains an optional reasoning_effort parameter
  (forwarded as {"reasoning": {"effort": ...}}) for reasoning-capable
  models; needed by the GPT-5.5 judge and the candidate-side effort sweep
- Update stale slugs in README, quickstart, and adapter docstrings
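
A minimal sketch of the reasoning-effort passthrough described above. The helper name `build_chat_payload` and its signature are hypothetical; only the `{"reasoning": {"effort": ...}}` shape comes from the commit message.

```python
from typing import Any, Optional


def build_chat_payload(
    model: str,
    messages: list[dict[str, str]],
    reasoning_effort: Optional[str] = None,
) -> dict[str, Any]:
    """Build a chat-completions request body (hypothetical helper)."""
    payload: dict[str, Any] = {"model": model, "messages": messages}
    # Attach the reasoning block only when an effort was requested, so
    # requests for non-reasoning models are left unchanged.
    if reasoning_effort is not None:
        payload["reasoning"] = {"effort": reasoning_effort}
    return payload
```

Leaving the key out entirely when no effort is given, rather than sending a null, keeps existing callers' requests byte-for-byte the same.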
…t failures

Closes out methodology review findings C-2, C-3, and M-6 — the judge could
silently produce meaningless or biased scores in three ways.

Null-judge guard (C-2): without --judge, the C2-owned dimensions (40% of the
composite) were a flat 0.5 placeholder for every model with nothing in the
result recording it.
- dispatcher stamps a "null_judge" flag into every task's scorer_flags when
  no live judge is configured
- runner.cli and scripts.run_models refuse the openrouter adapter without
  --judge unless --allow-null-judge is passed (stub/smoke runs unaffected)
- a judge-client construction failure in the batch runner now records the
  model as failed instead of silently downgrading to the null judge

Judge selection policy (C-3): no judge ever shares a model family with its
candidate, so house style cannot bias the judged dimensions. Claude Opus 4.8
(strongest available) judges every candidate except Claude-family models,
which GPT-5.5 at xhigh reasoning effort judges instead. Implemented in
select_judge_model(); both CLIs apply it per candidate and print the judge
in use; --judge-model overrides. The old default judge slug
(anthropic/claude-opus-4-5, dash form) did not exist on OpenRouter — every
live judge call would have failed.
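
The cross-family rule could be sketched roughly as follows. This is not the repo's `select_judge_model()`: the function name `select_judge`, the judge slugs, and the prefix-based family check are all assumptions made for illustration.

```python
from typing import Optional

# Assumed slugs, for illustration only; the real values live in the repo.
DEFAULT_JUDGE = "anthropic/claude-opus-4.8"
CLAUDE_CANDIDATE_JUDGE = "openai/gpt-5.5"
CLAUDE_CANDIDATE_JUDGE_EFFORT = "xhigh"


def select_judge(
    candidate_slug: str, override: Optional[str] = None
) -> tuple[str, Optional[str]]:
    """Return (judge_slug, reasoning_effort) for one candidate."""
    if override:
        # Mirrors the --judge-model override: the caller's choice wins.
        return override, None
    # Drop any "@<effort>" suffix before inspecting the provider prefix.
    base = candidate_slug.split("@", 1)[0]
    if base.startswith("anthropic/"):
        # Claude-family candidates are never judged by a Claude model.
        return CLAUDE_CANDIDATE_JUDGE, CLAUDE_CANDIDATE_JUDGE_EFFORT
    return DEFAULT_JUDGE, None
```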

Rationale-then-score (M-6): the judge now justifies in 2-3 sentences and
ends with a "SCORE: <float>" line instead of replying with a bare float.
Parsing prefers the last SCORE: line and falls back to the last float, so
numbers quoted in the justification are never mistaken for the verdict.
A non-parseable reply retries once, then raises JudgeScoreError — surfacing
in failures.json instead of silently scoring the candidate 0.0. Non-reasoning
judge max_tokens 16 -> 384; reasoning judges get 8192 (reasoning tokens share
the budget with the visible answer).
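
A sketch of the parsing order described above: last `SCORE:` line first, then last float, then an error after one retry. The regexes, the helper names, and the local `JudgeScoreError` stand-in are assumptions, not the repo's code.

```python
import re
from typing import Callable

_SCORE_LINE = re.compile(r"^\s*SCORE:\s*([0-9]*\.?[0-9]+)\s*$", re.MULTILINE)
_ANY_FLOAT = re.compile(r"[0-9]*\.?[0-9]+")


class JudgeScoreError(ValueError):
    """Stand-in for the repo's error type of the same name."""


def parse_judge_score(reply: str) -> float:
    # Prefer the last explicit verdict line over numbers in the rationale.
    score_lines = _SCORE_LINE.findall(reply)
    if score_lines:
        return float(score_lines[-1])
    # Fallback: the last number anywhere in the reply.
    floats = _ANY_FLOAT.findall(reply)
    if floats:
        return float(floats[-1])
    raise JudgeScoreError(f"no score found in judge reply: {reply!r}")


def judge_with_retry(ask: Callable[[], str], attempts: int = 2) -> float:
    """Call the judge up to `attempts` times; raise if no reply parses."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: JudgeScoreError | None = None
    for _ in range(attempts):
        try:
            return parse_judge_score(ask())
        except JudgeScoreError as exc:
            last_error = exc
    assert last_error is not None
    raise last_error
```

Raising instead of returning 0.0 is what lets a broken judge reply show up as a failure rather than as a low score.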
Closes out methodology review findings M-1 and M-2.

M-1: _score_c1_grounding never passed the output's limitations list to
score_facts, even though the scorer documents it as a fallback search
target. A gold-fact value stated only in a caveat ("only 977 students are
reflected in this snapshot") was scored as not found. Now passed through.

M-2: the C2 loop judged all six rubric dimensions per run, then three
(grounding_accuracy, calibration_limitation_handling, consistency) were
discarded and overridden by C1/C3/C4 — 390 wasted judge calls per model at
26 tasks x 5 runs. score_rubric gains a dimensions parameter (full-rubric
weight validation unchanged; unknown names raise before any judge call) and
the dispatcher requests only the three C2-owned dimensions, halving judge
cost. Default score_rubric behavior (all six) is unchanged for existing
callers.
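
The call-count arithmetic and the validate-before-calling behavior can be sketched as below. The three `c2_*` dimension names and all weights are placeholders (the message names only the three overridden dimensions), and `select_dimensions` is a hypothetical helper, not `score_rubric` itself.

```python
from typing import Optional, Sequence

# Illustrative rubric: the c2_* names and every weight are placeholders.
RUBRIC_WEIGHTS = {
    "grounding_accuracy": 0.3,
    "calibration_limitation_handling": 0.2,
    "consistency": 0.1,
    "c2_dim_a": 0.2,
    "c2_dim_b": 0.1,
    "c2_dim_c": 0.1,
}


def select_dimensions(
    weights: dict[str, float], dimensions: Optional[Sequence[str]] = None
) -> list[str]:
    """Pick the dimensions to judge, failing before any judge call is made."""
    # Weights are validated over the full rubric, not the requested subset.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    if dimensions is None:
        return list(weights)  # default: all six, as before
    unknown = [d for d in dimensions if d not in weights]
    if unknown:
        raise ValueError(f"unknown rubric dimensions: {unknown}")
    return list(dimensions)


def judge_calls(n_dimensions: int, tasks: int, runs: int) -> int:
    """One judge call per dimension, per task, per run."""
    return n_dimensions * tasks * runs
```

At 26 tasks and 5 runs, six dimensions cost `judge_calls(6, 26, 5)` = 780 calls and three cost 390, which is the halving quoted above.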
…1..H-5 (#40)

* fix(scoring): gate free-text numeric matches on claim context (H-1)

Grounding accuracy credited a gold fact when ANY number anywhere in the
output (up to 50 prose sentences, plus a /100 percent-normalized variant of
every candidate) fell within tolerance. With the leniency floors, an
unrelated "29.6%" anywhere in the response credited a gold count of 30 —
rewarding number-dense outputs regardless of relevance on the
highest-weighted dimension.

Free-text numeric matches (key_findings / limitations) now require the
containing sentence to share at least one content token with the gold claim
(stopwords and numbers excluded, trailing plural-s normalized, hyphenated
IDs like sch-001 preserved). An empty claim disables the gate.

structured_metrics matching deliberately stays key-agnostic: structured
values are deliberate model assertions, and key-name matching was previously
found too brittle against real model outputs — the spam guard targets prose
only.
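
A rough sketch of the prose gate described above, under stated assumptions: the stopword list, the tokenizer regex, and the function names are illustrative, and a claim with no content tokens is treated the same as an empty claim.

```python
import re

# Assumed stopword list; the repo's actual list is not shown in this PR.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "this", "that"}
)
# Keeps hyphenated IDs such as "sch-001" together as one token.
_TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
_NUMERIC = re.compile(r"[0-9]+(?:-[0-9]+)*")


def content_tokens(text: str) -> set[str]:
    tokens = set()
    for tok in _TOKEN.findall(text.lower()):
        if tok in STOPWORDS or _NUMERIC.fullmatch(tok):
            continue  # stopwords and bare numbers carry no claim context
        if "-" not in tok and len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]  # crude trailing plural-s normalization
        tokens.add(tok)
    return tokens


def passes_claim_gate(claim: str, sentence: str) -> bool:
    """True if a numeric match found in `sentence` may credit `claim`."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return True  # empty claim disables the gate
    return bool(claim_tokens & content_tokens(sentence))
```

With this gate, a stray "29.6%" in a sentence about late responses shares no content token with a claim about school counts, so it cannot credit that fact.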

* fix(scoring): credit non-numeric facts via substring or token overlap (H-2)

Non-numeric gold facts were located by substring containment but then
re-scored with exact string equality — so a finding containing the claim
plus any other words scored 0.0, making non-numeric facts near-universal
misses unless the model echoed the claim verbatim. The module docstring
promised token-overlap matching that was never implemented.

The located match is now scored directly: stage 1 is substring containment
(either direction, normalized); stage 2 credits paraphrases when >= 50% of
the claim's content tokens appear in the entry (TEXT_FACT_OVERLAP_THRESHOLD,
mirroring C3's claim-validation threshold). Match provenance is recorded in
the method label (text_match[key_findings+token_overlap] etc.).
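
The two-stage match might look roughly like this. Only the 50% threshold and the stage order come from the message; the normalization, the stopword list, and the function names are assumptions.

```python
import re
from typing import Optional

TEXT_FACT_OVERLAP_THRESHOLD = 0.5  # value taken from the commit message
_STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "to", "and", "is", "was"})


def _normalize(text: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    cleaned = re.sub(r"[^a-z0-9\s-]", " ", text.lower())
    return " ".join(cleaned.split())


def _tokens(normalized: str) -> set[str]:
    return {t for t in normalized.split() if t not in _STOPWORDS}


def match_text_fact(claim: str, entry: str) -> Optional[str]:
    """Return the match method ("substring" or "token_overlap"), else None."""
    c, e = _normalize(claim), _normalize(entry)
    # Stage 1: substring containment in either direction.
    if c and e and (c in e or e in c):
        return "substring"
    # Stage 2: enough of the claim's content tokens appear in the entry.
    claim_tokens = _tokens(c)
    if claim_tokens:
        overlap = len(claim_tokens & _tokens(e)) / len(claim_tokens)
        if overlap >= TEXT_FACT_OVERLAP_THRESHOLD:
            return "token_overlap"
    return None
```

Returning the method name rather than a bare boolean is what makes it possible to record provenance in a label such as `text_match[key_findings+token_overlap]`.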

* fix(scoring): exclude trivial consistency from single-run composites (H-3)

With runs=1, all three C4 consistency sub-metrics trivially default to 1.0 —
not measured consistency, just free credit (typically 10% of the composite)
handed to every model equally, inflating absolute scores.

When runs < 2 the dispatcher now stamps a consistency_trivial scorer flag
and computes the composite with the consistency dimension excluded and the
remaining weights renormalized, keeping single-run composites on the same
[0, 1] scale. The per-dimension consistency score is still reported for
schema completeness. Multi-run behavior is unchanged.
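
A minimal sketch of the renormalization, assuming the composite is a weighted mean of per-dimension scores. The function name, the weights, and the `c2_owned` bucket are illustrative; only the consistency share of 10% and the 40% C2 share echo figures quoted in this PR.

```python
def composite_score(
    scores: dict[str, float], weights: dict[str, float], runs: int
) -> float:
    """Weighted composite; drops trivial consistency when runs < 2."""
    active = dict(weights)
    if runs < 2:
        # Consistency is trivially 1.0 with a single run, so exclude it.
        active.pop("consistency", None)
    total = sum(active.values())
    # Dividing by the remaining weight keeps the result on [0, 1].
    return sum(scores[dim] * w for dim, w in active.items()) / total


# Illustrative weights and scores, not the repo's actual rubric.
WEIGHTS = {
    "grounding_accuracy": 0.3,
    "calibration_limitation_handling": 0.2,
    "c2_owned": 0.4,
    "consistency": 0.1,
}
SCORES = {
    "grounding_accuracy": 0.8,
    "calibration_limitation_handling": 0.6,
    "c2_owned": 0.5,
    "consistency": 1.0,
}
```

With these numbers, the single-run composite is 0.56 / 0.9, about 0.622, where the old behavior would have reported 0.66 by counting the free 1.0.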

* fix(adapter): raise max_tokens to 4096 + surface truncation via finish_reason (H-4)

DEFAULT_MAX_TOKENS=1024 routinely truncated the multi-section prose analyses
the prompt asks for. Limitations sections come last, so the cut silently
deflated calibration_limitation_handling — and nothing recorded that
truncation had happened.

- DEFAULT_MAX_TOKENS 1024 -> 4096 (GRADE_OPENROUTER_MAX_TOKENS env var and
  constructor override unchanged)
- The adapter now records the provider's finish_reason in runtime_metadata;
  output_schema.json gains the optional nullable field
- The dispatcher stamps a "truncated_output" scorer flag when any run
  finishes with "length", making truncation visible in the scorecard
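
The truncation flag reduces to a check over per-run finish reasons. A sketch, with a hypothetical helper name; only the `"length"` value and the `"truncated_output"` flag come from the message.

```python
from typing import Iterable, Optional


def truncation_flags(finish_reasons: Iterable[Optional[str]]) -> list[str]:
    """Return ["truncated_output"] if any run stopped on the token limit."""
    # finish_reason is nullable, so None simply never equals "length".
    if any(reason == "length" for reason in finish_reasons):
        return ["truncated_output"]
    return []
```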

* feat(adapter): candidate-side reasoning effort via @<effort> model suffix (H-5)

The launch plan runs GPT-5.5 at four reasoning-effort levels, but the
adapter had no way to set an effort and no way to keep the four result sets
distinct.

- OpenRouterAdapter accepts reasoning_effort= and an inline @<effort> model
  suffix (e.g. "openai/gpt-5.5@xhigh"); the suffix is stripped from the API
  slug but kept in the reported model_id so each effort level gets its own
  leaderboard row
- Reasoning payload forwarded as {"reasoning": {"effort": ...}}
- Reasoning runs default to a 16384-token budget
  (DEFAULT_REASONING_MAX_TOKENS) — reasoning tokens share the budget with
  the visible analysis, so the 4096 prose default would yield empty
  responses at high effort
- @<effort> syntax documented in --model (runner.cli) and --models
  (scripts.run_models) help; the sweep is now
  --models openai/gpt-5.5@xhigh,...,openai/gpt-5.5@low
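
A sketch of how the `@<effort>` suffix could be split while keeping the full string as the reported model ID. The `ModelSpec` type and `parse_model_spec` are hypothetical; the two token budgets are the defaults quoted in this PR.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MAX_TOKENS = 4096
DEFAULT_REASONING_MAX_TOKENS = 16384


@dataclass(frozen=True)
class ModelSpec:
    model_id: str  # reported on the leaderboard, suffix included
    api_slug: str  # sent to the API, suffix stripped
    reasoning_effort: Optional[str]
    max_tokens: int


def parse_model_spec(model: str) -> ModelSpec:
    """Split "provider/name@effort" into its API slug and effort."""
    slug, sep, effort = model.partition("@")
    if sep and not effort:
        raise ValueError(f"empty reasoning effort in {model!r}")
    effort_or_none = effort or None
    # Reasoning tokens share the budget with the visible answer, so
    # reasoning runs get the larger default.
    budget = DEFAULT_REASONING_MAX_TOKENS if effort_or_none else DEFAULT_MAX_TOKENS
    return ModelSpec(model, slug, effort_or_none, budget)
```

Because `model_id` keeps the suffix, `openai/gpt-5.5@xhigh` and `openai/gpt-5.5@low` stay distinct result sets even though both call the same API slug.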

---------

Co-authored-by: jonathan bechtel <jonathanbechtel@gmail.com>
@pearleng-atlantis

Error: This repo is not allowlisted for Atlantis.

…th FIXED statuses

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jonathanbbechtel jonathanbbechtel merged commit f40c18f into main Jun 10, 2026
2 checks passed
@jonathanbbechtel jonathanbbechtel deleted the fix/scoring-c1-limitations-c2-subset branch June 10, 2026 21:11

2 participants