fix(scoring,judge,adapter): methodology review fixes — criticals C-1..C-3, mediums M-1/M-2, highs H-1..H-5#41
Merged
Conversation
…e/contamination disclosures Adds docs/methodology_review_findings.md — the full pre-launch methodology review of the evaluation harness (14 findings across critical/high/medium with fix status and file references) — and extends docs/methodology.md with three disclosure sections it calls for: - Statistical precision: 26 tasks, per-track N as low as 5; per-track gaps within plausible noise - Judge scope: the LLM judge scores against gold material, not raw fixture CSVs; numbers are verified by the deterministic C1 scorer - Benchmark contamination: fixtures regenerate per numbered benchmark version; scores comparable only within a grade_version Also documents the cross-family judge-selection policy in the C2 scoring bullet (Opus 4.8 judges all non-Claude candidates; GPT-5.5 xhigh judges Claude candidates) with its two residual caveats.
…ning-effort passthrough
SEED_MODELS mapped shorthands to stale placeholder slugs — "gpt-5" actually
called openai/gpt-4o, "gemini-2.5-pro" called google/gemini-pro-1.5, and
"claude-opus-4-7" called anthropic/claude-opus-4.5. A leaderboard published
from these would attribute GPT-4o scores to GPT-5 (methodology review C-1).
The adapter default also used a dash-form slug (claude-sonnet-4-5) that does
not exist on OpenRouter — live slugs use dots.
- Replace SEED_MODELS with the launch lineup, every slug verified against
the live /api/v1/models endpoint on 2026-06-10: Claude Opus 4.8 /
Sonnet 4.6 / Haiku 4.5, GPT-5.5, GPT-OSS 120B, Gemini 3.5 Flash,
Gemini 3.1 Pro (preview), Nemotron 3 Ultra
- Default adapter model -> anthropic/claude-sonnet-4.6
- post_chat_completion gains an optional reasoning_effort parameter
(forwarded as {"reasoning": {"effort": ...}}) for reasoning-capable
models; needed by the GPT-5.5 judge and the candidate-side effort sweep
- Update stale slugs in README, quickstart, and adapter docstrings
…t failures Closes out methodology review findings C-2, C-3, and M-6 — the judge could silently produce meaningless or biased scores in three ways. Null-judge guard (C-2): without --judge, the C2-owned dimensions (40% of the composite) were a flat 0.5 placeholder for every model with nothing in the result recording it. - dispatcher stamps a "null_judge" flag into every task's scorer_flags when no live judge is configured - runner.cli and scripts.run_models refuse the openrouter adapter without --judge unless --allow-null-judge is passed (stub/smoke runs unaffected) - a judge-client construction failure in the batch runner now records the model as failed instead of silently downgrading to the null judge Judge selection policy (C-3): no judge ever shares a model family with its candidate, so house style cannot bias the judged dimensions. Claude Opus 4.8 (strongest available) judges every candidate except Claude-family models, which GPT-5.5 at xhigh reasoning effort judges instead. Implemented in select_judge_model(); both CLIs apply it per candidate and print the judge in use; --judge-model overrides. The old default judge slug (anthropic/claude-opus-4-5, dash form) did not exist on OpenRouter — every live judge call would have failed. Rationale-then-score (M-6): the judge now justifies in 2-3 sentences and ends with a "SCORE: <float>" line instead of replying with a bare float. Parsing prefers the last SCORE: line and falls back to the last float, so numbers quoted in the justification are never mistaken for the verdict. A non-parseable reply retries once, then raises JudgeScoreError — surfacing in failures.json instead of silently scoring the candidate 0.0. Non-reasoning judge max_tokens 16 -> 384; reasoning judges get 8192 (reasoning tokens share the budget with the visible answer).
Closes out methodology review findings M-1 and M-2.
M-1: _score_c1_grounding never passed the output's limitations list to
score_facts, even though the scorer documents it as a fallback search
target. A gold-fact value stated only in a caveat ("only 977 students are
reflected in this snapshot") was scored as not found. Now passed through.
M-2: the C2 loop judged all six rubric dimensions per run, then three
(grounding_accuracy, calibration_limitation_handling, consistency) were
discarded and overridden by C1/C3/C4 — 390 wasted judge calls per model at
26 tasks x 5 runs. score_rubric gains a dimensions parameter (full-rubric
weight validation unchanged; unknown names raise before any judge call) and
the dispatcher requests only the three C2-owned dimensions, halving judge
cost. Default score_rubric behavior (all six) is unchanged for existing
callers.
…1..H-5 (#40) * fix(scoring): gate free-text numeric matches on claim context (H-1) Grounding accuracy credited a gold fact when ANY number anywhere in the output (up to 50 prose sentences, plus a /100 percent-normalized variant of every candidate) fell within tolerance. With the leniency floors, an unrelated "29.6%" anywhere in the response credited a gold count of 30 — rewarding number-dense outputs regardless of relevance on the highest-weighted dimension. Free-text numeric matches (key_findings / limitations) now require the containing sentence to share at least one content token with the gold claim (stopwords and numbers excluded, trailing plural-s normalized, hyphenated IDs like sch-001 preserved). An empty claim disables the gate. structured_metrics matching deliberately stays key-agnostic: structured values are deliberate model assertions, and key-name matching was previously found too brittle against real model outputs — the spam guard targets prose only. * fix(scoring): credit non-numeric facts via substring or token overlap (H-2) Non-numeric gold facts were located by substring containment but then re-scored with exact string equality — so a finding containing the claim plus any other words scored 0.0, making non-numeric facts near-universal misses unless the model echoed the claim verbatim. The module docstring promised token-overlap matching that was never implemented. The located match is now scored directly: stage 1 is substring containment (either direction, normalized); stage 2 credits paraphrases when >= 50% of the claim's content tokens appear in the entry (TEXT_FACT_OVERLAP_THRESHOLD, mirroring C3's claim-validation threshold). Match provenance is recorded in the method label (text_match[key_findings+token_overlap] etc.). * fix(scoring): exclude trivial consistency from single-run composites (H-3) With runs=1, all three C4 consistency sub-metrics trivially default to 1.0 — not measured consistency, just free credit (typically 10% of the composite) handed to every model equally, inflating absolute scores. When runs < 2 the dispatcher now stamps a consistency_trivial scorer flag and computes the composite with the consistency dimension excluded and the remaining weights renormalized, keeping single-run composites on the same [0, 1] scale. The per-dimension consistency score is still reported for schema completeness. Multi-run behavior is unchanged. * fix(adapter): raise max_tokens to 4096 + surface truncation via finish_reason (H-4) DEFAULT_MAX_TOKENS=1024 routinely truncated the multi-section prose analyses the prompt asks for. Limitations sections come last, so the cut silently deflated calibration_limitation_handling — and nothing recorded that truncation had happened. - DEFAULT_MAX_TOKENS 1024 -> 4096 (GRADE_OPENROUTER_MAX_TOKENS env var and constructor override unchanged) - The adapter now records the provider's finish_reason in runtime_metadata; output_schema.json gains the optional nullable field - The dispatcher stamps a "truncated_output" scorer flag when any run finishes with "length", making truncation visible in the scorecard * feat(adapter): candidate-side reasoning effort via @<effort> model suffix (H-5) The launch plan runs GPT-5.5 at four reasoning-effort levels, but the adapter had no way to set an effort and no way to keep the four result sets distinct. - OpenRouterAdapter accepts reasoning_effort= and an inline @<effort> model suffix (e.g. "openai/gpt-5.5@xhigh"); the suffix is stripped from the API slug but kept in the reported model_id so each effort level gets its own leaderboard row - Reasoning payload forwarded as {"reasoning": {"effort": ...}} - Reasoning runs default to a 16384-token budget (DEFAULT_REASONING_MAX_TOKENS) — reasoning tokens share the budget with the visible analysis, so the 4096 prose default would yield empty responses at high effort - @<effort> syntax documented in --model (runner.cli) and --models (scripts.run_models) help; the sweep is now --models openai/gpt-5.5@xhigh,...,openai/gpt-5.5@low --------- Co-authored-by: jonathan bechtel <jonathanbechtel@gmail.com>
|
…th FIXED statuses Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lands the full methodology-review fix stack on main ahead of the official leaderboard run:
limitations; judge only scores C2-owned dimensions (halves judge spend)@<effort>model suffixAll findings in
docs/methodology_review_findings.mdare now FIXED or DISCLOSED.🤖 Generated with Claude Code