Feature Request: Add Judge Calibration Diagnostics for LLM-as-Judge Evaluation #2392

@elandesberg

Description

Hi OpenCompass team — thanks for shipping a really practical LLM eval platform.

OpenCompass already supports LLM-as-judge evaluation via GenericLLMEvaluator and "subjective evaluation" (Compare Mode / Score Mode), with bootstrap confidence intervals for Bradley-Terry rankings (compass_arena_bradley_terry.py). This provides excellent uncertainty quantification for model rankings.

One thing I think could be strengthened is calibration + valid uncertainty quantification specifically for judge quality:

  1. Calibrate judge outputs to the real target (human preference / expert judgment / downstream KPI proxy), and
  2. Report directional surrogacy metrics that validate whether judge rankings match human rankings.

Why this matters

In practice, raw judge scores / win-rates are an uncalibrated proxy and can:

  • invert rankings (Goodharting on the proxy),
  • drift across time / domains / prompt mixes,
  • produce misleading confidence intervals when the judge is miscalibrated.

Standard bootstrapping of judge outputs treats them as ground truth, which produces overconfident intervals when the judge itself is miscalibrated. Bootstrap CIs answer "How certain are we about model A > model B?" but don't answer "Does the judge's ranking match humans'?" — these are complementary questions.
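To make the overconfidence point concrete, here is a small self-contained simulation (the numbers and the percentile-bootstrap helper are illustrative, not OpenCompass code): resampling a biased judge's verdicts yields a tight interval around the judge's answer, no matter what humans would say.

```python
import random

random.seed(0)

# Hypothetical scenario: human labels actually favour model B, but a
# miscalibrated judge calls "A wins" 60% of the time anyway.
n = 500
judge_wins_a = [1 if random.random() < 0.60 else 0 for _ in range(n)]

def bootstrap_ci(samples, iters=2000, alpha=0.05):
    """Percentile-bootstrap CI for the mean of 0/1 verdicts."""
    means = []
    for _ in range(iters):
        resample = [random.choice(samples) for _ in samples]
        means.append(sum(resample) / len(resample))
    means.sort()
    return means[int(alpha / 2 * iters)], means[int((1 - alpha / 2) * iters) - 1]

lo, hi = bootstrap_ci(judge_wins_a)
# The interval sits well above 0.5, so the bootstrap "confidently" ranks
# A > B -- but resampling the judge's own outputs can never surface the
# judge's bias, so that confidence is spurious if humans prefer B.
print(f"bootstrap 95% CI for A's win-rate: [{lo:.3f}, {hi:.3f}]")
```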

One approach that addresses this is Causal Judge Evaluation (CJE), which we recently released. It treats the judge as a surrogate and uses a small "oracle slice" (human/expert labels or a higher-quality evaluator) to (i) test calibration stability and (ii) produce calibrated estimates + valid uncertainty.

Concrete proposal (minimal + optional)

Add an optional "calibrated judge" reporting path for LLM-as-judge / subjective eval that:

  • takes judge outputs over the full dataset (cheap)
  • takes oracle labels over a subset (expensive), where "oracle" just means your ground truth for this task — could be:
    • human preference (true subjective eval), or
    • a stricter rule-based evaluator / reference-based evaluator (objective tasks), or
    • a stronger model / ensemble (when that's the best available proxy)
  • fits a calibration / surrogate model and returns:
    • calibrated metric estimate (e.g., calibrated win-rate / quality score)
    • directional surrogacy metrics (Kendall's tau, Spearman's rho, Fleiss' kappa)
    • calibration diagnostics (reliability diagrams, agreement by category)
    • a diagnostic that flags when the judge is not stable enough to trust for that run
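A minimal pure-Python sketch of the last three bullets. The mean-shift correction here is a deliberately crude stand-in for the surrogate model CJE would actually fit, and the tau trust floor is a made-up threshold for illustration:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    conc = disc = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (conc - disc) / pairs

def calibrated_estimate(judge_full, judge_slice, oracle_slice):
    """Crude mean-shift calibration: correct the full-set judge mean by
    the judge-vs-oracle gap observed on the oracle slice. (A stand-in
    for the richer surrogate model CJE would actually fit.)"""
    gap = (sum(oracle_slice) / len(oracle_slice)
           - sum(judge_slice) / len(judge_slice))
    return sum(judge_full) / len(judge_full) + gap

# Toy per-model aggregates: judge win-rates vs. human win-rates on the slice.
judge_by_model = [0.62, 0.55, 0.48, 0.41]
oracle_by_model = [0.60, 0.57, 0.44, 0.40]
tau = kendall_tau(judge_by_model, oracle_by_model)
# Hypothetical trust floor: below this tau, flag the judge as unstable.
trustworthy = tau >= 0.5
print(f"kendall_tau={tau:.2f}, trustworthy={trustworthy}")
```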

Example: Suppose GPT-4 ranks Model A > Model B with 60% win rate, but when you validate on 200 human labels, humans prefer B > A. CJE would detect this miscalibration and either (a) produce a corrected estimate or (b) flag that the judge is too unstable to trust for this comparison.

This could be implemented either as:

Option A — post-processing step (lowest friction):

  • a small utility (e.g. opencompass/analysis/cje.py) that reads OpenCompass output artifacts (judge outputs + optional oracle file) and emits report_cje.csv alongside the existing report.csv.
  • Since GenericLLMEvaluator already outputs per-sample judge scores in the details field, this is technically feasible as a post-processing step.
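A sketch of what such a utility could look like. The file names and the per-sample schema are assumptions for illustration (judge details taken as sample_id → judge score, oracle file as a subset of sample_ids → ground-truth score); a real version would parse GenericLLMEvaluator's actual `details` format, and the mean-shift step would be replaced by a proper surrogate model:

```python
import csv
import json
from pathlib import Path

def run_cje_report(details_path, oracle_path, out_path="report_cje.csv"):
    """Read judge outputs + an optional oracle file, emit report_cje.csv."""
    details = json.loads(Path(details_path).read_text())
    oracle = json.loads(Path(oracle_path).read_text())

    # Oracle slice: samples with both a judge score and an oracle label.
    slice_ids = [sid for sid in oracle if sid in details]
    judge_slice = [details[sid] for sid in slice_ids]
    oracle_slice = [oracle[sid] for sid in slice_ids]

    # Mean-shift calibration: a crude stand-in for CJE's surrogate model.
    raw_mean = sum(details.values()) / len(details)
    gap = (sum(oracle_slice) / len(oracle_slice)
           - sum(judge_slice) / len(judge_slice))

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["metric", "value"])
        writer.writerow(["raw_judge_mean", f"{raw_mean:.4f}"])
        writer.writerow(["calibrated_mean", f"{raw_mean + gap:.4f}"])
        writer.writerow(["oracle_slice_size", len(slice_ids)])
    return out_path
```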

Option B — evaluator wrapper (deeper integration):

  • a new evaluator/wrapper that runs:
    • oracle_evaluator on a sampled calibration slice
    • GenericLLMEvaluator (or existing subjective judge evaluator) on all samples
    • then computes calibrated estimate + uncertainty for the run
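A rough sketch of the wrapper's control flow. The class name, the callable interfaces, and the mean-shift calibration step are all placeholders; a real implementation would plug into OpenCompass's evaluator API and wrap GenericLLMEvaluator plus the oracle evaluator instead of bare callables:

```python
import random

class CalibratedJudgeEvaluator:
    """Hypothetical Option B wrapper: judge everything, oracle-label a
    random slice, return raw + calibrated estimates for the run."""

    def __init__(self, judge_eval, oracle_eval, slice_frac=0.1, seed=0):
        self.judge_eval = judge_eval    # cheap per-sample judge score
        self.oracle_eval = oracle_eval  # expensive per-sample oracle score
        self.slice_frac = slice_frac
        self.rng = random.Random(seed)

    def score(self, samples):
        # 1. Run the judge on all samples (cheap).
        judge = {i: self.judge_eval(s) for i, s in enumerate(samples)}
        # 2. Oracle-label a random calibration slice (expensive).
        k = max(1, int(len(samples) * self.slice_frac))
        slice_ids = self.rng.sample(range(len(samples)), k)
        oracle = {i: self.oracle_eval(samples[i]) for i in slice_ids}
        # 3. Mean-shift calibration (placeholder for CJE's surrogate model).
        gap = (sum(oracle.values()) / k
               - sum(judge[i] for i in slice_ids) / k)
        raw = sum(judge.values()) / len(judge)
        return {"raw_score": raw,
                "calibrated_score": raw + gap,
                "oracle_slice_size": k}
```

With a judge that is a constant +0.1 too generous, the calibrated score recovers the oracle's mean exactly, which is the behaviour the wrapper is meant to guarantee.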

Where it fits in OpenCompass today

  • You already highlight that subjective eval is expensive and that JudgeLLM is used "as a substitute for human assessors" — CJE is specifically about making that substitution statistically safe (and telling you when it's failing).
  • You already have a CascadeEvaluator pattern for blending rule-based + LLM evaluation. CJE would be a natural extension: use the "rule-based (or human) slice" to calibrate/validate the LLM-judge, not just override it.
  • You already have excellent bootstrap uncertainty for rankings. CJE would complement this by adding judge calibration diagnostics.

This would be especially valuable for users benchmarking multiple models on subjective tasks (e.g., creative writing, helpfulness) where human eval is the gold standard but too expensive to run comprehensively.

Happy to contribute

If this direction is interesting, I'm happy to:

  • propose an initial interface (config fields + output schema), and/or
  • open a PR for a post-processing utility (Option A) as a first step.
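For concreteness, the config surface might look something like the fragment below; every field name here is a suggestion, not an existing OpenCompass option:

```python
# Hypothetical config fields for the proposed calibrated-judge path.
cje_cfg = dict(
    enabled=True,
    oracle_source="human",            # or "rule_based" / "stronger_model"
    oracle_labels_path="oracle_labels.json",
    oracle_slice_frac=0.1,            # fraction of samples to oracle-label
    surrogacy_metrics=["kendall_tau", "spearman_rho"],
    tau_trust_floor=0.5,              # flag the run if tau falls below this
    output="report_cje.csv",          # emitted alongside report.csv
)
```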

If Option A looks good, I could have a draft PR within 2 weeks.

Questions for maintainers:

  1. Would you prefer a post-processing tool first, or a core evaluator wrapper?
  2. For subjective eval Compare Mode: do you already store per-question judge decisions in a stable JSON format that we can ingest?

(For reference: searches on Semantic Scholar and in this repo's GitHub issues didn't turn up existing work or an open issue on calibration / directional surrogacy for LLM-judge reporting.)
