Feature Request: Add Judge Calibration Diagnostics for LLM-as-Judge Evaluation #2392

@elandesberg

Description

Hi OpenCompass team — thanks for shipping a really practical LLM eval platform.

OpenCompass already supports LLM-as-judge evaluation via GenericLLMEvaluator and "subjective evaluation" (Compare Mode / Score Mode), with bootstrap confidence intervals for Bradley-Terry rankings (compass_arena_bradley_terry.py). This provides excellent uncertainty quantification for model rankings.

One thing I think could be strengthened is calibration + valid uncertainty quantification specifically for judge quality:

  1. Calibrate judge outputs to the real target (human preference / expert judgment / downstream KPI proxy), and
  2. Report directional surrogacy metrics that validate whether judge rankings match human rankings.

Why this matters

In practice, raw judge scores / win-rates are an uncalibrated proxy and can:

  • invert rankings (Goodharting on the proxy),
  • drift across time / domains / prompt mixes,
  • produce misleading confidence intervals when the judge is miscalibrated.

Standard bootstrapping of judge outputs treats them as ground truth, which produces overconfident intervals when the judge itself is miscalibrated. Bootstrap CIs answer "How certain are we about model A > model B?" but don't answer "Does the judge's ranking match humans'?" — these are complementary questions.
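To make the overconfidence point concrete, here is a small self-contained simulation (the numbers and the percentile-bootstrap helper are illustrative, not OpenCompass code): resampling a biased judge's verdicts yields a tight interval around the judge's answer, no matter what humans would say.

```python
import random

random.seed(0)

# Hypothetical scenario: human labels actually favour model B, but a
# miscalibrated judge calls "A wins" 60% of the time anyway.
n = 500
judge_wins_a = [1 if random.random() < 0.60 else 0 for _ in range(n)]

def bootstrap_ci(samples, iters=2000, alpha=0.05):
    """Percentile-bootstrap CI for the mean of 0/1 verdicts."""
    means = []
    for _ in range(iters):
        resample = [random.choice(samples) for _ in samples]
        means.append(sum(resample) / len(resample))
    means.sort()
    return means[int(alpha / 2 * iters)], means[int((1 - alpha / 2) * iters) - 1]

lo, hi = bootstrap_ci(judge_wins_a)
# The interval sits well above 0.5, so the bootstrap "confidently" ranks
# A > B -- but resampling the judge's own outputs can never surface the
# judge's bias, so that confidence is spurious if humans prefer B.
print(f"bootstrap 95% CI for A's win-rate: [{lo:.3f}, {hi:.3f}]")
```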

One approach that addresses this is Causal Judge Evaluation (CJE), which we recently released. It treats the judge as a surrogate and uses a small "oracle slice" (human/expert labels or a higher-quality evaluator) to (i) test calibration stability and (ii) produce calibrated estimates + valid uncertainty.

Concrete proposal (minimal + optional)

Add an optional "calibrated judge" reporting path for LLM-as-judge / subjective eval that:

  • takes judge outputs over the full dataset (cheap)
  • takes oracle labels over a subset (expensive), where "oracle" just means your ground truth for this task — could be:
    • human preference (true subjective eval), or
    • a stricter rule-based evaluator / reference-based evaluator (objective tasks), or
    • a stronger model / ensemble (when that's the best available proxy)
  • fits a calibration / surrogate model and returns:
    • calibrated metric estimate (e.g., calibrated win-rate / quality score)
    • directional surrogacy metrics (Kendall's tau, Spearman's rho, Fleiss' kappa)
    • calibration diagnostics (reliability diagrams, agreement by category)
    • a diagnostic that flags when the judge is not stable enough to trust for that run
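A minimal pure-Python sketch of the last three bullets. The mean-shift correction here is a deliberately crude stand-in for the surrogate model CJE would actually fit, and the tau trust floor is a made-up threshold for illustration:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    conc = disc = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (conc - disc) / pairs

def calibrated_estimate(judge_full, judge_slice, oracle_slice):
    """Crude mean-shift calibration: correct the full-set judge mean by
    the judge-vs-oracle gap observed on the oracle slice. (A stand-in
    for the richer surrogate model CJE would actually fit.)"""
    gap = (sum(oracle_slice) / len(oracle_slice)
           - sum(judge_slice) / len(judge_slice))
    return sum(judge_full) / len(judge_full) + gap

# Toy per-model aggregates: judge win-rates vs. human win-rates on the slice.
judge_by_model = [0.62, 0.55, 0.48, 0.41]
oracle_by_model = [0.60, 0.57, 0.44, 0.40]
tau = kendall_tau(judge_by_model, oracle_by_model)
# Hypothetical trust floor: below this tau, flag the judge as unstable.
trustworthy = tau >= 0.5
print(f"kendall_tau={tau:.2f}, trustworthy={trustworthy}")
```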

Example: Suppose GPT-4 ranks Model A > Model B with 60% win rate, but when you validate on 200 human labels, humans prefer B > A. CJE would detect this miscalibration and either (a) produce a corrected estimate or (b) flag that the judge is too unstable to trust for this comparison.

This could be implemented either as:

Option A — post-processing step (lowest friction):

  • a small utility (e.g. opencompass/analysis/cje.py) that reads OpenCompass output artifacts (judge outputs + optional oracle file) and emits report_cje.csv alongside the existing report.csv.
  • Since GenericLLMEvaluator already outputs per-sample judge scores in the details field, this is technically feasible as a post-processing step.
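A sketch of what such a utility could look like. The file names and the per-sample schema are assumptions for illustration (judge details taken as sample_id → judge score, oracle file as a subset of sample_ids → ground-truth score); a real version would parse GenericLLMEvaluator's actual `details` format, and the mean-shift step would be replaced by a proper surrogate model:

```python
import csv
import json
from pathlib import Path

def run_cje_report(details_path, oracle_path, out_path="report_cje.csv"):
    """Read judge outputs + an optional oracle file, emit report_cje.csv."""
    details = json.loads(Path(details_path).read_text())
    oracle = json.loads(Path(oracle_path).read_text())

    # Oracle slice: samples with both a judge score and an oracle label.
    slice_ids = [sid for sid in oracle if sid in details]
    judge_slice = [details[sid] for sid in slice_ids]
    oracle_slice = [oracle[sid] for sid in slice_ids]

    # Mean-shift calibration: a crude stand-in for CJE's surrogate model.
    raw_mean = sum(details.values()) / len(details)
    gap = (sum(oracle_slice) / len(oracle_slice)
           - sum(judge_slice) / len(judge_slice))

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["metric", "value"])
        writer.writerow(["raw_judge_mean", f"{raw_mean:.4f}"])
        writer.writerow(["calibrated_mean", f"{raw_mean + gap:.4f}"])
        writer.writerow(["oracle_slice_size", len(slice_ids)])
    return out_path
```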

Option B — evaluator wrapper (deeper integration):

  • a new evaluator/wrapper that runs:
    • oracle_evaluator on a sampled calibration slice
    • GenericLLMEvaluator (or existing subjective judge evaluator) on all samples
    • then computes calibrated estimate + uncertainty for the run
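A rough sketch of the wrapper's control flow. The class name, the callable interfaces, and the mean-shift calibration step are all placeholders; a real implementation would plug into OpenCompass's evaluator API and wrap GenericLLMEvaluator plus the oracle evaluator instead of bare callables:

```python
import random

class CalibratedJudgeEvaluator:
    """Hypothetical Option B wrapper: judge everything, oracle-label a
    random slice, return raw + calibrated estimates for the run."""

    def __init__(self, judge_eval, oracle_eval, slice_frac=0.1, seed=0):
        self.judge_eval = judge_eval    # cheap per-sample judge score
        self.oracle_eval = oracle_eval  # expensive per-sample oracle score
        self.slice_frac = slice_frac
        self.rng = random.Random(seed)

    def score(self, samples):
        # 1. Run the judge on all samples (cheap).
        judge = {i: self.judge_eval(s) for i, s in enumerate(samples)}
        # 2. Oracle-label a random calibration slice (expensive).
        k = max(1, int(len(samples) * self.slice_frac))
        slice_ids = self.rng.sample(range(len(samples)), k)
        oracle = {i: self.oracle_eval(samples[i]) for i in slice_ids}
        # 3. Mean-shift calibration (placeholder for CJE's surrogate model).
        gap = (sum(oracle.values()) / k
               - sum(judge[i] for i in slice_ids) / k)
        raw = sum(judge.values()) / len(judge)
        return {"raw_score": raw,
                "calibrated_score": raw + gap,
                "oracle_slice_size": k}
```

With a judge that is a constant +0.1 too generous, the calibrated score recovers the oracle's mean exactly, which is the behaviour the wrapper is meant to guarantee.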

Where it fits in OpenCompass today

  • You already highlight that subjective eval is expensive and that JudgeLLM is used "as a substitute for human assessors" — CJE is specifically about making that substitution statistically safe (and telling you when it's failing).
  • You already have a CascadeEvaluator pattern for blending rule-based + LLM evaluation. CJE would be a natural extension: use the "rule-based (or human) slice" to calibrate/validate the LLM-judge, not just override it.
  • You already have excellent bootstrap uncertainty for rankings. CJE would complement this by adding judge calibration diagnostics.

This would be especially valuable for users benchmarking multiple models on subjective tasks (e.g., creative writing, helpfulness) where human eval is the gold standard but too expensive to run comprehensively.

Happy to contribute

If this direction is interesting, I'm happy to:

  • propose an initial interface (config fields + output schema), and/or
  • open a PR for a post-processing utility (Option A) as a first step.
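For concreteness, the config surface might look something like the fragment below; every field name here is a suggestion, not an existing OpenCompass option:

```python
# Hypothetical config fields for the proposed calibrated-judge path.
cje_cfg = dict(
    enabled=True,
    oracle_source="human",            # or "rule_based" / "stronger_model"
    oracle_labels_path="oracle_labels.json",
    oracle_slice_frac=0.1,            # fraction of samples to oracle-label
    surrogacy_metrics=["kendall_tau", "spearman_rho"],
    tau_trust_floor=0.5,              # flag the run if tau falls below this
    output="report_cje.csv",          # emitted alongside report.csv
)
```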

If Option A looks good, I could have a draft PR within 2 weeks.

Questions for maintainers:

  1. Would you prefer a post-processing tool first, or a core evaluator wrapper?
  2. For subjective eval Compare Mode: do you already store per-question judge decisions in a stable JSON format that we can ingest?

(For reference: searches on Semantic Scholar and in this repo's GitHub issues didn't turn up existing work or an open issue on calibration / directional surrogacy for LLM-judge reporting.)
