Problem
Relying on a single LLM judge can introduce model-specific biases and scoring variability, which undermines the reliability and robustness of benchmark results.
Reference
This paper shows that using a panel of LLMs as judges outperforms a single LLM judge: https://arxiv.org/abs/2404.18796
Solution
Add support for multiple LLM judges (e.g. DeepSeek, Qwen, Gemma, Granite) and aggregate their scores: a majority vote when the score is binary, an average when the score is continuous. Also report an inter-judge agreement metric such as Cohen's kappa (two judges) or Fleiss' kappa (three or more) to give a measure of confidence in the aggregated score.
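A minimal sketch of the proposed aggregation and agreement reporting, assuming each judge returns either a binary verdict (0/1) or a continuous score. The function names (`aggregate_binary`, `aggregate_continuous`, `fleiss_kappa`) are hypothetical, not an existing API, and Fleiss' kappa is computed from the standard formula rather than pulled from a library:

```python
from collections import Counter
from statistics import mean

def aggregate_binary(votes: list[int]) -> int:
    """Majority vote over binary judge verdicts (use an odd-sized panel to avoid ties)."""
    return Counter(votes).most_common(1)[0][0]

def aggregate_continuous(scores: list[float]) -> float:
    """Average over continuous judge scores."""
    return mean(scores)

def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of judges that
    assigned item i to category j (every row sums to the panel size)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    # Proportion of all assignments that fall into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Per-item agreement: fraction of judge pairs that agree on item i.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = mean(p_i)              # observed agreement
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Three hypothetical judges (e.g. DeepSeek, Qwen, Granite) on four items.
print(aggregate_binary([1, 1, 0]))            # -> 1
print(aggregate_continuous([0.8, 0.6, 0.7]))  # -> ~0.7

# counts[i] = [judges voting 0 on item i, judges voting 1 on item i]
print(fleiss_kappa([[0, 3], [1, 2], [3, 0], [2, 1]]))  # -> ~0.33
```

Cohen's kappa covers the two-judge case; Fleiss' kappa generalizes the same chance-corrected agreement to panels of three or more judges, which matches the multi-judge setup above.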