
Support multi-LLM judging for more robust evaluation #281

@AstroBoy1

Description


Problem

Relying on a single LLM judge can introduce model-specific biases and scoring variability, which may affect the reliability and robustness of benchmark results.

Reference

This paper shows that a panel of LLMs used for judging outperforms a single LLM judge: https://arxiv.org/abs/2404.18796

Solution

Add support for multiple LLM judges (e.g., DeepSeek, Qwen, Gemma, Granite) and aggregate their scores: a majority vote when scores are binary, or an average when scores are continuous. Report an agreement metric such as Cohen's kappa (two judges) or Fleiss' kappa (three or more) to give a measure of confidence in the aggregated result.
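
A minimal sketch of the proposed aggregation and agreement reporting, in pure Python. The function names (`aggregate_scores`, `fleiss_kappa`) and data layout are hypothetical, not part of any existing codebase:

```python
from collections import Counter
from statistics import mean

def aggregate_scores(scores, binary=True):
    """Combine per-judge scores for one sample: majority vote for
    binary labels, arithmetic mean for continuous scores."""
    if binary:
        return Counter(scores).most_common(1)[0][0]
    return mean(scores)

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa over `ratings`: a list of per-item lists, each
    holding one categorical label per judge (same judge count per item)."""
    n_items = len(ratings)
    n_judges = len(ratings[0])
    # n_ij: how many judges put item i into category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Mean per-item agreement P_bar
    p_bar = mean(
        (sum(c * c for c in row) - n_judges) / (n_judges * (n_judges - 1))
        for row in counts
    )
    # Chance agreement P_e from the marginal category proportions
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_judges)) ** 2
        for j in range(len(categories))
    )
    return (p_bar - p_e) / (1 - p_e)

# Usage: three hypothetical binary judges scoring four samples
judge_labels = [[1, 1, 0], [1, 1, 1], [0, 0, 0], [1, 0, 0]]
final_scores = [aggregate_scores(row) for row in judge_labels]  # [1, 1, 0, 0]
kappa = fleiss_kappa(judge_labels, categories=[0, 1])           # ~0.33
```

An odd number of judges avoids ties in the binary case; a tie-breaking rule (or falling back to the mean score) would be needed for even panels.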
